How do I iterate over a list from A to Z, scrape the data, and convert it to a DataFrame?

Question

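I am building a scraper to collect data on German insurance companies. The source is a list of insurers that is organized alphabetically from A to Z.

Our members:

https://www.gdv.de/gdv/der-gdv/unsere-mitglieder is the overview page, with 478 results.

For the letter A: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=A
For the letter B: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=B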

And so on for the remaining letters. For reference, here is an example of an individual company page:
https://www.gdv.de/gdv/der-gdv/unsere-mitglieder/ba-die-bayerische-allgemeine-versicherung-ag-47236

We need to get the contact data and the address of each of these companies.
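The per-letter URL pattern shown above can be generated rather than typed out. This is a minimal sketch; `BASE_URL` and `build_letter_urls` are names introduced here for illustration, and it assumes the listing only uses the plain letters A to Z (whether the site also has pages for digits or umlauts is not something this post confirms).

```python
import string

# Overview page from the post; the ?letter= query parameter is taken from the example URLs.
BASE_URL = "https://www.gdv.de/gdv/der-gdv/unsere-mitglieder"

def build_letter_urls(base_url=BASE_URL):
    """Return one listing URL per uppercase letter, A through Z."""
    return [f"{base_url}?letter={letter}" for letter in string.ascii_uppercase]

letter_urls = build_letter_urls()
print(len(letter_urls))   # number of listing pages to visit
print(letter_urls[0])     # first listing page (letter A)
```

Using `string.ascii_uppercase` avoids the `ord`/`chr` arithmetic and makes the intent of the loop easier to read.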

I think this task can be handled by a small scraper that uses Requests to send the HTTP requests and BeautifulSoup (bs4) to parse the HTML. It should extract the contact data and address from the URLs given above and put everything into a DataFrame.

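First, we define a function scrape_insurance_company that takes the URL of a letter page as input, sends an HTTP GET request, and uses BeautifulSoup to find the links to the individual companies. A second function, scrape_company_data, then extracts the contact data and address from each company page.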
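The extraction step can be tried out separately from the network request by parsing a static HTML string. The sketch below does that; `parse_company_page` and `SAMPLE_COMPANY_HTML` are hypothetical names, the sample markup is invented for illustration, and the `contact` and `address` class names are simply the ones used in the code in this post, not selectors verified against the real company pages.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration only; the real page may be structured differently.
SAMPLE_COMPANY_HTML = """
<html><body>
  <div class="address">Musterstrasse 1<br>12345 Musterstadt</div>
  <div class="contact">Tel.: 000 000000</div>
</body></html>
"""

def parse_company_page(html):
    """Extract contact data and address from one company page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    contact = soup.find("div", class_="contact")
    address = soup.find("div", class_="address")
    return {
        # get_text joins the text fragments with a space and strips each one
        "Contact Data": contact.get_text(" ", strip=True) if contact else None,
        "Address": address.get_text(" ", strip=True) if address else None,
    }

record = parse_company_page(SAMPLE_COMPANY_HTML)
print(record)
```

Keeping the parsing in its own function means the selectors can be checked against saved HTML without hitting the site on every run.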

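scrape_company_data returns a dictionary containing the extracted data. Because we need to cover every letter from A to Z, we iterate over the letter URLs and call the function for each one to collect the data. Afterwards, we use Pandas to organize the data into a DataFrame.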
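The final step, turning a list of per-company dictionaries into a DataFrame, can be sketched on its own. The records below are made-up placeholder values, not data from the site; the column names match the dictionary keys used in this post.

```python
import pandas as pd

COLUMNS = ["Contact Data", "Address"]

# Placeholder records standing in for the scraped dictionaries.
records = [
    {"Contact Data": "Tel.: 000 000000", "Address": "Musterstrasse 1, 12345 Musterstadt"},
    {"Contact Data": None, "Address": "Beispielweg 2, 54321 Beispielstadt"},
]

# Passing columns explicitly keeps the column order stable,
# even when the list of records turns out to be empty.
df = pd.DataFrame(records, columns=COLUMNS)
empty_df = pd.DataFrame([], columns=COLUMNS)

print(df.shape)
print(empty_df.empty)
```

Checking `df.empty` before writing the CSV is a cheap way to notice that nothing was collected.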

Note: I am running this in Google Colab:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_insurance_company(url):
    # Send a GET request to the URL
    response = requests.get(url)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Find all the links to insurance companies
        company_links = soup.find_all('a', class_='entry-title')
        
        # List to store the data for all insurance companies
        all_data = []
        
        # Iterate through each company link
        for link in company_links:
            company_url = link['href']
            company_data = scrape_company_data(company_url)
            if company_data:
                all_data.append(company_data)
        
        return all_data
    else:
        print("Failed to fetch the page:", response.status_code)
        return None

def scrape_company_data(url):
    # Send a GET request to the URL
    response = requests.get(url)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # DEBUG: Print HTML content of the page
        print(soup.prettify())
        
        # Find the relevant elements containing contact data and address
        contact_info = soup.find('div', class_='contact')
        address_info = soup.find('div', class_='address')
        
        # Extract contact data and address if found
        contact_data = contact_info.text.strip() if contact_info else None
        address = address_info.text.strip() if address_info else None
        
        return {'Contact Data': contact_data, 'Address': address}
    else:
        print("Failed to fetch the page:", response.status_code)
        return None

# list to store the data for all insurance companies
all_insurance_data = []

# and now we iterate through the alphabet
for letter in range(ord('A'), ord('Z') + 1):
    letter_url = f"https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter={chr(letter)}"
    print("Scraping page:", letter_url)
    data = scrape_insurance_company(letter_url)
    if data:
        all_insurance_data.extend(data)

# subsequently we convert the data to a Pandas DataFrame
df = pd.DataFrame(all_insurance_data)

# and finally - we save the data to a CSV file
df.to_csv('insurance_data.csv', index=False)

print("Scraping completed and data saved to 'insurance_data.csv'.")

This is how things look at the moment. In Google Colab I see the following output:

Insurance information:

Scraping page: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=A
Scraping page: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=B
Scraping page: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=C
Scraping page: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=D

Scraping page: https://www.gdv.de/gdv/der-gdv/unsere-mitglieder?letter=Z

Scraping completed and data saved to 'insurance_data.csv'.

But the resulting list is still empty, and I am stuck at this point.
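One observation that may narrow this down: the output shown contains only the "Scraping page:" lines and no HTML dump from print(soup.prettify()), which suggests scrape_company_data was never called, so soup.find_all('a', class_='entry-title') probably matched nothing. Because an empty list is falsy, `if data:` then skips it without any message. A possible check is to select links by their URL pattern instead of a class name and to report how many were found. The sketch below assumes that company links start with the overview path followed by a slash, as in the example company URL above; `find_company_links` and the sample markup are hypothetical. If the listing is filled in by JavaScript, the links will not be in the fetched HTML at all and this approach will also come back empty.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

LISTING_URL = "https://www.gdv.de/gdv/der-gdv/unsere-mitglieder"

# Invented markup for illustration only; slugs and paths are placeholders.
SAMPLE_LISTING_HTML = """
<ul>
  <li><a href="/gdv/der-gdv/unsere-mitglieder/example-insurance-ag-00000">Example Insurance AG</a></li>
  <li><a href="/gdv/der-gdv/unsere-mitglieder?letter=B">B</a></li>
  <li><a href="/gdv/impressum">Impressum</a></li>
</ul>
"""

def find_company_links(html, page_url):
    """Return absolute, de-duplicated URLs that look like company detail pages."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative hrefs against the page they were found on
        absolute = urljoin(page_url, anchor["href"])
        if absolute.startswith(LISTING_URL + "/") and absolute not in links:
            links.append(absolute)
    return links

links = find_company_links(SAMPLE_LISTING_HTML, LISTING_URL + "?letter=A")
print(len(links), "company links found")
```

Printing the count per letter page makes an empty match visible immediately, and `urljoin` guards against relative hrefs, which `requests.get` would otherwise reject.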

HTTP requests, data extraction, HTML parsing, web scraping, pandas, DataFrame, insurance companies, iteration
