BeautifulSoup: iterating over 24 characters (a to z) fails: reducing complexity to get a first look at the dataset

Question

I have a list of insurance companies in Spain. The information is split into 24 sections on a website. You can see it at this link:

Insurers in Spain, full list: https://www.unespa.es/en/directory

The list is split across 24 pages, from https://www.unespa.es/en/directory/#A to https://www.unespa.es/en/directory/#Z

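My idea: I want to fetch the data from these pages with BS4 and requests and finally save it to a DataFrame. Python's BeautifulSoup (BS4) and requests seem like a suitable way to scrape the list from the website, and I think the following steps are needed: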

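a. First, import the necessary libraries: BeautifulSoup, requests, and pandas.
b. Then use requests to fetch the HTML content of each page of interest, that is, the pages from A to Z.
c. Next, parse the HTML content with BeautifulSoup.
d. Then extract the relevant information (the insurers' names) from the parsed HTML.
e. Finally, store the extracted data in a pandas DataFrame.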

But this does not work... the iteration from A to Z fails as well:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape insurers from a given URL
def scrape_insurers(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extracting insurer names
        insurers = [insurer.text.strip() for insurer in soup.find_all('h3')]
        return insurers
    else:
        print("Failed to retrieve data from", url)
        return []

# Define the base URL
base_url = "https://www.unespa.es/en/directory/"

# List to store all insurers
all_insurers = []

# Loop through each page (A to Z)
for char in range(65, 91):  # ASCII codes for A to Z
    page_url = f"{base_url}#{chr(char)}"
    insurers = scrape_insurers(page_url)
    all_insurers.extend(insurers)

# Convert the list of insurers to a pandas DataFrame
df = pd.DataFrame({'Insurer': all_insurers})

# Display the DataFrame
print(df.head())

# Save DataFrame to a CSV file
df.to_csv('insurers_spain.csv', index=False)

...and this is the result:

Failed to retrieve data from https://www.unespa.es/en/directory/#A
Failed to retrieve data from https://www.unespa.es/en/directory/#B
Failed to retrieve data from https://www.unespa.es/en/directory/#C
Failed to retrieve data from https://www.unespa.es/en/directory/#D
Failed to retrieve data from https://www.unespa.es/en/directory/#E

and so on for the remaining letters.
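One detail worth checking at this point: the part of a URL after "#" is a fragment, which is handled on the client side and is not sent to the server in the HTTP request. The 26 URLs built in the loop above therefore all request the same document, so any per-letter content would have to come from that single page (or may be filled in by JavaScript, which requests does not execute). A minimal sketch using only the standard library, with no network access; the variable names page_urls and requested are illustrative and not part of the original script:

```python
import string
from urllib.parse import urldefrag

base_url = "https://www.unespa.es/en/directory/"

# Build the same 26 URLs as the loop above, one per letter A to Z
page_urls = [f"{base_url}#{letter}" for letter in string.ascii_uppercase]

# urldefrag splits off the "#..." fragment; .url is the part without it,
# which is what an HTTP client actually asks the server for
requested = {urldefrag(url).url for url in page_urls}

print(len(page_urls), "URLs built")
print(len(requested), "distinct URL(s) after removing the fragment:", requested)
```

If this is right, looping over the letters adds nothing on the request side; one request to the base URL is enough to find out what the server returns.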

I think it would be easier to simplify the steps at first.

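I think it is best to start with a single URL only. That makes it easier to test what the request actually returns. Once that works, I can evaluate the response; I think BeautifulSoup can then be used to check for some common, specific fields. I should avoid doing three things in one step, where any of them could go wrong.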

So I did this, handling only the first character, A:

import requests
from bs4 import BeautifulSoup

# Function to scrape insurers from a given URL
def scrape_insurers(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extracting insurer names
        insurers = [insurer.text.strip() for insurer in soup.find_all('h3')]
        return insurers
    else:
        print("Failed to retrieve data from", url)
        return []

# Define the base URL
base_url = "https://www.unespa.es/en/directory/#"

# Define the character we want to fetch data for
char = 'A'

# Construct the URL for the specified character
url = base_url + char

# Fetch and print data for the specified character
insurers_char = scrape_insurers(url)
print(f"Insurers for character '{char}':")
print(insurers_char)

But look at the output here:

Failed to retrieve data from https://www.unespa.es/en/directory/#A
Insurers for character 'A':
[]
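The script only reports that the status code was not 200, not which code came back. Printing the status code and reason before any parsing would narrow this down. The following is a rough sketch under stated assumptions, not a fix: summarize_response and check_url are hypothetical helper names, the timeout of 10 seconds is an arbitrary choice, and only the standard status_code, reason, and content attributes of a requests response are used.

```python
def summarize_response(response):
    """Return a one-line summary of an HTTP response object."""
    # status_code, reason and content are attributes of requests.Response
    return f"{response.status_code} {response.reason}, {len(response.content)} bytes"


def check_url(url, get=None):
    """Fetch url and summarize the response; get defaults to requests.get."""
    if get is None:
        # Imported lazily so summarize_response can be tried without requests
        import requests
        get = requests.get
    return summarize_response(get(url, timeout=10))
```

Calling print(check_url("https://www.unespa.es/en/directory/")) would then show the actual status code and the size of the body, which says more than the generic "Failed to retrieve data" message.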

