BeautifulSoup: iterating over 24 characters (A to Z) fails: simplifying the problem to get a first look at the dataset
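I have a list of insurers in Spain. The information is split into 24 sections and collected on one website. See this link:
Insurers in Spain, full list: https://www.unespa.es/en/directory
The list is split into 24 pages, from https://www.unespa.es/en/directory/#A to https://www.unespa.es/en/directory/#Z
My idea is to fetch the data from these pages with BS4 and requests and finally save it to a DataFrame. I think the steps are:
a. Import the necessary libraries: BeautifulSoup, requests and pandas.
b. Use requests to get the HTML content of each page of interest, that is, the pages from A to Z.
c. Parse the HTML content with BeautifulSoup.
d. Extract the relevant information (the insurers' names) from the parsed HTML.
e. Store the extracted data in a pandas DataFrame.
But this does not work, and neither does the iteration from A to Z: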
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape insurers from a given URL
def scrape_insurers(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extracting insurer names
        insurers = [insurer.text.strip() for insurer in soup.find_all('h3')]
        return insurers
    else:
        print("Failed to retrieve data from", url)
        return []

# Define the base URL
base_url = "https://www.unespa.es/en/directory/"

# List to store all insurers
all_insurers = []

# Loop through each page (A to Z)
for char in range(65, 91):  # ASCII codes for A to Z
    page_url = f"{base_url}#{chr(char)}"
    insurers = scrape_insurers(page_url)
    all_insurers.extend(insurers)

# Convert the list of insurers to a pandas DataFrame
df = pd.DataFrame({'Insurer': all_insurers})

# Display the DataFrame
print(df.head())

# Save DataFrame to a CSV file
df.to_csv('insurers_spain.csv', index=False)
...and this is the result:
Failed to retrieve data from https://www.unespa.es/en/directory/#A
Failed to retrieve data from https://www.unespa.es/en/directory/#B
Failed to retrieve data from https://www.unespa.es/en/directory/#C
Failed to retrieve data from https://www.unespa.es/en/directory/#D
Failed to retrieve data from https://www.unespa.es/en/directory/#E
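and so on.
I think it would be easier to simplify the steps at first.
It seems best to start with a single URL only, which makes it easier to test what the request returns. Once that works, I can evaluate the response and use BeautifulSoup to check for a few common, specific fields. I should avoid doing three things in one step, since any of them could be the part that goes wrong.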
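A minimal sketch of that first check, assuming nothing about the site: collect the few response fields worth looking at before any parsing happens. The helper name `summarize_response` is made up for illustration, and the stand-in object below carries invented values (they are not what the site returned) so that the snippet runs without network access.

```python
from types import SimpleNamespace


def summarize_response(response):
    """Collect the fields worth checking before any parsing is attempted."""
    return {
        "status": response.status_code,
        "reason": response.reason,
        "content_type": response.headers.get("Content-Type"),
        "size": len(response.content),
    }


# Offline stand-in for a requests.Response; all values here are hypothetical
fake_response = SimpleNamespace(
    status_code=403,
    reason="Forbidden",
    headers={"Content-Type": "text/html"},
    content=b"<html></html>",
)

summary = summarize_response(fake_response)
print(summary)
```

With a real request this would be called as `summarize_response(requests.get(url))`. A status other than 200 would mean the problem lies in the request itself and not in the BeautifulSoup parsing.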
So I did this, starting with the first character, A:
import requests
from bs4 import BeautifulSoup

# Function to scrape insurers from a given URL
def scrape_insurers(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extracting insurer names
        insurers = [insurer.text.strip() for insurer in soup.find_all('h3')]
        return insurers
    else:
        print("Failed to retrieve data from", url)
        return []

# Define the base URL
base_url = "https://www.unespa.es/en/directory/#"

# Define the character we want to fetch data for
char = 'A'

# Construct the URL for the specified character
url = base_url + char

# Fetch and print data for the specified character
insurers_char = scrape_insurers(url)
print(f"Insurers for character '{char}':")
print(insurers_char)
But look at the output here:
Failed to retrieve data from https://www.unespa.es/en/directory/#A
Insurers for character 'A':
[]
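One thing that can be checked without touching the network is what those URLs look like to the server. The part after `#` is a fragment, which is resolved on the client side and is not sent in the HTTP request. A small sketch using only the standard library (the helper name `requested_url` is hypothetical):

```python
from string import ascii_uppercase
from urllib.parse import urldefrag

BASE_URL = "https://www.unespa.es/en/directory/"


def requested_url(url):
    """Return the URL with its fragment removed, which is all the server sees."""
    return urldefrag(url).url


# The same 26 URLs that the A to Z loop builds
letter_urls = [f"{BASE_URL}#{letter}" for letter in ascii_uppercase]

# Collapse them to the distinct targets that would actually be requested
distinct = {requested_url(u) for u in letter_urls}

print(len(letter_urls), "URLs in the loop,", len(distinct), "distinct request target(s)")
```

If this reasoning is right, every letter maps to the same request, so the A to Z loop cannot be what selects different data, and the failure would have to come from the request itself.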
1 Answer
Try this:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.unespa.es/en/directory/"

# Send a browser-like User-Agent with the request
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0"
}

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

data = []
for c in soup.select(".contact-item"):
    # Remove the <span> and <a> wrappers but keep their text
    for t in c.select("span, a"):
        t.unwrap()
    # Merge adjacent text nodes so each field is a single string
    c.smooth()
    # The first piece is the title; the rest are split into key and value at the first colon
    title, *other = c.get_text(separator="|||", strip=True).split("|||")
    data.append(
        {"Title": title, **{(s := d.split(":", maxsplit=1))[0]: s[1] for d in other}}
    )

df = pd.DataFrame(data)
print(df)
The output is:
   Title                                                        Tfno.                  Fax             Web                             Dirección                                                                           Email
0  A.M.A., AGRUPACIÓN MUTUAL ASEGURADORA, MUTUA DE SEGUROS APF  91 343 47 00           (91) 343 47 68  http://www.amaseguros.com       VÍA DE LOS POBLADOS, 3 28033 (MADRID)                                               NaN
1  ABANCA GENERALES DE SEGUROS Y REASEGUROS                     881920742 / 881920744  NaN             NaN                             AV. LINARES RIVAS 30, 3º 15005 A CORUÑA (A CORUÑA)                                  NaN
2  ABANCA VIDA Y PENSIONES DE SEGUROS Y REASEGUROS, S.A.        981 188 075            NaN             NaN                             AVENIDA DE LA MARINA, 1-3ª PLANTA 15001 A CORUÑA (A CORUÑA)                         NaN
3  ADMIRAL EUROPE COMPAÑIA DE SEGUROS S.A.U. (AECS)             NaN                    NaN             https://www.admiraleurope.com/  RODRÍGUEZ MARÍN, 61 - 1ª PLANTA 28016 MADRID (MADRID)                               NaN
4  AEGON ESPAÑA, SOCIEDAD ANÓNIMA DE SEGUROS Y REASEGUROS       91 563 62 22           NaN             http://www.aegon.es             VÍA DE LOS POBLADOS, 3 - EDIFICIO 4B - PARQUE EMPRESARIAL CRISTALIA 28033 (MADRID)  NaN
5  AGROPELAYO SOCIEDAD DE SEGUROS, SOCIEDAD ANÓNIMA             NaN                    NaN             NaN                             SANTA ENGRACIA, 67 - 69 28010 (MADRID)                                              NaN
...
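The dictionary comprehension inside the loop is compact, so here is the same key/value parsing pulled out into a small helper, as a sketch only. The helper name `parse_contact` is hypothetical, and the sample string is hand-written: it assumes each field arrives as `label: value` between `|||` separators and reuses values from the first row above. Unlike the one-liner, this version strips surrounding whitespace and skips pieces that contain no colon instead of raising an `IndexError`.

```python
def parse_contact(text, separator="|||"):
    """Split one flattened contact block into a dict of fields."""
    title, *other = text.split(separator)
    record = {"Title": title.strip()}
    for part in other:
        # partition() splits at the first colon only, so URLs keep their "://"
        key, found, value = part.partition(":")
        if not found:
            # No colon in this piece: nothing to use as a key, so skip it
            continue
        record[key.strip()] = value.strip()
    return record


# Hand-written sample in the assumed "label: value" layout
sample = (
    "A.M.A., AGRUPACIÓN MUTUAL ASEGURADORA, MUTUA DE SEGUROS APF"
    "|||Tfno.: 91 343 47 00"
    "|||Fax: (91) 343 47 68"
    "|||Web: http://www.amaseguros.com"
)

record = parse_contact(sample)
print(record)
```

Splitting at the first colon only is what keeps values such as `http://www.amaseguros.com` intact, which is the same reason the original code passes `maxsplit=1`.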