Iterating over 10k pages and extracting the data: European volunteering: a mini-scraper collecting opportunities from an EU website
I was looking for a public list of volunteering services across Europe: I don't need full addresses, just the name and the website. The data format can be XML, CSV or similar, with the fields name and country, plus any extra information that happens to be available, one record per country. By the way: volunteering in Europe is a great option for young people.
I found a page on a very comprehensive website; I want to collect the data on the European volunteering services listed on this European site:
See here: https://youth.europa.eu/go-abroad/volunteering/opportunities_en
@HedgeHog showed me the right approach to finding suitable selectors in this discussion: BeautifulSoup iterate over 10k pages and grab data, parse: European volunteering: a little scraper collecting opportunities from an EU website
# Extracting relevant data
title = soup.h1.get_text(', ', strip=True)
location = soup.select_one('p:has(i.fa-location-arrow)').get_text(', ', strip=True)
start_date, end_date = (e.get_text(strip=True) for e in soup.select('span.extra strong')[-2:])
However, there are several hundred volunteering opportunities there, stored on pages like these:
https://youth.europa.eu/solidarity/placement/39020_en
https://youth.europa.eu/solidarity/placement/38993_en
https://youth.europa.eu/solidarity/placement/38973_en
https://youth.europa.eu/solidarity/placement/38972_en
https://youth.europa.eu/solidarity/placement/38850_en
https://youth.europa.eu/solidarity/placement/38633_en
Idea:
I think it would be great to collect this data, e.g. with a scraper based on BS4 and requests, parse it, and then print it into a dataframe.
I think we could iterate over all the URLs:
placement/39020_en
placement/38993_en
placement/38973_en
placement/38850_en
Idea: I think we could iterate from 0 to 100000 and fetch every result stored in the placements. But there is no code behind this idea yet; in other words, I don't yet know how to implement iterating over such a large range.
For now, I think this basic approach is a fine starting point:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to generate placement URLs based on a range of IDs
def generate_urls(start_id, end_id):
    base_url = "https://youth.europa.eu/solidarity/placement/"
    urls = [base_url + str(id) + "_en" for id in range(start_id, end_id + 1)]
    return urls

# Function to scrape data from a single URL
def scrape_data(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.h1.get_text(', ', strip=True)
        location = soup.select_one('p:has(i.fa-location-arrow)').get_text(', ', strip=True)
        start_date, end_date = (e.get_text(strip=True) for e in soup.select('span.extra strong')[-2:])
        website_tag = soup.find("a", class_="btn__link--website")
        website = website_tag.get("href") if website_tag else None
        return {
            "Title": title,
            "Location": location,
            "Start Date": start_date,
            "End Date": end_date,
            "Website": website,
            "URL": url
        }
    else:
        print(f"Failed to fetch data from {url}. Status code: {response.status_code}")
        return None

# Set the range of placement IDs we want to scrape
start_id = 1
end_id = 100000

# Generate placement URLs
urls = generate_urls(start_id, end_id)

# Scrape data from all URLs
data = []
for url in urls:
    placement_data = scrape_data(url)
    if placement_data:
        data.append(placement_data)

# Convert data to DataFrame
df = pd.DataFrame(data)

# Print DataFrame
print(df)
This returns the following:
Failed to fetch data from https://youth.europa.eu/solidarity/placement/154_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/156_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/157_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/159_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/161_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/162_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/163_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/165_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/166_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/169_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/170_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/171_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/173_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/174_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/176_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/177_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/178_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/179_en. Status code: 404
Failed to fetch data from https://youth.europa.eu/solidarity/placement/180_en. Status code: 404
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-d6272ee535ef> in <cell line: 42>()
41 data = []
42 for url in urls:
---> 43 placement_data = scrape_data(url)
44 if placement_data:
45 data.append(placement_data)
<ipython-input-5-d6272ee535ef> in scrape_data(url)
16 title = soup.h1.get_text(', ', strip=True)
17 location = soup.select_one('p:has(i.fa-location-arrow)').get_text(', ', strip=True)
---> 18 start_date, end_date = (e.get_text(strip=True) for e in soup.select('span.extra strong')[-2:])
19 website_tag = soup.find("a", class_="btn__link--website")
20 website = website_tag.get("href") if website_tag else None
ValueError: not enough values to unpack (expected 2, got 0)
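If I read the traceback correctly, the unpacking fails on pages that answer with status 200 but are not regular placement pages: soup.select('span.extra strong') then finds fewer than two elements and the generator has nothing to unpack. A guarded variant of that line might look like this (just a sketch, not tested against every page):

# Fall back to None when the expected <span class="extra"><strong> tags are missing
strongs = [e.get_text(strip=True) for e in soup.select('span.extra strong')[-2:]]
start_date, end_date = strongs if len(strongs) == 2 else (None, None)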
Any ideas?
See the base URL: https://youth.europa.eu/go-abroad/volunteering/opportunities_en
2 Answers
Thanks for your help and advice, dear HedgeHog; here is a full solution that works correctly.
import requests
import pandas as pd

# Function to fetch data from the API and convert it into a DataFrame
def fetch_data():
    url = 'https://youth.europa.eu/d8/api/rest/eyp/v1/search_en?type=Opportunity&size=100&from=0&filters%5Bstatus%5D=open&filters%5Bdate_end%5D%5Boperator%5D=%3E%3D&filters%5Bdate_end%5D%5Bvalue%5D=2024-03-14&filters%5Bdate_end%5D%5Btype%5D=must'
    response = requests.get(url)
    if response.status_code == 200:
        data = response.json().get('hits').get('hits')
        df = pd.json_normalize(data)
        return df
    else:
        print(f"Failed to fetch data from the API. Status code: {response.status_code}")
        return None

# Fetch data from the API
df = fetch_data()

# Store DataFrame in a CSV file
if df is not None:
    df.to_csv('volunteering_opportunities.csv', index=False)
    print("DataFrame successfully stored in 'volunteering_opportunities.csv'")
It returns a CSV dataset of about 500 KB.
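One caveat: the hard-coded URL asks for size=100&from=0, so the CSV presumably only covers the first 100 hits. Assuming the endpoint honours the usual size/from offset paging (the parameter names suggest it, but I have not verified any limit), the fetch could be extended page by page, roughly like this:

import requests
import pandas as pd

base = 'https://youth.europa.eu/d8/api/rest/eyp/v1/search_en'
params = {
    'type': 'Opportunity',
    'size': 100,
    'filters[status]': 'open',
    'filters[date_end][operator]': '>=',
    'filters[date_end][value]': '2024-03-14',
    'filters[date_end][type]': 'must',
}

frames = []
for offset in range(0, 5000, 100):  # hard cap as a safety net against endless paging
    params['from'] = offset
    hits = requests.get(base, params=params).json().get('hits', {}).get('hits', [])
    if not hits:  # an empty page means we have everything
        break
    frames.append(pd.json_normalize(hits))

df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
df.to_csv('volunteering_opportunities.csv', index=False)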
Many thanks!
Instead of generating the IDs yourself, I would prefer the API approach and fetch the information directly as well-structured JSON, which can then be converted into a dataframe via pandas.json_normalize().
Example
import requests
import pandas as pd
data = requests.get('https://youth.europa.eu/d8/api/rest/eyp/v1/search_en?type=Opportunity&size=100&from=0&filters%5Bstatus%5D=open&filters%5Bdate_end%5D%5Boperator%5D=%3E%3D&filters%5Bdate_end%5D%5Bvalue%5D=2024-03-14&filters%5Bdate_end%5D%5Btype%5D=must').json().get('hits').get('hits')
pd.json_normalize(data)
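From there you can cut the frame down to the fields you actually need. The exact column names depend on the document structure, so the '_source.' prefix below is only an assumption (the usual Elasticsearch convention for hits.hits) to verify against df.columns:

# Continuing from the example above: keep only the document fields.
# The '_source.' prefix is an assumption -- inspect df.columns for the real names.
df = pd.json_normalize(data)
wanted = [c for c in df.columns if c.startswith('_source.')]
print(df[wanted])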
You can simply check your browser's network tab in the developer tools to see where the data comes from and how the results can be filtered via the payload:
type: Opportunity
size: 1000
from: 0
filters[status]: open
filters[date_end][operator]: >=
filters[date_end][value]: 2024-03-14
filters[date_end][type]: must
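Note that the payload above uses size: 1000 while the earlier example requested size=100. Rather than hand-encoding the query string, you can pass these keys to requests as a params dict and let it handle the URL encoding (including the brackets):

import requests

params = {
    'type': 'Opportunity',
    'size': 1000,
    'from': 0,
    'filters[status]': 'open',
    'filters[date_end][operator]': '>=',
    'filters[date_end][value]': '2024-03-14',
    'filters[date_end][type]': 'must',
}
# requests percent-encodes the brackets and the >= operator for us
data = requests.get('https://youth.europa.eu/d8/api/rest/eyp/v1/search_en',
                    params=params).json().get('hits').get('hits')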