How do I use BeautifulSoup to get information from all child URLs under a given URL?

Posted 2024-05-14 13:21:50


My use case is collecting all the email addresses from the child URLs (for example https://blueprint.uchicago.edu/organization/acacouncil) under the parent URL https://blueprint.uchicago.edu/organizations

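I know that email addresses have the general form xyz@xyz.com, so locating the addresses on a single URL is easy enough. When it comes to doing this for all the child URLs, though, I'm a bit lost.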
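For reference, the single-page step mentioned in the question could look roughly like the sketch below. It is only an illustration: the regular expression, the `extract_emails` helper, and the sample HTML string are assumptions made for this example, not something taken from the site, and the pattern is intentionally loose rather than a complete email validator.

```python
import re

# Deliberately simple pattern for addresses of the form name@domain.tld.
# This is an illustrative assumption, not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(html):
    """Return the unique email addresses found in html, in first-seen order."""
    seen = {}
    for match in EMAIL_RE.findall(html):
        # Compare case-insensitively, but keep the first spelling encountered.
        seen.setdefault(match.lower(), match)
    return list(seen.values())


# Hypothetical page fragment, used only to show the call.
sample = '<a href="mailto:xyz@xyz.com">xyz@xyz.com</a> or info@example.org'
print(extract_emails(sample))
```

The same function can be applied to the text of each child page once the list of child URLs is known; finding that list is the part the question is really about.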


1 answer
User
#1 · Posted 2024-05-14 13:21:50

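There is no point in using BeautifulSoup here, because you can get the data directly from the API. First, you need to know how many organizations there are, so that you can request all of them in one query. Then, using each organization's 'WebsiteKey' or id, you can iterate over the API to pull out the emails. You can store the results in a dictionary or a table, print them, and so on. I'm not sure what output you actually want.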

import requests
import pandas as pd

url = 'https://blueprint.uchicago.edu/api/discovery/search/organizations'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}

# First request: only used to read the total number of organizations.
payload = {
    'orderBy[0]': 'UpperName asc',
    'top': '',
    'filter': '',
    'query': '',
    'skip': '0',
}
data = requests.get(url, headers=headers, params=payload).json()
totalCount = data['@odata.count']

# Second request: ask for all organizations at once.
payload['top'] = str(totalCount)
data = requests.get(url, headers=headers, params=payload).json()

# Map each organization name to its id and WebsiteKey.
organizations = {}
for each in data['value']:
    organizations[each['Name']] = {'id': each['Id'], 'WebsiteKey': each['WebsiteKey']}

# Query each organization's detail endpoint by WebsiteKey and collect its email.
emails = {}
for name, each in organizations.items():
    websiteKey = each['WebsiteKey']
    org_url = 'https://blueprint.uchicago.edu/api/discovery/organization/bykey/%s' % websiteKey
    org_data = requests.get(org_url, headers=headers).json()
    emails[name] = org_data['email']
    print('%-70s: %s' % (name, org_data['email']))

# Save the results as a two-column CSV file.
df = pd.DataFrame(list(emails.items()), columns=['Organization', 'Email'])
df.to_csv('file.csv', index=False)

Output (the `emails` dictionary):

{'A Cappella Council': 'uchicagoacappella@gmail.com', 'ACLU University of Chicago Law Chapter': 'dhbabrams@uchicago.edu', 'Active Minds at the University of Chicago': 'activemindsuchicago@gmail.com', 'African and Caribbean Student Association': 'cvleito@uchicago.edu', 'Aikido Kokikai': 'nahmadc@uchicago.edu', 'Alpha Kappa Psi': 'edwardchang@uchicago.edu', 'Alpha Phi Omega': 'uchi.apo.president@gmail.com', 'American Civil Liberties Union at University of Chicago': 'acluboard@lists.uchicago.edu', 'American Constitution Society': 'acs@law.uchicago.edu', 'American Medical Student Association': None, 'American Red Cross of University of Chicago': 'rkhouri@uchicago.edu', 'Amnesty International': 'eckere@uchicago.edu', 'Animal Legal Defense Fund - The University of Chicago Law School': 'ntschepik@uchicago.edu', 'Animal Welfare Society': 'petrucci@uchicago.edu', 'Anthropology Students Association': 'frevelolarotta@uchicago.edu', 'Apsara': 'uchicagoapsara@gmail.com', 'Arab Student Association': 'malakarafa@uchicago.edu', ...}
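Some organizations in the output above have no address listed (the value is `None`). If only usable addresses are wanted, the dictionary could be split along the lines of the sketch below. The `split_by_email` helper is a hypothetical name introduced here for illustration, and the sample holds three entries copied from the output.

```python
def split_by_email(emails):
    """Split a name -> email mapping into (entries with an email, names without one)."""
    with_email = {name: addr for name, addr in emails.items() if addr}
    missing = [name for name, addr in emails.items() if not addr]
    return with_email, missing


# Three entries copied from the output shown above.
sample = {
    'A Cappella Council': 'uchicagoacappella@gmail.com',
    'American Medical Student Association': None,
    'Apsara': 'uchicagoapsara@gmail.com',
}

with_email, missing = split_by_email(sample)
print(len(with_email), missing)
```

Keeping the names without an address in a separate list makes it easy to follow them up by hand instead of silently dropping them.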
