Web scraping: how to append columns

Posted 2024-06-06 00:38:13


I am scraping multiple Google Scholar pages, and I have written code with BeautifulSoup to extract information such as the title, authors, and journal.

Here is an example page: https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en

Now I want to extract the h-index, i10-index, and citation counts. When I inspect the page, I see that all of these share the same class name (gsc_rsb_std). Given that, my questions are:

  1. How do I extract this information with BeautifulSoup? The code below extracts the citation information, but since the class name is the same, how do I do this for the other two?
columns['Citations'] = soup.findAll('td',{'class':'gsc_rsb_std'}).text
  2. Name, citations, h-index, and i10-index each have a single value, but there are multiple rows of papers. Ideally, I would like my output in the following form:
Name  h-index  paper1
Name  h-index  paper2
Name  h-index  paper3

I tried the following and got output in that shape, but with only the last paper repeated. Not sure what is going on here:

soup = BeautifulSoup(driver.page_source, 'html.parser')
columns = {}
columns['Name'] = soup.find('div', {'id': 'gsc_prf_in'}).text
           
papers = soup.find_all('tr', {'class': 'gsc_a_tr'})

for paper in papers:        
   columns['title'] = paper.find('a', {'class': 'gsc_a_at'}).text
   File.append(columns)

My output looks like this, so it seems something is wrong with the loop:

Name h-index paper3
Name h-index paper3
Name h-index paper3

Thanks in advance for any help.


2 Answers

You can use the SelectorGadget Chrome extension to grab CSS selectors visually. Below are some quick examples with explanations.

An element highlighted in:

  • red is excluded from the search
  • green is included in the search
  • yellow is a guess at what the user is looking for and may need further clarification

[screenshot: selecting the h-index row with SelectorGadget]

[screenshot: selecting the i10-index row with SelectorGadget]

Online IDE to test (bs4_results folder -> get_author_info.py -> uncomment the code and examples in the function):

from bs4 import BeautifulSoup
import requests, lxml, os

# A real browser User-Agent helps avoid Scholar's bot detection
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

proxies = {
  'http': os.getenv('HTTP_PROXY')
}

html = requests.get('https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en', headers=headers, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')

# The "Cited by" sidebar table: row 1 = citations, row 2 = h-index, row 3 = i10-index;
# in each row the first .gsc_rsb_std cell is "All" and the second is "Since 2016"
for cited_by_public_access in soup.select('.gsc_rsb'):
  citations_all = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std').text
  citations_since2016 = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std').text
  h_index_all = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std').text
  h_index_2016 = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std').text
  i10_index_all = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std').text
  i10_index_2016 = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std').text
  articles_num = cited_by_public_access.select_one('.gsc_rsb_m_a:nth-child(1) span').text.split(' ')[0]
  articles_link = cited_by_public_access.select_one('#gsc_lwp_mndt_lnk')['href']

  print('Citation info:')
  print(f'{citations_all}\n{citations_since2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n{articles_num}\nhttps://scholar.google.com{articles_link}\n')

Output:

Citation info:
55399
34899
69
59
148
101
23
https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=cp-8uaAAAAAJ

Alternatively, you can do the same thing with the Google Scholar Author API from SerpApi. It's a paid API with a free trial of 5,000 searches.

The main difference in this particular example is that you don't have to guess and tinker with how to extract certain elements from the HTML page.

Another thing is that you don't have to think about how to solve CAPTCHAs (they may appear at some point) or find good proxies if many requests are needed.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google_scholar_author",
  "author_id": "cp-8uaAAAAAJ",
  "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

# 'cited_by' table rows: 0 = citations, 1 = h-index, 2 = i10-index
citations_all = results['cited_by']['table'][0]['citations']['all']
citations_2016 = results['cited_by']['table'][0]['citations']['since_2016']
h_index_all = results['cited_by']['table'][1]['h_index']['all']
h_index_2016 = results['cited_by']['table'][1]['h_index']['since_2016']
i10_index_all = results['cited_by']['table'][2]['i10_index']['all']
i10_index_2016 = results['cited_by']['table'][2]['i10_index']['since_2016']

print(f'{citations_all}\n{citations_2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n')

public_access_link = results['public_access']['link']
public_access_available_articles = results['public_access']['available']

print(f'{public_access_link}\n{public_access_available_articles}')

Output:

55399
34899
69
59
148
101

https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=cp-8uaAAAAAJ
23
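
The same response also includes the author's articles, so the per-paper rows from the question can be built directly. A minimal sketch, assuming the `results` dict from above and that this engine returns `author` and `articles` fields (check SerpApi's docs for the exact response shape):

# Hedged sketch: combine the overall stats with one row per article to get
# the "Name  h-index  paper" shape asked for in the question
rows = [
    {'Name': results['author']['name'],
     'h-index': h_index_all,
     'title': article['title']}
    for article in results.get('articles', [])
]
print(rows[:3])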

Disclaimer, I work for SerpApi.

I would consider using :has and :contains, targeting the search strings:

import requests
from bs4 import BeautifulSoup

searches = ['Citations', 'h-index', 'i10-index']
r = requests.get('https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en')
soup = BeautifulSoup(r.text, 'html.parser')

# For each label, grab the cell next to the link containing that text ("All"),
# then the following cell ("Since 2016"). Note: newer soupsieve versions
# spell :contains as :-soup-contains.
for search in searches:
    all_value = soup.select_one(f'td:has(a:contains("{search}")) + td')
    print(f'{search} All:', all_value.text)
    since_2016 = all_value.find_next('td')
    print(f'{search} since 2016:', since_2016.text)

You could also use pandas' read_html to grab that table by index.
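
A minimal sketch of that approach, assuming the "Cited by" stats table is the first table on the page (reusing `r` from above; read_html needs lxml or html5lib installed):

import io
import pandas as pd

# read_html parses every <table> in the HTML; on this page the stats
# table (Citations / h-index / i10-index) should come first
tables = pd.read_html(io.StringIO(r.text))
print(tables[0])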


Issues:

The name element has an id, and finding elements by id / with a CSS selector gives faster matching, e.g.

driver.find_element_by_id("gsc_prf_in").text

However, I don't think selenium is necessary for scraping this page.
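
As for the repeated last paper: the loop appends the same `columns` dict object on every iteration, so every entry in the list ends up pointing at whatever title was assigned last. Build a fresh dict per paper instead. A minimal sketch, assuming `soup` is parsed as in the question and that the .gsc_rsb_std cells appear in source order (citations, h-index, i10-index, each as an all/since pair):

rows = []
name = soup.find('div', {'id': 'gsc_prf_in'}).text
# Third .gsc_rsb_std cell = h-index ("All"), given the source order above
h_index = soup.find_all('td', {'class': 'gsc_rsb_std'})[2].text

for paper in soup.find_all('tr', {'class': 'gsc_a_tr'}):
    # A new dict each iteration, so each appended row keeps its own title
    rows.append({
        'Name': name,
        'h-index': h_index,
        'title': paper.find('a', {'class': 'gsc_a_at'}).text,
    })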
