How do I scrape only the first column and the href links of this website's HTML table into a DataFrame?

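from urllib import request

import pandas as pd
from bs4 import BeautifulSoup

url = "http://www.annualreports.com/Companies?search="
html = request.urlopen(url).read().decode('utf8')
soup = BeautifulSoup(html, "html.parser")

# Empty frame the asker wants to fill with company names and links
df = pd.DataFrame(columns=['Company', 'Href'])

tables = soup.findChildren('table')
my_table = tables[0]
rows = my_table.findChildren(['th', 'tr'])

# Prints the text of every cell, not just the first column
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        print(value)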

1 Answer

User
#1 · Posted 2024-04-20 02:18:06

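You can use nth-of-type to restrict the match to the first column (td). Because the same node carries both the href and the text of interest, you can use a tuple inside a list comprehension to retrieve both items from that node, then rely on pandas at the end to handle the columns. I am using bs4 4.7.1. I am not sure which version first supported this, but given the improvements that have been made you really want to be on the latest bs4.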
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

r = requests.get('http://www.annualreports.com/Companies?search=')
soup = bs(r.content, 'lxml')
df = pd.DataFrame(
    [(i.text, 'http://www.annualreports.com' + i['href'])
     for i in soup.select('tbody td:nth-of-type(1) a')],
    columns=['Company', 'Link'])
print(df)
Some explanation:
soup.select('tbody td:nth-of-type(1) a')
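selects all a tags that are descendants of the first column (td); tbody is there to make sure the right table is used. tbody, td and a are type selectors, which select by tag name, while the spaces between them are descendant combinators, meaning the element matched on the right is a descendant of the element matched on the left.
select returns a list.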
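To see what the selector matches without requesting the site, here is a minimal sketch against a small inline table. The markup, company names, and paths are invented for illustration and are not taken from the real page; html.parser is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely shaped like a listing table (not the real page).
html = """
<table>
  <thead><tr><th>Company</th><th>Industry</th></tr></thead>
  <tbody>
    <tr><td><a href="/Company/acme">Acme Corp</a></td><td><a href="/Industry/tools">Tools</a></td></tr>
    <tr><td><a href="/Company/globex">Globex</a></td><td><a href="/Industry/energy">Energy</a></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Only <a> tags inside the first <td> of each row are matched;
# the links in the second column are skipped.
links = soup.select("tbody td:nth-of-type(1) a")
pairs = [(a.text, a["href"]) for a in links]
print(pairs)
```

Changing the selector to td:nth-of-type(2) would pick up the links in the second column instead.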
The list comprehension
[(i.text, 'http://www.annualreports.com' + i['href']) for i in soup.select('tbody td:nth-of-type(1) a')]
can be rewritten as:
the_list = []
for i in soup.select('tbody td:nth-of-type(1) a'):
    # tuple that is then added to a final list
    the_list.append((i.text, 'http://www.annualreports.com' + i['href']))
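While iterating over each a tag in the list returned by select, the current node (the a tag) carries both the title, available through its .text attribute, and the href as an attribute. Attribute values can be accessed as shown. The 'http://www.annualreports.com' prefix is added to make the links complete (otherwise they are relative, lacking protocol and domain).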
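As an alternative to plain string concatenation, urljoin from the standard library's urllib.parse can resolve a relative href against a base URL. This is a swapped-in technique rather than what the answer itself does, and the example paths below are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.annualreports.com"

# Hypothetical href values; the real page may use different paths.
relative_href = "/Company/acme"
absolute_href = "http://www.annualreports.com/Company/acme"

# A relative path is resolved against the base URL.
resolved_relative = urljoin(BASE_URL, relative_href)
# An already absolute URL is returned unchanged.
resolved_absolute = urljoin(BASE_URL, absolute_href)

print(resolved_relative)
print(resolved_absolute)
```

One reason to prefer this is that it avoids a doubled prefix if some hrefs on a page happen to be absolute already.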
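The list is then passed to pandas, where the list of tuples (call it the_list, as in the example) is unpacked into two columns. The columns parameter of pd.DataFrame is used to name the columns in the DataFrame.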
df = pd.DataFrame(the_list , columns = ['Company','Link']) # the_list being the result of the list comprehension
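The unpacking step can be tried on its own with a couple of made-up rows. The company names and links below are placeholders standing in for the scraped values, not real results from the site.

```python
import pandas as pd

# Hypothetical rows standing in for the result of the list comprehension.
the_list = [
    ("Acme Corp", "http://www.annualreports.com/Company/acme"),
    ("Globex", "http://www.annualreports.com/Company/globex"),
]

# Each tuple becomes one row; `columns` names the tuple positions in order.
df = pd.DataFrame(the_list, columns=["Company", "Link"])

print(df.shape)
print(df.loc[0, "Company"])
```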
