Scraping a table and getting more information from its links

1 vote

2 answers

567 views

Asked 2025-04-18 01:15

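I'm using Python and BeautifulSoup to scrape a table, and I have a fairly good handle on getting most of the information I need. Here is a shortened version of the table I want to scrape.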

<tr>
  <td><a href="/wiki/Joseph_Carter_Abbott" title="Joseph Carter Abbott">Joseph Carter Abbott</a></td>
  <td>1868–1872</td>
  <td>North Carolina</td>
  <td><a href="/wiki/Republican_Party_(United_States)" title="Republican Party (United States)">Republican</a></td>
</tr>
<tr>
  <td><a href="/wiki/James_Abdnor" title="James Abdnor">James Abdnor</a></td>
  <td>1981–1987</td>
  <td>South Dakota</td>
  <td><a href="/wiki/Republican_Party_(United_States)" title="Republican Party (United States)">Republican</a></td>
</tr>
<tr>
  <td><a href="/wiki/Hazel_Abel" title="Hazel Abel">Hazel Abel</a></td>
  <td>1954</td>
  <td>Nebraska</td>
  <td><a href="/wiki/Republican_Party_(United_States)" title="Republican Party (United States)">Republican</a></td>
</tr>

http://en.wikipedia.org/wiki/List_of_former_United_States_senators

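The information I want is: name, description, years in office, state, and party.

The description is the first paragraph of text on each person's page. I know how to get it on its own, but I don't know how to combine it with the name, years, state, and party, because I have to visit a different page for each person.

Oh, and I also need to write all of this to a CSV file.

Thanks!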

Tags: information extraction, beautifulsoup, web page parsing, data scraping, hyperlinks, table processing, CSV files, data integration

2 Answers

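If you're using BeautifulSoup, you don't navigate to the other page by clicking a link the way a browser does. You simply make a new request for that page using its URL. So your code might look something like this: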

import urllib, csv

#quick way to get the HTML text as a string for a given url (Python 2)
#defined first so it already exists when the loop below calls it
def get_html(url):
    return urllib.urlopen(url).read()

#senator_rows and the other get_* helpers are placeholders for you to fill in
with open('out.csv','w') as f:
    csv_file = csv.writer(f)

    #loop through the rows of the table
    for row in senator_rows:
        name = get_name(row)

        # ... extract the other data from the <tr> elt

        senator_page_url = get_url(row)

        #get description from HTML text of senator's page
        description = get_description(get_html(senator_page_url))

        #write this row to the CSV file
        csv_file.writerow([name, ..., description])

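Note that in Python 3.x you need to use urllib.request instead of urllib, and you also have to decode the bytes that read() returns. It sounds like you already know how to fill in the other get_* functions I left in there, so I hope this helps!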
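As a rough sketch of the Python 3 variant described above, the helper below fetches a page with urllib.request and decodes the bytes into text. The absolute_url helper, the LIST_URL constant, and the UTF-8 fallback are assumptions added for illustration; they are not part of the original answer.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumed base: the list page linked in the question.
LIST_URL = "http://en.wikipedia.org/wiki/List_of_former_United_States_senators"


def absolute_url(href, base=LIST_URL):
    """Turn a relative href such as /wiki/Hazel_Abel into a full URL."""
    return urljoin(base, href)


def get_html(url):
    """Fetch url and return the page as text (str) rather than bytes."""
    with urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
        # Prefer the charset the server declares; fall back to UTF-8 (assumption).
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset)
```

With these in place, a call such as get_html(absolute_url("/wiki/Hazel_Abel")) would return the senator's page as a string that can be handed to the description helper.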

Answered 2025-04-18 by Python大师


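I'd like to expand on @anrosent's answer: sending the request while you parse is a good, stable approach. However, the function you use to fetch the description also has to work reliably, because if it hits a None value and raises a 'NoneType' error, the whole run aborts.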
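One way to keep a single bad page from aborting the whole run is to wrap the description call, as in the minimal sketch below. The name describe_or_default and the choice of exceptions to catch are assumptions for illustration, not something the answer prescribes.

```python
def describe_or_default(fetch_description, url, default=""):
    """Call fetch_description(url); return default instead of raising.

    fetch_description is any callable that takes a URL and returns text.
    """
    try:
        text = fetch_description(url)
    except (AttributeError, IndexError, OSError):
        # AttributeError / IndexError: an expected tag was missing on the page.
        # OSError: network-level failure while requesting the page.
        return default
    # Treat an explicit None result the same way as a failure.
    return default if text is None else text
```

The scraper could then call describe_or_default(get_description, url) inside its loop and still write the remaining columns when a description is unavailable.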

Here is how I approached it (note that I use the Requests library rather than urllib or urllib2 because I'm more familiar with it; feel free to swap it out, the logic is the same):

from bs4 import BeautifulSoup as bsoup
import requests as rq
import csv

ofile = open("presidents.csv", "wb")
f = csv.writer(ofile)
f.writerow(["Name","Description","Years","State","Party"])
base_url = "http://en.wikipedia.org/wiki/List_of_former_United_States_senators"
r = rq.get(base_url)
soup = bsoup(r.content)
all_tables = soup.find_all("table", class_="wikitable")

def get_description(url):
    r = rq.get(url)
    soup = bsoup(r.content)
    desc = soup.find_all("p")[0].get_text().strip().encode("utf-8")
    return desc

complete_list = []
for table in all_tables:
    trs = table.find_all("tr")[1:] # Ignore the header row.
    for tr in trs:
        tds = tr.find_all("td")
        first = tds[0].find("a") 
        name = first.get_text().encode("utf-8")
        desc = get_description("http://en.wikipedia.org%s" % first["href"])
        years = tds[1].get_text().encode("utf-8")
        state = tds[2].get_text().encode("utf-8")
        party = tds[3].get_text().encode("utf-8")
        f.writerow([name, desc, years, state, party])

ofile.close()

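However, this attempt stops at the row right after David Barton. If you look at the page, that is probably because his entry takes up two rows. You will need to solve that part yourself. The traceback is as follows: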

Traceback (most recent call last):
  File "/home/nanashi/Documents/Python 2.7/Scrapers/presidents.py", line 25, in <module>
    name = first.get_text().encode("utf-8")
AttributeError: 'NoneType' object has no attribute 'get_text'
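A possible way around this traceback, sketched for Python 3 with bs4, is to check whether the first cell actually contains a link before calling get_text on it. The sketch assumes that a row without a link in its first cell is a continuation of the previous senator (a name cell spanning several rows), so it reuses the previous name and link. The function name iter_senator_rows is hypothetical, and the real page may need different handling.

```python
from bs4 import BeautifulSoup


def iter_senator_rows(table):
    """Yield (name, href, years, state, party) for each data row of table."""
    last_name, last_href = None, None
    for tr in table.find_all("tr"):
        tds = tr.find_all("td")
        if not tds:
            continue  # header rows use <th>, so they have no <td> cells
        link = tds[0].find("a")
        if link is not None and len(tds) >= 4:
            # Normal row: name cell followed by years, state, party.
            last_name, last_href = link.get_text(strip=True), link["href"]
            rest = tds[1:4]
        elif last_name is not None and len(tds) >= 3:
            # Assumed continuation row: the name cell spans from the row above.
            rest = tds[0:3]
        else:
            continue  # unrecognised layout, skip rather than crash
        years, state, party = (td.get_text(strip=True) for td in rest)
        yield last_name, last_href, years, state, party
```

Each yielded tuple can be combined with the description fetched from href and passed to the CSV writer.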

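Also, note that my get_description function comes before the main loop. That is simply because a function has to be defined before it is called. Finally, my get_description function is not perfect: if the first p tag on an individual page is not the one you want, it will fail.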
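A slightly more forgiving variant, sketched below for Python 3 with bs4, returns the first paragraph that actually contains text instead of blindly taking the first p tag. It only covers the case where the leading paragraphs are empty; the function name first_paragraph is hypothetical, and pages whose first non-empty paragraph is still not the biography would need further rules.

```python
from bs4 import BeautifulSoup


def first_paragraph(html):
    """Return the text of the first non-empty <p> in html, or "" if none."""
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        text = p.get_text().strip()
        if text:  # skip paragraphs that contain only whitespace
            return text
    return ""
```

Because it takes an HTML string, it can be fed the body of a response from Requests or urllib alike.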

Sample result:

[image: screenshot of the sample output]

Note the rows that came out wrong, such as the description for Maryon Allen. Fixing those is also up to you.

I hope this points you in the right direction.

Answered 2025-04-18 by Python大师

