How to extract web page data from a table with Python, BeautifulSoup, and mechanize

2 votes

1 answer

2150 views

Asked 2025-04-16 23:39

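I want to extract the data from the table on this page: http://www.pgatour.com/r/stats/info/xm.html?101 and then save it as a .csv file so that I can use it in iWork Numbers.

I am trying to do this with Python, BeautifulSoup, and mechanize. I have looked at many other examples but still have not succeeded. This is as far as I have got: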

from BeautifulSoup import BeautifulSoup, SoupStrainer
from mechanize import Browser
import re
br = Browser()
response = br.open("http://www.pgatour.com/r/stats/info/xm.html?101").read()

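I then inspected the page with Firebug in Firefox and found that the data I need sits between <tbody> and </tbody>. But I do not know how to extract it.

Any help would be much appreciated.

data-parsing html-parsing mechanize beautifulsoup data-scraping csv web-data-extraction firebug

1 Answer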

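On the main page, the tournament statistics are populated by JavaScript; the markup looks like this: <div class="tourViewData"> ... populateDDs();. However, BeautifulSoup (BS for short) only parses the HTML it is given and does not execute JavaScript, as many other StackOverflow questions point out. I do not know how to get around that. In the worst case, you can select that part of the rendered HTML and save it to a local HTML file as a workaround.

First, set up a BeautifulSoup object for that URL (I am using twill rather than raw mechanize; substitute your mechanize equivalent here):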

from BeautifulSoup import BeautifulSoup, SoupStrainer
#from mechanize import Browser
from twill.commands import *
import re

go("http://www.pgatour.com/r/stats/info/xm.html?101")
s = BeautifulSoup(get_browser().get_html())

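In any case, the statistics table you are looking for is the one marked with <tbody><tr class="tourStatTournHead">. To complicate things slightly, its row tags alternate between <tr class="tourStatTournCellAlt" and <tr class=""... . We should find the first <tr class="tourStatTournCellAlt" and then process every <tr> after it, except the header rows (<tr class="tourStatTournHead">).

To iterate over the rows: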
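For readers who only have the saved HTML and Python 3 at hand, the row selection described above can be sketched with the standard library's html.parser instead of BeautifulSoup. This is a rough illustration, not the method used in this answer: the SAMPLE markup, the RowCollector class, and the extract_rows function are all made up for the example, and the real page may be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical sample markup using the class names mentioned in this answer;
# the real page may differ.
SAMPLE = """
<table class="tourStatTournTbl"><tbody>
<tr class="tourStatTournHead"><th>RANK</th><th>LAST</th><th>PLAYER</th></tr>
<tr class="tourStatTournCellAlt"><td>1</td><td>1</td><td><a href="#">Player A</a></td></tr>
<tr class=""><td>T2</td><td>3</td><td>Player B</td></tr>
</tbody></table>
"""


class RowCollector(HTMLParser):
    """Collect the cell text of every <tr>, skipping header rows."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished data rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            classes = (dict(attrs).get("class") or "").split()
            # Header rows are ignored entirely
            self._row = None if "tourStatTournHead" in classes else []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text nested inside a cell (e.g. within <a>) is gathered as well
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = extract_rows(SAMPLE)
```

Because the parser gathers all text inside a cell, a player name wrapped in a link comes out as plain text without special handling. Rows with class="" and class="tourStatTournCellAlt" are treated alike, so the alternating attribute does not matter here.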

tbl = s.find('table', {'class':'tourStatTournTbl'})

def extract_text(ix, tg):
    """Return the text of cell tg; ix is the zero-based column index."""
    if ix == 2: # player name field, may be hierarchical
        children = tg.findChildren()
        tg = children[0] if children else tg
    # Encode explicitly so non-ASCII player names do not raise an error
    return tg.text.encode('utf-8')

for rec in tbl.findAll('tr'): # {'class':'tourStatTournCellAlt'}):
    # Skip header rows (rows without a class attribute are kept)
    if 'tourStatTournHead' in (rec.get('class') or ''):
        continue
    # Extract all fields
    (rank_tw,rank_lw,player,rounds,avg,tot_dist,tot_drives) = \
        [extract_text(i,t) for (i,t) in enumerate(rec.findChildren(recursive=False))]
    # ... do stuff

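The extract_text helper handles the player name field, which can be nested (for example when a Titleist logo is embedded in it). You will probably want to convert most fields to integers with int(), except the player name (a string) and the average (a float); if so, remember to strip the optional 'T' (for a tie) from the rank fields and remove the commas from the total distance (tot_dist).

Answered 2025-04-16 by Python大师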
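The conversions just described, plus the .csv output the question asks for, might look roughly like the following in Python 3 using only the standard library. Treat it as a sketch: clean_row and write_csv are hypothetical helper names, the CSV column headers are my own guesses at what the seven fields mean, the sample values are invented, and treating an empty rank as None is an assumption about the data.

```python
import csv
import io


def clean_row(rank_tw, rank_lw, player, rounds, avg, tot_dist, tot_drives):
    """Convert the raw cell strings of one row to typed values."""

    def to_rank(text):
        # Drop the optional leading 'T' that marks a tie
        text = text.strip()
        if text.upper().startswith("T"):
            text = text[1:]
        # Assumption: an empty rank cell is possible and becomes None
        return int(text) if text else None

    def to_int(text):
        # Remove thousands separators such as the comma in "25,240"
        return int(text.replace(",", "").strip())

    return [to_rank(rank_tw), to_rank(rank_lw), player.strip(),
            to_int(rounds), float(avg), to_int(tot_dist), to_int(tot_drives)]


def write_csv(raw_rows, out):
    """Write a header line and the cleaned rows to the file-like object out."""
    writer = csv.writer(out)
    # Hypothetical column names, not taken from the page
    writer.writerow(["rank_this_week", "rank_last_week", "player",
                     "rounds", "avg", "total_distance", "total_drives"])
    for raw in raw_rows:
        writer.writerow(clean_row(*raw))


# Invented sample rows in the same field order as the loop above
raw_rows = [
    ("1", "T2", "Player A", "40", "315.5", "25,240", "80"),
    ("T2", "1", "Player B", "38", "312.0", "23,712", "76"),
]

buf = io.StringIO()
write_csv(raw_rows, buf)
csv_text = buf.getvalue()
```

To write a real file instead of an in-memory buffer, pass a file opened with open("stats.csv", "w", newline="") to write_csv; the file name here is only an example.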
