How to iterate over an HTML table data set in Python

5 votes

2 answers

12056 views

Asked 2025-04-16 09:32

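This is my first post here and I'm trying to pick up some Python skills, so please bear with me :-)

Although I'm not entirely new to programming concepts (I've played around with PHP before), moving to Python has been somewhat difficult for me. I think this is mainly because I lack even a basic understanding of common "design patterns" and the like.

That said, here is the problem I'm facing: one of my projects requires writing a simple web scraper with Beautiful Soup. The data structure to be processed is similar to the example below.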

<table>
    <tr>
        <td class="date">2011-01-01</td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr>
        <td class="date">2011-01-02</td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
</table>

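The main problem is that I can't figure out how to 1) loop over the subsequent items (tr rows with class "item", containing td cells with class "headline" and "link") while 2) keeping track of the current date (tr->td with class "date"), and 3) store the processed data in an array.

In addition, all of the data will be inserted into a database, where each record must contain the following information:

Date
Headline
Link

Note that working with the database is not part of the question; I only mention it to better illustrate what I'm trying to achieve :-)

Finally, there are many ways to solve this. So while I welcome any solution, I would be very grateful if someone could explain in detail the logic and strategy they would use to "solve" this problem :-)

Lastly, sorry for asking such a newbie question.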

data processing, data storage, web scraping, HTML parsing, Beautiful Soup, database operations, loops, programming, design patterns

2 Answers

You can use the Element Tree library that ships with Python. Note that it only parses well-formed markup such as XHTML.

http://docs.python.org/library/xml.etree.elementtree.html

from xml.etree.ElementTree import ElementTree

tree = ElementTree()
tree.parse('page.xhtml')  # This is the XHTML provided in the question
root = tree.getroot()     # Returns the root "table" element
print(root.tag)           # "table"
for eachTableRow in root:
    # Iterating over root yields its direct children, the <tr> elements
    # (the older root.getchildren() was removed in Python 3.9).
    # So we're going to loop over them and check their attributes
    if 'class' in eachTableRow.attrib:
        # Good to go. Now we know to look for the headline and link
        pass
    else:
        # Okay, so look for the date
        pass

This should be enough to get you started on parsing this content.
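For reference, here is one way the two `pass` branches might be filled in. This is only a sketch: it inlines a shortened copy of the question's table so it runs on its own, the function name `extract_records` is made up for illustration, and it assumes the markup is well-formed XML, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the sample table from the question, inlined so the
# sketch is self-contained.
SAMPLE = """<table>
    <tr><td class="date">2011-01-01</td></tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr><td class="date">2011-01-02</td></tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
</table>"""


def extract_records(xhtml):
    """Return a list of (date, headline, link) tuples (hypothetical helper)."""
    root = ET.fromstring(xhtml)
    records = []
    current_date = None  # the most recent date row seen so far
    for row in root.iter('tr'):
        if row.get('class') == 'item':
            # Item row: pair it with the date we are currently tracking.
            headline = row.findtext("td[@class='headline']")
            anchor = row.find("td[@class='link']/a")
            link = anchor.get('href') if anchor is not None else None
            records.append((current_date, headline, link))
        else:
            # Any other row: update the tracked date if it carries one.
            date_cell = row.find("td[@class='date']")
            if date_cell is not None:
                current_date = date_cell.text
    return records


records = extract_records(SAMPLE)
```

Each tuple in `records` carries the date of the nearest preceding date row, which matches the three fields the question wants to store.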

Answered 2025-04-16 by Python大师


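The root of the problem is that this table was marked up for appearance rather than for semantic structure. Done properly, each date and its related items would share a common parent element. That isn't the case here, so we have to work around it.

The basic strategy is to walk through the table row by row:

If the first cell has a 'class' attribute of 'date', we take its value and update last_seen_date.
Otherwise, we extract a headline and a link, then save (last_seen_date, headline, link) to the database.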

# Assumes Beautiful Soup 4 (pip install beautifulsoup4); the older
# "import BeautifulSoup" (version 3) does not run on Python 3.
from bs4 import BeautifulSoup

fname = r'c:\mydir\beautifulSoup.html'
with open(fname, 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

items = []
last_seen_date = None
for el in soup.find_all('tr'):
    daterow = el.find('td', {'class': 'date'})
    if daterow is None:     # not a date row: get headline and link
        headline = el.find('td', {'class': 'headline'}).get_text()
        link = el.find('a').get('href')
        items.append((last_seen_date, headline, link))
    else:                   # date row: remember the new date
        last_seen_date = daterow.get_text()
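
The database side is outside the question, but for completeness, here is a rough sketch of how tuples shaped like `items` could be stored using the standard library's sqlite3 module. The table name `headlines`, the helper `save_items`, and the in-memory database are assumptions made for illustration, and the two sample tuples stand in for real scraped data.

```python
import sqlite3

# Example tuples in the same (date, headline, link) shape as `items`.
items = [
    ('2011-01-01', 'Headline', '#'),
    ('2011-01-02', 'Headline', '#'),
]


def save_items(conn, rows):
    """Insert (date, headline, link) tuples into a hypothetical table."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS headlines '
        '(date TEXT, headline TEXT, link TEXT)'
    )
    # Parameter placeholders keep scraped text from being read as SQL.
    conn.executemany(
        'INSERT INTO headlines (date, headline, link) VALUES (?, ?, ?)',
        rows,
    )
    conn.commit()


conn = sqlite3.connect(':memory:')  # in-memory database, for the sketch only
save_items(conn, items)
stored = conn.execute(
    'SELECT date, headline, link FROM headlines ORDER BY date'
).fetchall()
```

Collecting everything into a list first and writing it in one `executemany` call keeps the parsing loop free of database concerns.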

Answered 2025-04-16 by Python大师

