How do I ignore th tags when parsing an HTML table?

Asked 2025-05-01 04:31

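Hi, I've just started learning to parse HTML tables with Python and BeautifulSoup4. Everything was going fine until I ran into an odd table that uses a 'th' tag in the middle of the table, which makes my parser stop with an "index out of range" error. I've searched StackOverflow and Google but haven't found a solution. My question is: how can I ignore or strip this extra 'th' tag while parsing the table?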

Here is the code I have so far:

from mechanize import Browser
from bs4 import BeautifulSoup

mech = Browser()
url = 'https://www.moscone.com/site/do/event/list'
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', { 'id' : 'list' })

for row in table.findAll('tr')[3:]:
    col = row.findAll('td')
    date = col[0].string
    name = col[1].string
    location = col[2].string
    record = (name, date, location)
    final = ','.join(record)
    print(final)

Here is the small piece of HTML that triggers the error:

  <td>
   Convention
  </td>
 </tr>
 <tr>
  <th class="title" colspan="4">
   Mon Dec 01 00:00:00 PST 2014
  </th>
 </tr>
 <tr>
  <td>
   12/06/14 - 12/09/14
  </td>

I want to get the data above and below this extra 'th' tag, because it marks the start of a new month in the table.

1 Answer

You can check whether the row contains a th first, and only parse its cells when it doesn't, like this:

for row in table.find_all('tr')[3:]:
    # Skip the month-header rows: they hold a th and no td cells,
    # so col[0] would raise IndexError on them
    if not row.find_all('th'):
        col = row.find_all('td')
        date = col[0].string
        name = col[1].string
        location = col[2].string
        record = (name, date, location)
        final = ','.join(record)
        print(final)

This is the output I got from the URL you provided, with no IndexError:

Out & Equal Workplace,11/03/14 - 11/06/14,Moscone West 
Samsung Developer Conference,11/11/14 - 11/13/14,Moscone West  
North American Spine Society (NASS) Annual Meeting,11/12/14 - 11/15/14,Moscone South and Esplanade Ballroom 
San Francisco International Auto Show,11/22/14 - 11/29/14,Moscone North & South 
67th Annual Meeting of the APS Division of Fluid Dynamics,11/23/14 - 11/25/14,Moscone North, South and West 
American Society of Hematology,12/06/14 - 12/09/14,Moscone North, South and West 
California School Boards Association,12/12/14 - 12/16/14,Moscone North & Esplanade Ballroom 
American Geophysical Union,12/15/14 - 12/19/14,Moscone North & South
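
If you also want to keep the month that each th row announces, one option is to remember the most recent th text instead of just skipping it. The following is only a minimal sketch under a few assumptions: SAMPLE_HTML is made-up markup that mimics the structure in your snippet (I'm assuming four columns, as the colspan="4" suggests), and parse_events is a hypothetical helper name, not something from the original code. It uses get_text(strip=True) rather than .string, since .string returns None when a cell contains nested tags, and it writes the rows with the csv module so that locations containing commas, such as "Moscone North, South and West", get quoted.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up sample that mimics the table structure shown in the question.
SAMPLE_HTML = """
<table id="list">
 <tr><th class="title" colspan="4">Sat Nov 01 00:00:00 PDT 2014</th></tr>
 <tr>
  <td>11/23/14 - 11/25/14</td>
  <td>Fluid Dynamics Meeting</td>
  <td>Moscone North, South and West</td>
  <td>Convention</td>
 </tr>
 <tr><th class="title" colspan="4">Mon Dec 01 00:00:00 PST 2014</th></tr>
 <tr>
  <td>12/06/14 - 12/09/14</td>
  <td>American Society of Hematology</td>
  <td>Moscone North, South and West</td>
  <td>Convention</td>
 </tr>
</table>
"""


def parse_events(html):
    """Return (month, name, date, location) tuples.

    The month is the text of the most recent th row seen before the event row.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="list")
    events = []
    month = None
    for row in table.find_all("tr"):
        header = row.find("th")
        if header is not None:
            # A th row starts a new month: remember it, but emit no record.
            month = header.get_text(strip=True)
            continue
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) < 3:
            # Not a complete event row, so skip it rather than index into it.
            continue
        date, name, location = cells[:3]
        events.append((month, name, date, location))
    return events


events = parse_events(SAMPLE_HTML)

# csv.writer quotes any field that contains a comma.
buffer = io.StringIO()
csv.writer(buffer).writerows(events)
print(buffer.getvalue())
```

On the real page you would pass the html you read with mechanize to parse_events instead of SAMPLE_HTML. Because th rows and short rows are skipped explicitly, the [3:] slice is not needed in this version, though I have not checked that against the live table.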
