Table with a missing <tr>: can I parse it?

Asked 2025-04-18 13:23

I'm trying to parse a table that looks like this:

<table>
    <tr> <th> header1 </th> <th> header2 </th> </tr>
    <th> missing1 </th> <th> missing2 </th>
    <tr> <td> data1 </td> <td> data2 </td> </tr>
</table>

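I specifically need to find the row containing the word "missing". Is there a way to find it? The table renders fine in a browser, so I expected BeautifulSoup to find the row, but b.findAll('tr') does not return it.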

Addendum: here is a concrete, much more complex example: http://atlasgal.mpifr-bonn.mpg.de/cgi-bin/ATLASGAL_SEARCH_RESULTS.cgi?text_field_1=AGAL010.472%2B00.027&catalogue_field=Sextractor&gc_flag=, specifically the table whose "Line Transition" header spans several columns.

Example of the specific problem:

import requests
from bs4 import BeautifulSoup
r = BeautifulSoup(requests.get('http://atlasgal.mpifr-bonn.mpg.de/cgi-bin/ATLASGAL_SEARCH_RESULTS.cgi?text_field_1=AGAL010.472%2B00.027&catalogue_field=Sextractor&gc_flag=').content)
table = r.select('table:nth-of-type(5) tr')

table is missing this row, which is present in the source: r.select('table tr')[19]


1 Answer

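It depends on the parser. The HTML is broken, and while HTML parsers do their best to recover the data anyway, they do not all recover from errors in the same way.

BeautifulSoup can use different parsers. If you don't name one, it picks the best one installed: lxml when available, otherwise a fallback such as the standard library's html.parser. You can also use the external html5lib module, which follows the HTML5 parsing rules that browsers use: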

>>> from bs4 import BeautifulSoup
>>> broken = '''\
... <table>
...     <tr> <th> header1 </th> <th> header2 </th> </tr>
...     <th> missing1 </th> <th> missing2 </th>
...     <tr> <td> data1 </td> <td> data2 </td> </tr>
... </table>
... '''
>>> BeautifulSoup(broken, 'html.parser').select('table tr')
[<tr> <th> header1 </th> <th> header2 </th> </tr>, <tr> <td> data1 </td> <td> data2 </td> </tr>]
>>> BeautifulSoup(broken, 'lxml').select('table tr')
[<tr> <th> header1 </th> <th> header2 </th> </tr>, <tr> <td> data1 </td> <td> data2 </td> </tr>]
>>> BeautifulSoup(broken, 'html5lib').select('table tr')
[<tr> <th> header1 </th> <th> header2 </th> </tr>, <tr><th> missing1 </th> <th> missing2 </th>
    </tr>, <tr> <td> data1 </td> <td> data2 </td> </tr>]

As you can see, the html5lib parser includes the row containing the missing text in the tree:

>>> BeautifulSoup(broken, 'html5lib').select('table tr:nth-of-type(2)')
[<tr><th> missing1 </th> <th> missing2 </th>
    </tr>]
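
To pick out the row by its text rather than by its position, one option is to filter the rows that html5lib produces. This is a minimal sketch that reuses the sample table from the question; the names soup, matching_rows, and cells are illustrative and not part of the original answer:

```python
from bs4 import BeautifulSoup

broken = '''\
<table>
    <tr> <th> header1 </th> <th> header2 </th> </tr>
    <th> missing1 </th> <th> missing2 </th>
    <tr> <td> data1 </td> <td> data2 </td> </tr>
</table>
'''

# html5lib wraps the stray <th> cells in an implied <tr>, as a browser would.
soup = BeautifulSoup(broken, 'html5lib')

# Keep only the rows whose text contains the word we are looking for.
matching_rows = [tr for tr in soup.find_all('tr') if 'missing' in tr.get_text()]

for tr in matching_rows:
    # Collect the stripped text of every header or data cell in the row.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    print(cells)
```

For the sample table this should yield a single row with the cells missing1 and missing2. On a real page the substring test may need to be narrowed, since any row that mentions the word anywhere in its text will match.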

If you need to find a specific table by its title, search for the title text first, then go up to its parent table:

import requests
from bs4 import BeautifulSoup

url = 'http://atlasgal.mpifr-bonn.mpg.de/cgi-bin/ATLASGAL_SEARCH_RESULTS.cgi?text_field_1=AGAL010.472%2B00.027&catalogue_field=Sextractor&gc_flag='
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html5lib')

# Locate the title text, then walk up to the table that contains it.
title = soup.find(string='Fitted Parameters for Observed Molecular Transitions')
table = title.find_parent('table')
for row in table.find_all('tr'):
    print(row)
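
If installing html5lib is not an option, the stray cells are not lost under html.parser. That parser keeps the tree as written, so the cells sit directly under <table> instead of inside a <tr>, which is why select('table tr') skips them. A rough sketch of collecting them, again using the sample table from the question (orphans and texts are illustrative names, and the approach is an assumption about what suits your page, not something from the original answer):

```python
from bs4 import BeautifulSoup

broken = '''\
<table>
    <tr> <th> header1 </th> <th> header2 </th> </tr>
    <th> missing1 </th> <th> missing2 </th>
    <tr> <td> data1 </td> <td> data2 </td> </tr>
</table>
'''

# html.parser does not repair the structure, so unwrapped cells have no <tr> ancestor.
soup = BeautifulSoup(broken, 'html.parser')

# Select every header or data cell that is not inside any row.
orphans = [cell for cell in soup.find_all(['th', 'td'])
           if cell.find_parent('tr') is None]

texts = [cell.get_text(strip=True) for cell in orphans]
print(texts)
```

Note that this lumps all orphaned cells together. If a page has several broken rows, you would have to split them into rows yourself, so html5lib is likely the simpler route when it is available.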

Answered 2025-04-18 by Python大师
