How do I find the table that follows a text string using BeautifulSoup in Python?

6 votes

1 answer

9441 views

Asked 2025-04-16 16:00

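I am trying to extract data from several web pages that do not lay out their tables in a consistent way. I need to write code that searches for a specific text string and then finds the table that immediately follows that string, so that I can extract the table's contents. This is the code I have so far: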

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile('Table 1',re.IGNORECASE) # Also need to figure out how to ignore space
foundtext = soup.findAll('p',text=searchtext)
soupafter = foundtext.findAllNext()
table = soupafter.find('table') # find the next table after the search string is found
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
print

However, I get the following error:

    soupafter = foundtext.findAllNext()
AttributeError: 'ResultSet' object has no attribute 'findAllNext'

Is there a simple way to do this with BeautifulSoup?

data-extraction text-search beautifulsoup web-parsing debugging table-processing

1 answer

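The error occurs because findAllNext is a method of Tag objects, whereas foundtext is a ResultSet, which is essentially a list of the matching tags or strings. You could loop over each item in foundtext, but depending on your needs, find may be enough, since it returns only the first match.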
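If you do want the "loop over every match" route, a minimal sketch might look like the one below. Note the assumptions: it targets Python 3 and the newer bs4 package (BeautifulSoup 4) rather than the BeautifulSoup 3 import used in the question, it uses the built-in html.parser, and tables_after is a hypothetical helper name, not something the library provides.

```python
import re
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4, not the BS3 module from the question

# Shortened copy of the question's sample markup (one row per table).
html = (
    '<html><body>'
    '<p align="center"><b><font size="2">Table 1</font></b>'
    '<table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr></table>'
    '<p align="center"><b><font size="2">Table 2</font></b>'
    '<table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr></table>'
    '</body></html>'
)


def tables_after(soup, pattern):
    """Hypothetical helper: pair each matching text node with the next <table>."""
    results = []
    # find_all returns a ResultSet (a list), so iterate over its items;
    # find_next is then called on each individual item, not on the list.
    for text_node in soup.find_all(string=pattern):
        table = text_node.find_next('table')
        if table is None:
            continue  # no table follows this match
        rows = [
            [td.get_text(strip=True) for td in tr.find_all('td')]
            for tr in table.find_all('tr')
        ]
        results.append((str(text_node), rows))
    return results


soup = BeautifulSoup(html, 'html.parser')
for label, rows in tables_after(soup, re.compile(r'Table\s+\d', re.IGNORECASE)):
    print(label)
    for row in rows:
        print('  ' + ' | '.join(row))
```

The search here is on the text node itself instead of on the p tag, so the lookup does not depend on how the heading text is wrapped in b and font tags.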

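Here is a modified version of your code. After changing foundtext to use soup.find, I found and fixed the same problem with table. I also changed your regular expression so that it tolerates any amount of whitespace between the words: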

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile(r'Table\s+1',re.IGNORECASE)
foundtext = soup.find('p',text=searchtext) # Find the first <p> tag with the search text
table = foundtext.findNext('table') # Find the first <table> tag that follows it
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print

This code outputs:

1. row 1, cell 1| 1. row 1, cell 2|
1. row 2, cell 1| 1. row 2, cell 2|
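One thing to keep in mind about the regular expression: \s+ requires at least one whitespace character, so it would not match "Table1", and without an anchor the pattern also matches inside "Table 10". A quick check with only the standard re module shows the difference; the sample strings and the stricter pattern are made up for illustration and are not part of the original code.

```python
import re

# Illustrative sample strings (assumptions, not taken from any real page).
candidates = ['Table 1', 'table   1', 'TABLE\n1', 'Table1', 'Table 10']

# The pattern from the answer: one or more whitespace characters required.
pattern = re.compile(r'Table\s+1', re.IGNORECASE)
matched = [s for s in candidates if pattern.search(s)]

# A stricter variant: whitespace is optional, and \b stops "1" from
# matching the first digit of a longer number such as "10".
strict = re.compile(r'Table\s*1\b', re.IGNORECASE)
strict_matched = [s for s in candidates if strict.search(s)]

print(matched)         # ['Table 1', 'table   1', 'TABLE\n1', 'Table 10']
print(strict_matched)  # ['Table 1', 'table   1', 'TABLE\n1', 'Table1']
```

Which variant fits depends on how the headings are written on the pages you are scraping.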

Answered 2025-04-16 by Python大师
