In Python, how do I use BeautifulSoup to find the table that follows a text string?

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile('Table 1',re.IGNORECASE) # Also need to figure out how to ignore space
foundtext = soup.findAll('p',text=searchtext)
soupafter = foundtext.findAllNext()
table = soupafter.find('table') # find the next table after the search string is found
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print

1 answer

User

Answer #1 · Posted 2024-04-20 09:11:08

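The error occurs because findAllNext is a method of Tag objects, whereas foundtext is a ResultSet object, which is a list of matching tags or strings. You could iterate over each tag in foundtext, but depending on your needs it may be enough to use find, which returns only the first matching tag.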
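The difference between the two return types can be seen in a small sketch. It assumes the newer bs4 package (BeautifulSoup 4) under Python 3 rather than the BeautifulSoup 3 import used in the question, so the methods are spelled find_all, find and find_all_next. The HTML snippet is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document, not the HTML from the question.
html = "<div><p>Table 1</p><table><tr><td>a</td></tr></table></div>"
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all("p")  # ResultSet: a list of every matching tag
first = soup.find("p")        # a single Tag, or None if nothing matches

# A ResultSet is a list subclass; it has no navigation methods of its own.
print(isinstance(matches, list))
print(hasattr(matches, "find_all_next"))
print(hasattr(first, "find_all_next"))

# Index into the list (or loop over it) to reach the Tag methods.
tables_after = [t.name for t in matches[0].find_all_next("table")]
print(tables_after)
```

Calling a Tag method directly on the ResultSet raises an AttributeError, which is the kind of failure the question ran into.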

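Here is a modified version of your code. After changing foundtext to use soup.find, I found and fixed the same problem with table. I also changed your regular expression to allow any run of whitespace between the words: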

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile(r'Table\s+1',re.IGNORECASE)
foundtext = soup.find('p',text=searchtext) # Find the first <p> tag with the search text
table = foundtext.findNext('table') # Find the first <table> tag that follows it
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print

This will output:

1. row 1, cell 1| 1. row 1, cell 2|
1. row 2, cell 1| 1. row 2, cell 2|
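The code above targets Python 2 and the old BeautifulSoup 3 package. As a rough sketch only, the same idea might look like this under Python 3 with bs4, which is an assumption since the thread does not use it. The helper name table_after_text is hypothetical, the HTML is shortened to one row per table, and the pattern uses \s* instead of \s+ so that text with no space between the words also matches. Searching with string= alone returns the matching text node, so the result does not depend on how the parser nests the unclosed <p> tags.

```python
import re
from bs4 import BeautifulSoup

# Shortened version of the question's HTML (one row per table).
HTML = (
    '<html><body>'
    '<p align="center"><b><font size="2">Table 1</font></b>'
    '<table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr></table>'
    '<p align="center"><b><font size="2">Table 2</font></b>'
    '<table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr></table>'
    '</body></html>'
)


def table_after_text(soup, pattern):
    """Return the rows (lists of cell text) of the first table after matching text."""
    text_node = soup.find(string=pattern)  # first text node matching the regex
    if text_node is None:
        return []
    table = text_node.find_next("table")   # first <table> later in the document
    if table is None:
        return []
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
    ]


soup = BeautifulSoup(HTML, "html.parser")
rows = table_after_text(soup, re.compile(r"Table\s*2", re.IGNORECASE))
for row in rows:
    print("|".join(row))
```

Returning an empty list when the text or the table is missing avoids the AttributeError on None that find and find_next would otherwise cause.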
