Finding HTML tags that contain specific text with BeautifulSoup

0 votes

1 answer

2291 views

Asked 2025-04-18 00:28

I am scraping web pages with BeautifulSoup and Python.

For example, I have the following HTML:

<body>
    <h5 class="h-bar">
        <b class="caret"></b>
        Model 11111
        Set Item
    </h5>
</body>

Now I want to find any tag that contains the phrase "Set Item".

I tried the following code:

soup.find_all('h5', text="Set Item")

I expected to get this result:

    <h5 class="h-bar">
        <b class="caret"></b>
        Model 11111
        Set Item
    </h5>

However, this comes back empty. I don't understand why BeautifulSoup finds no match. How can I find the tags that contain the text "Set Item"?

Tags: text matching, data extraction, HTML parsing, beautifulsoup, web crawler, web scraping, tag lookup

1 Answer

I am new to BeautifulSoup too. There is surely a better way, but this approach seems to work:

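from bs4 import BeautifulSoup
import re

# Compile the pattern once instead of on every call
PATTERN = re.compile(r'Set Item')

def predicate(element):
    # Match <h5> tags that have a descendant text node containing "Set Item"
    return element.name == 'h5' and element.find(string=PATTERN) is not None

if __name__ == '__main__':
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(open('index.html').read(), 'html.parser')
    found = soup.find_all(predicate)  # found: a list of elements
    print('Found:', found)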

Please excuse the open().read() shortcut. I was just being lazy.

Output:

Found: [<h5 class="h-bar">
<b class="caret"></b>
        Model 11111
        Set Item
    </h5>]
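
As for why the original call finds nothing: when a tag name is combined with a text filter, BeautifulSoup compares the filter against the tag's .string, and .string is None whenever the tag has more than one child. The <h5> here holds a <b> element plus surrounding text nodes, so an exact match on "Set Item" has nothing to compare against. The sketch below illustrates this under a few assumptions: the question's markup is inlined as a constant named HTML, the built-in "html.parser" is used, and the string= keyword (the newer spelling of text=) is available in the installed bs4.

```python
from bs4 import BeautifulSoup

# The markup from the question, inlined so the example is self-contained
HTML = """
<body>
    <h5 class="h-bar">
        <b class="caret"></b>
        Model 11111
        Set Item
    </h5>
</body>
"""

soup = BeautifulSoup(HTML, "html.parser")
h5 = soup.find("h5")

# The <h5> has several children (whitespace, <b>, a text node),
# so it has no single string and .string is None
print(h5.string)

# A tag-name search with an exact string filter therefore matches nothing
exact_matches = soup.find_all("h5", string="Set Item")
print(exact_matches)

# The phrase is still present in the tag's combined text
print("Set Item" in h5.get_text())
```

Note that even without the <b> child, an exact match would still fail here, because the text node also contains "Model 11111" and surrounding whitespace.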

Update

Actually, the condition does not need a regular expression:

def predicate(e):
    return e and e.name == u'h5' and 'Set Item' in e.text
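
Since the question asks for any tag containing the phrase, not only <h5>, another option might be to search the text nodes directly and then step up to each node's parent. This is only a sketch of that alternative, not the method used above; it assumes the question's markup inlined as a constant named HTML and the built-in "html.parser".

```python
import re
from bs4 import BeautifulSoup

# The markup from the question, inlined so the example is self-contained
HTML = """
<body>
    <h5 class="h-bar">
        <b class="caret"></b>
        Model 11111
        Set Item
    </h5>
</body>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Find every text node that contains the phrase, whatever tag holds it
text_nodes = soup.find_all(string=re.compile(r"Set Item"))

# Step up from each text node to the tag that directly encloses it
tags = [node.parent for node in text_nodes]
tag_names = [tag.name for tag in tags]
print(tag_names)
```

This returns the innermost enclosing tag for each matching text node, so the tag name does not have to be known in advance.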

Answered 2025-04-18 by Python大师
