Using BeautifulSoup to find an HTML tag that contains certain text


Asked 2025-04-15 11:36

I'm trying to get the elements in an HTML document that contain the following text pattern: #\S{11}

<h2> this is cool #12345678901 </h2>

The element above can be matched with:

soup('h2',text=re.compile(r' #\S{11}'))

The results look something like this:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

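I'm able to get all the text that matches (see the line above). But what I want is the parent element of the matched text, so I can use it as a starting point for traversing the document tree. In this case, I'd like all the h2 elements to be returned, not the text matches.

Any ideas?

3 Answers

With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected: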

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
soup('h2', text=re.compile(r' #\S{11}'))

This returns [<h2> this is cool #12345678901 </h2>].
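For reference, here is a minimal sketch of the same search against a current bs4 release, assuming Python 3 and the built-in html.parser. The second h2 in the sample markup is added for contrast and is not from the question. It uses the string= keyword, which the bs4 documentation describes as the newer name for text=.

```python
import re
from bs4 import BeautifulSoup

# Sample markup: the first h2 is from the question, the second is illustrative.
html = "<h2> this is cool #12345678901 </h2><h2>nothing here</h2>"
soup = BeautifulSoup(html, "html.parser")

pattern = re.compile(r" #\S{11}")

# Tag name plus a string filter: bs4 returns the matching Tag objects,
# not the text nodes inside them.
matches = soup.find_all("h2", string=pattern)

for tag in matches:
    print(tag.name, repr(tag.string))
```

Note that this depends on each h2 holding a single text node. If an h2 contains nested tags, its .string is None, so this filter will most likely skip it.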

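Answered 2025-04-15 by Python大师

BeautifulSoup search operations return a list of BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes available to you. Of these attributes, parent is favored over previous because of changes in BS4.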

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
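A rough bs4 / Python 3 equivalent of the idea above, offered as a sketch rather than a line-for-line port. Without a tag name, find(string=...) yields the text node, and .parent climbs to the enclosing tag. The shortened markup and the variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup, NavigableString, Tag

html = "<h2>this is cool #12345678901</h2><h2>this is nothing</h2>"
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"cool")

# Searching by string alone returns the text node itself.
text_node = soup.find(string=pattern)
print(type(text_node).__name__)

# .parent is the element that directly contains that text node.
parent_tag = text_node.parent
print(parent_tag)
```

From parent_tag you can carry on navigating with the usual Tag API, for example find_next_sibling() or find_all().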

Answered 2025-04-15 by Python大师

from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)


for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent

Output:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
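Since the search above is not limited to a tag name, the pattern also matches the text inside the h1. If only h2 elements are wanted, one option is to filter on the parent's name. Below is a sketch for bs4 on Python 3, reusing the sample markup; the name h2_parents is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# Find every matching text node, then keep only those whose parent is an h2.
h2_parents = [
    node.parent
    for node in soup.find_all(string=pattern)
    if node.parent.name == "h2"
]

for tag in h2_parents:
    print(tag)
```

Each entry in h2_parents is a Tag, so it can serve directly as the starting point for walking the tree.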

Answered 2025-04-15 by Python大师