How do I find a tag by specific text with Beautiful Soup?

49 votes
8 answers
156757 views
Asked 2025-04-17 11:04

How can I find the text I am looking for in the following HTML (line breaks are marked with \n)?

...
<tr>
  <td class="pos">\n
      "Some text:"\n
      <br>\n
      <strong>some value</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Fixed text:"\n
      <br>\n
      <strong>text I am looking for</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Some other text:"\n
      <br>\n
      <strong>some other value</strong>\n
  </td>
</tr>
...

The code below returns only the first value found, so I need some way to filter on "Fixed text:".

result = soup.find('td', {'class' :'pos'}).find('strong').text

Update: if I use the following code:

title = soup.find('td', text = re.compile(ur'Fixed text:(.*)', re.DOTALL), attrs = {'class': 'pos'})
self.response.out.write(str(title.string).decode('utf8'))

then it returns only Fixed text:, not the text marked up with <strong> in the same element.

8 answers

13

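In bs4 4.7.1 and later, you can use the :contains pseudo-class to select the td element that contains your search string. You can then use a descendant combinator (the space in the selector below) to reach the strong element inside that td, which holds the target text: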

from bs4 import BeautifulSoup as bs

html = '''
<tr>
  <td class="pos">\n
      "Some text:"\n
      <br>\n
      <strong>some value</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Fixed text:"\n
      <br>\n
      <strong>text I am looking for</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Some other text:"\n
      <br>\n
      <strong>some other value</strong>\n
  </td>
</tr>'''
soup = bs(html, 'lxml')
print(soup.select_one('td:contains("Fixed text:") strong').text)

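As of soupsieve 2.1.0:

NEW: To avoid conflicts with future CSS specification changes, non-standard pseudo-classes now start with the :-soup- prefix. As a consequence, :contains() is now known as :-soup-contains(). For a time the old :contains() form will still be accepted, with a warning that users should migrate to :-soup-contains().

NEW: Added a non-standard pseudo-class, :-soup-contains-own(), which works like :-soup-contains() but only looks at text nodes directly associated with the current element, not those of its descendants.

Quoted from @facelessuser's GitHub page.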
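
With the renamed pseudo-classes, the selector above might be written as in the sketch below. It assumes bs4 with soupsieve 2.1 or later installed; the html string is a condensed copy of the question's markup, and the built-in html.parser is used in place of lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Condensed copy of the question's HTML (assumption: whitespace removed for brevity)
html = """
<table>
<tr><td class="pos">Some text:<br><strong>some value</strong></td></tr>
<tr><td class="pos">Fixed text:<br><strong>text I am looking for</strong></td></tr>
<tr><td class="pos">Some other text:<br><strong>some other value</strong></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# :-soup-contains() checks the text of the td and all of its descendants
strong = soup.select_one('td.pos:-soup-contains("Fixed text:") > strong')
result = strong.get_text(strip=True)

# :-soup-contains-own() checks only the text nodes that belong to the td itself
own = soup.select_one('td.pos:-soup-contains-own("Fixed text:") > strong')
own_result = own.get_text(strip=True)

print(result)
print(own_result)
```

Here both selectors pick the same cell, because "Fixed text:" is a direct text node of the td. The -own variant is the stricter choice when the label could also appear inside a nested element.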

32

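This post did not give me the answer directly, but it led me to it, so I thought I should share.

The challenge here is that BeautifulSoup.find behaves inconsistently when searching with and without text.

Note: if you have BeautifulSoup installed, you can test this locally: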

curl https://gist.githubusercontent.com/RichardBronosky/4060082/raw/test.py | python

Code: https://gist.github.com/4060082

# Taken from https://gist.github.com/4060082
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
from pprint import pprint
import re

soup = BeautifulSoup(urlopen('https://gist.githubusercontent.com/RichardBronosky/4060082/raw/test.html').read())
# I'm going to assume that Peter knew that re.compile is meant to cache a computation result for a performance benefit. However, I'm going to do that explicitly here to be very clear.
pattern = re.compile('Fixed text')

# Peter's suggestion here returns a list of what appear to be strings
columns = soup.findAll('td', text=pattern, attrs={'class' : 'pos'})
# ...but it is actually a BeautifulSoup.NavigableString
print type(columns[0])
#>> <class 'BeautifulSoup.NavigableString'>

# you can reach the tag using one of the convenience attributes seen here
pprint(columns[0].__dict__)
#>> {'next': <br />,
#>>  'nextSibling': <br />,
#>>  'parent': <td class="pos">\n
#>>       "Fixed text:"\n
#>>       <br />\n
#>>       <strong>text I am looking for</strong>\n
#>>   </td>,
#>>  'previous': <td class="pos">\n
#>>       "Fixed text:"\n
#>>       <br />\n
#>>       <strong>text I am looking for</strong>\n
#>>   </td>,
#>>  'previousSibling': None}

# I feel that 'parent' is safer to use than 'previous' based on http://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names
# So, if you want to find the 'text' in the 'strong' element...
pprint([t.parent.find('strong').text for t in soup.findAll('td', text=pattern, attrs={'class' : 'pos'})])
#>> [u'text I am looking for']

# Here is what we have learned:
print soup.find('strong')
#>> <strong>some value</strong>
print soup.find('strong', text='some value')
#>> u'some value'
print soup.find('strong', text='some value').parent
#>> <strong>some value</strong>
print soup.find('strong', text='some value') == soup.find('strong')
#>> False
print soup.find('strong', text='some value') == soup.find('strong').text
#>> True
print soup.find('strong', text='some value').parent == soup.find('strong')
#>> True

It is probably too late to help the asker, but I hope they will accept this as the answer, since it clears up every question about finding by text.
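
For readers on bs4 and Python 3, the same walk from the matched text node up to its cell might look like the sketch below. It assumes bs4 4.4 or later (for the string= argument) and inlines a condensed copy of the question's HTML rather than fetching the gist. The variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Condensed copy of the question's HTML (assumption: not the gist's test.html)
html = """
<table>
<tr><td class="pos">Some text:<br><strong>some value</strong></td></tr>
<tr><td class="pos">Fixed text:<br><strong>text I am looking for</strong></td></tr>
<tr><td class="pos">Some other text:<br><strong>some other value</strong></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"Fixed text")

values = []
# With no tag name given, find_all(string=...) returns the matching text nodes
for text_node in soup.find_all(string=pattern):
    # Walk up from the text node to the enclosing td.pos cell
    cell = text_node.find_parent("td", class_="pos")
    if cell is None:
        continue
    strong = cell.find("strong")
    if strong is not None:
        values.append(strong.get_text(strip=True))

print(values)
```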

54

You can pass a regular expression to the text parameter of findAll, like this:

import BeautifulSoup
import re

columns = soup.findAll('td', text = re.compile('your regex here'), attrs = {'class' : 'pos'})
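
One caveat if you port this to bs4: as far as I can tell, when a tag name and a text filter are combined, bs4 compares the pattern with the tag's .string, which is None for a td that has several children, so the call above may return nothing there. A hedged sketch of one alternative, assuming bs4 and a condensed copy of the question's HTML, filters the cells by their full text instead:

```python
import re
from bs4 import BeautifulSoup

# Condensed copy of the question's HTML (assumption: whitespace removed for brevity)
html = """
<table>
<tr><td class="pos">Some text:<br><strong>some value</strong></td></tr>
<tr><td class="pos">Fixed text:<br><strong>text I am looking for</strong></td></tr>
<tr><td class="pos">Some other text:<br><strong>some other value</strong></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"Fixed text:")

# Keep only the td.pos cells whose combined text matches the pattern
cells = [
    td for td in soup.find_all("td", class_="pos")
    if pattern.search(td.get_text())
]

# Pull the <strong> value out of each matching cell
values = [td.find("strong").get_text(strip=True) for td in cells]
print(values)
```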
