Extracting tag content based on content value with BeautifulSoup

3 votes

4 answers

3257 views

Asked 2025-04-17 10:31

I have an HTML document like this.

<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>

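I want to extract the contents of the paragraph tag, including the contents of the italic and bold tags, but not the contents of the anchor tag. Ideally I would also like to drop the leading number.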

The result I'm looking for is: Content of the paragraph in italic but not strong.

What would be a good way to do this?

Also, the following snippet raises TypeError: argument of type 'NoneType' is not iterable.

soup = BSoup(page)
for p in soup.findAll('p'):
    if '&nbsp;&nbsp;&nbsp;' in p.string:
        print p

Thanks for any suggestions.

Tags: text-processing, typeerror, html-parsing, beautifulsoup, content-filtering, tag-extraction

4 Answers

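The problem you're hitting is with string. As the documentation explains, string is only available:

if a tag has only one child node, and that child node is a string.

In your case p.string is therefore None, and the `in` test has nothing to iterate over. To get at the tag's contents, use p.contents (a list of the child nodes, tags and strings alike) or p.text (a single string with all tags stripped).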
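To make the difference concrete, here is a minimal sketch. It assumes the newer bs4 package with the built-in "html.parser" backend, not the BeautifulSoup 3 module used elsewhere in this thread, so the import and the get_text() call are assumptions on my part rather than what the asker is running.

```python
from bs4 import BeautifulSoup

html = ('<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> '
        'but not <b> strong </b> <a href="url">ignore</a>.</p>')
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# <p> has several children (strings and tags), so .string is None.
print(p.string)

# <i> has exactly one child, and it is a string, so .string is set.
print(repr(p.i.string))

# .contents lists the direct children, strings and tags alike.
for child in p.contents:
    print(type(child).__name__, repr(str(child)))

# get_text() joins every string in the subtree; note that it still
# includes the text of the <a> tag.
print(p.get_text())
```

Because get_text() keeps the link text, it does not solve the question on its own; the unwanted tag still has to be filtered out.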

In your case you probably want something like this:

>>> ''.join([str(e) for e in soup.p.contents
...          if not isinstance(e, BeautifulSoup.Tag)
...          or e.name != 'a'])
'&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> .'
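For readers on Python 3, the same filtering idea might look like the sketch below. It assumes the bs4 package and its Tag class from bs4.element; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> '
        'but not <b> strong </b> <a href="url">ignore</a>.</p>')
soup = BeautifulSoup(html, "html.parser")

# Keep every direct child of <p> except <a> tags.
# str() re-serialises each node, so <i> and <b> keep their markup.
kept = [str(node) for node in soup.p.contents
        if not (isinstance(node, Tag) and node.name == "a")]
markup = "".join(kept)
print(markup)
```

Unlike the BeautifulSoup 3 session above, bs4 decodes `&nbsp;` while parsing, so the result starts with three U+00A0 characters instead of the literal entity text.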

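If you also need to get rid of the leading `&nbsp;` padding and the number, I'd suggest stripping that part from the final string with a regular expression.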
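A minimal sketch of that regular-expression step, using only the standard library. The pattern and the sample string are my own assumptions: the string is what the paragraph text might look like once the link is gone and `&nbsp;` has been decoded to U+00A0.

```python
import re

# "\xa0" is the character that &nbsp; decodes to.
text = "\xa0\xa0\xa01. Content of the paragraph  in italic  but not  strong  ."

# Hypothetical pattern: leading whitespace, digits, a literal dot, whitespace.
# In Python 3, \s also matches the non-breaking space.
LEADING_NUMBER = re.compile(r"^\s*\d+\.\s*")

cleaned = LEADING_NUMBER.sub("", text)

# Optionally collapse the doubled spaces left where tags used to be.
cleaned = re.sub(r"\s+", " ", cleaned).strip()
print(cleaned)  # Content of the paragraph in italic but not strong .
```

Anchoring the pattern at the start of the string means digits later in the paragraph are left alone.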

Answered 2025-04-17 by Python大师


I think you just need to walk over the contents of the `<p>` tag and collect the strings you want.

If you use the lxml library, you can do this with XPath:

import lxml.html as LH
import re

content = '''\
<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>'''

doc = LH.fromstring(content)
ptext = ''.join(doc.xpath('//p/descendant-or-self::*[not(self::a)]/text()'))
pat = r'^\s*\d+\.\s*'  # leading whitespace (incl. non-breaking spaces), the number, and its literal dot
print(re.sub(pat,'',ptext))

This yields:

Content of the paragraph  in italic  but not  strong  .

Answered 2025-04-17 by Python大师


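Your code fails because tag.string is only set when the tag has exactly one child and that child is a NavigableString. Your `<p>` has several children, so p.string is None and the `in` test raises the TypeError.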
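One way to repair the loop from the question is to test against the paragraph's full text instead of .string. This is only a sketch: it assumes the bs4 package, and the second paragraph in the sample page is made up to show that non-matching paragraphs are skipped.

```python
from bs4 import BeautifulSoup

page = ('<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> '
        'but not <b> strong </b> <a href="url">ignore</a>.</p>'
        '<p>No leading padding here.</p>')
soup = BeautifulSoup(page, "html.parser")

# What "&nbsp;&nbsp;&nbsp;" becomes once the parser has decoded it.
NBSP_PREFIX = "\xa0" * 3

# get_text() always returns a str, so the membership test never sees None.
matches = [p for p in soup.find_all("p") if NBSP_PREFIX in p.get_text()]
for p in matches:
    print(p)
```

The comparison uses the decoded character because the parser has already replaced the entity; searching for the literal text `&nbsp;` would find nothing.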

You can get the result you want by extracting the a tags:

from BeautifulSoup import BeautifulSoup

s = """<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>"""
soup = BeautifulSoup(s, convertEntities=BeautifulSoup.HTML_ENTITIES)

for p in soup.findAll('p'):
    for a in p.findAll('a'):
        a.extract()
    print ''.join(p.findAll(text=True))
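The code above targets BeautifulSoup 3 on Python 2. A rough Python 3 equivalent, assuming the bs4 package (where findAll is spelled find_all and entities are decoded by default), could look like this:

```python
from bs4 import BeautifulSoup

s = ('<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> '
     'but not <b> strong </b> <a href="url">ignore</a>.</p>')
soup = BeautifulSoup(s, "html.parser")

results = []
for p in soup.find_all("p"):
    # find_all returns a list, so detaching nodes while looping is safe.
    for a in p.find_all("a"):
        a.extract()  # remove the <a> subtree from the document
    results.append(p.get_text())

print(results[0])
```

Note that extract() modifies the parsed tree in place, so the links are gone from soup afterwards; parse the page again if the original markup is still needed.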

Answered 2025-04-17 by Python大师
