BeautifulSoup: just get the inside of a tag, no matter how many enclosing tags there are

52 votes

5 answers

106995 views

Asked 2025-04-15 23:27

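I'm trying to use BeautifulSoup to extract all the inner HTML from the <p> elements of a web page. There may be tags nested inside, but I don't care about them; I just want the text inside.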

For example, for:

<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>

I want to extract:

Red
Blue
Yellow
Light green

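Neither .string nor .contents[0] does what I need. .extract() doesn't either, because I don't want to specify the inner tags in advance; I want to handle any tag that may occur.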

Is there a "just get the visible HTML" kind of method in BeautifulSoup?

----UPDATE------

Following the suggestions, I tried:

soup = BeautifulSoup(open("test.html"))
p_tags = soup.findAll('p',text=True)
for i, p_tag in enumerate(p_tags): 
    print str(i) + p_tag

But that doesn't help. It prints:

0Red
1

2Blue
3

4Yellow
5

6Light 
7green
8

Tags: data processing, web scraping, html parsing, beautifulsoup, text extraction, inner tags, visible content

5 Answers

I ran into the same problem and want to share a solution as of 2019. I hope it helps someone.

# importing the modules
from bs4 import BeautifulSoup
from urllib.request import urlopen

# setting up your BeautifulSoup Object
webpage = urlopen("https://insertyourwebpage.com")
soup = BeautifulSoup(webpage.read(), features="lxml")
p_tags = soup.find_all('p')

# print the text of each <p>, with any nested tags stripped out
for each in p_tags:
    print(each.get_text())

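Note that we loop over the list of <p> tags and call get_text() on each one. get_text() strips out any tags nested inside the element, so only the plain text is printed.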

Also:

In bs4, prefer the newer find_all() over the older findAll().
urllib2 has been replaced by urllib.request and urllib.error in Python 3.

Now your output should be:

Red
Blue
Yellow
Light green

I hope this helps anyone looking for an up-to-date solution.
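
To check the get_text() behavior without fetching a page, here is a minimal sketch that runs on the sample markup from the question. It assumes the built-in "html.parser" backend instead of lxml so that nothing beyond bs4 is needed; the variable names html and texts are only illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

# Assumption: the stdlib-backed "html.parser" is used here rather than lxml.
soup = BeautifulSoup(html, "html.parser")

# get_text() concatenates every text node under the tag, ignoring nested markup.
texts = [p.get_text() for p in soup.find_all("p")]

for text in texts:
    print(text)
```

If the text should also be trimmed or joined with a custom separator, get_text() accepts separator and strip arguments, for example p.get_text(" ", strip=True).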

Answered 2025-04-15 by Python大师


The accepted answer is great, but it is 6 years old now, so here is the current Beautiful Soup 4 version of it:

>>> txt = """\
... <p>Red</p>
... <p><i>Blue</i></p>
... <p>Yellow</p>
... <p>Light <b>green</b></p>
... """
>>> from bs4 import BeautifulSoup, __version__
>>> __version__
'4.5.1'
>>> soup = BeautifulSoup(txt, "html.parser")
>>> print("".join(soup.strings))
Red
Blue
Yellow
Light green
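
Joining soup.strings over the whole document only gives one line per paragraph because the sample has a newline between the <p> elements. If one string per <p> is wanted, a possible variation is to join the strings of each tag separately. The sketch below also shows stripped_strings for comparison; the names per_tag and stripped are only illustrative.

```python
from bs4 import BeautifulSoup

html = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

soup = BeautifulSoup(html, "html.parser")

# One string per <p>: join all text nodes found under each tag.
per_tag = ["".join(p.strings) for p in soup.find_all("p")]
print(per_tag)

# stripped_strings trims each text node and skips whitespace-only ones,
# so "Light " and "green" come back as two separate items.
stripped = list(soup.stripped_strings)
print(stripped)
```

Note that stripped_strings works per text node, so it does not keep "Light green" together; the per-tag join (or get_text()) is the better fit for the question as asked.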

Answered 2025-04-15 by Python大师


The short answer is: soup.findAll(text=True)

This has already been answered elsewhere on StackOverflow, and it is also covered in the BeautifulSoup documentation.

UPDATE:

To clarify, here is a working piece of code:

>>> txt = """\
... <p>Red</p>
... <p><i>Blue</i></p>
... <p>Yellow</p>
... <p>Light <b>green</b></p>
... """
>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.0.7a'
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> for node in soup.findAll('p'):
...     print ''.join(node.findAll(text=True))

Red
Blue
Yellow
Light green
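
The session above uses BeautifulSoup 3 on Python 2. A rough Beautiful Soup 4 / Python 3 equivalent might look like the sketch below, assuming the string=True argument (the newer spelling of text=True) and the "html.parser" backend. It also lists every text node in the document, which suggests why the attempt in the question printed nine separate items: the newlines between the <p> elements are text nodes too.

```python
from bs4 import BeautifulSoup

html = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

soup = BeautifulSoup(html, "html.parser")

# Every text node in the document, including the newlines between paragraphs.
all_nodes = [str(node) for node in soup.find_all(string=True)]
print(all_nodes)

# Per-paragraph text: collect the text nodes under each <p> and join them.
joined = ["".join(p.find_all(string=True)) for p in soup.find_all("p")]
for line in joined:
    print(line)
```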

Answered 2025-04-15 by Python大师
