BS4: Getting the text inside a tag

18 votes

4 answers

69158 views

数据工程师

Asked 2025-04-18 16:53

I'm using the Beautiful Soup library. I have a tag like this:

<li><a href="example"> s.r.o., <small>small</small></a></li>

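I want to extract only the text that sits directly inside the <a> tag, without the content of the nested <small> tag. In other words, I only want "s.r.o.,".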

I tried find('li').text[0], but it did not work.

Is there a command in Beautiful Soup 4 that can do this?

beautiful soup, web parsing, text extraction, tag extraction, html processing

4 Answers

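According to the documentation, a tag's text can be read through its .string attribute. Because .string is None when a tag has more than one child, the nested <small> tag is removed first.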

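from bs4 import BeautifulSoup

soup = BeautifulSoup('<li><a href="example"> s.r.o., <small>small</small></a></li>', 'html.parser')
res = soup.find('a')
res.small.decompose()  # remove the nested <small> tag so <a> has a single text child
print(res.string)
# s.r.o.,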

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigablestring
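
Note that decompose() modifies the parsed tree. If the <small> tag is still needed afterwards, a non-destructive variant might look like the sketch below. It assumes the wanted text is a direct child of <a>; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = '<li><a href="example"> s.r.o., <small>small</small></a></li>'
soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')

# <a> has two children (a text node and <small>), so .string is None here.
string_before = a.string

# Non-destructive alternative: the first text node that is a direct child of <a>.
direct_text = a.find(string=True, recursive=False)

print(string_before)        # None
print(direct_text.strip())  # s.r.o.,
```

The tree is left untouched, so soup.find('small') still returns the nested tag afterwards.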

Answered 2025-04-18 by Python大师

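If you want to loop over and print the text of every anchor tag (link) in an HTML string, this code works. For a live web page, you would first fetch the HTML with urlopen from urllib.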

from bs4 import BeautifulSoup
data = '<li><a href="example">s.r.o., <small>small</small></a></li> <li><a href="example">2nd</a></li> <li><a href="example">3rd</a></li>'
soup = BeautifulSoup(data, 'html.parser')
a_tag = soup('a')  # shorthand for soup.find_all('a')
for tag in a_tag:
    print(tag.contents[0])  # .contents lists a tag's direct children; index 0 is the leading text

Output:

s.r.o.,  
2nd
3rd

a_tag is a list-like ResultSet holding every anchor tag. Collecting them in one list makes it easy to process them in bulk when there are several <a> tags.

>>> print(a_tag)
[<a href="example">s.r.o., <small>small</small></a>, <a href="example">2nd</a>, <a href="example">3rd</a>]
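
One caveat: tag.contents[0] raises IndexError for an empty tag and returns a Tag instead of text when the first child is itself a tag. A more defensive loop could look like the sketch below. The direct_text helper is hypothetical (not part of Beautiful Soup), and the third anchor is added here only to show the edge case.

```python
from bs4 import BeautifulSoup, NavigableString

data = ('<li><a href="example">s.r.o., <small>small</small></a></li> '
        '<li><a href="example">2nd</a></li> '
        '<li><a href="example"><small>only nested</small></a></li>')
soup = BeautifulSoup(data, 'html.parser')

def direct_text(tag):
    """Join the text nodes that are direct children of tag (hypothetical helper)."""
    parts = [child for child in tag.contents if isinstance(child, NavigableString)]
    return ''.join(parts).strip()

# One entry per anchor; anchors without direct text yield an empty string.
texts = [direct_text(a) for a in soup.find_all('a')]
print(texts)  # ['s.r.o.,', '2nd', '']
```

Comments in the markup are also NavigableString subclasses, so they would be included by this filter unless excluded explicitly.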

Answered 2025-04-18 by Python大师

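Use the .children attribute, which is an iterator over a tag's direct children. In Python 3, take its first item with next():

next(soup.find('a').children)
s.r.o.,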
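
Since .children is an iterator, next() raises StopIteration when the tag has no children. A small sketch of passing a default to avoid that, assuming a second, empty anchor that is not in the original question:

```python
from bs4 import BeautifulSoup

html = '<a href="example"> s.r.o., <small>small</small></a><a href="empty"></a>'
soup = BeautifulSoup(html, 'html.parser')
first_a, empty_a = soup.find_all('a')

# .children is an iterator, so next() pulls its first item.
first_child = next(first_a.children)

# The default argument avoids StopIteration when a tag has no children.
missing_child = next(empty_a.children, None)

print(first_child.strip())  # s.r.o.,
print(missing_child)        # None
```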

Answered 2025-04-18 by Python大师

One approach is to take the first element of contents, which here is the text node directly inside the a tag.

>>> from bs4 import BeautifulSoup
>>> data = '<li><a href="example"> s.r.o., <small>small</small></a></li>'
>>> soup = BeautifulSoup(data, 'html.parser')
>>> print(soup.find('a').contents[0])
 s.r.o.,

Another approach is to find the small tag and take its previous sibling.

>>> print(soup.find('small').previous_sibling)
 s.r.o.,

Of course, there are all sorts of other options and crazy approaches:

>>> print(next(soup.find('a').descendants))
 s.r.o., 
>>> print(next(iter(soup.find('a'))))
 s.r.o.,
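
All of these rely on the text being the first child of <a>. As a rough illustration of what happens otherwise, the sketch below uses reordered markup (an assumption, not the markup from the question) where the text follows <small>:

```python
from bs4 import BeautifulSoup, NavigableString

data = '<li><a href="example"><small>small</small> s.r.o., </a></li>'
soup = BeautifulSoup(data, 'html.parser')
a = soup.find('a')

first_child = a.contents[0]                      # the <small> tag, not text
before_small = a.find('small').previous_sibling  # None: nothing precedes <small>
after_small = a.find('small').next_sibling       # the text node after <small>

print(isinstance(first_child, NavigableString))  # False
print(before_small)                              # None
print(after_small.strip())                       # s.r.o.,
```

When the position of the text is not known in advance, filtering the children by type is likely the safer choice.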

Answered 2025-04-18 by Python大师
