Using Beautiful Soup to find and regex-replace text that is not inside <a></a>

3 votes

1 answer

5517 views

Asked 2025-04-16 21:54

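I am using Beautiful Soup to parse HTML, and I want to find all text that is not inside a link (that is, not inside an <a> tag).

I wrote some code that finds all the links, but I can't get the opposite to work.

How can I modify this code to extract only the plain text, so that I can run some find-and-replace operations on it and modify the parsed content?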

for a in soup.findAll('a', href=True):
    print(a['href'])

Edit:

For example:

<html><body>
 <div> <a href="www.test1.com/identify">test1</a> </div>
 <div><br></div>
 <div><a href="www.test2.com/identify">test2</a></div>
 <div><br></div><div><br></div>
 <div>
   This should be identified 

   Identify me 1 

   Identify me 2 
   <p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
 </div>
</body></html>

Output:

This should be identified 
Identify me 1 
Identify me 2
This paragraph should be identified.

I am doing this to find the text that is not inside <a> tags, then find "Identify" in it and replace it with "Replaced".

So the final output would look like this:

<html><body>
 <div> <a href="www.test1.com/identify">test1</a> </div>
 <div><br></div>
 <div><a href="www.test2.com/identify">test2</a></div>
 <div><br></div><div><br></div>
 <div>
   This should be identified 

   Replaced me 1 

   Replaced me 2 
   <p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
 </div>
</body></html>

Thanks for your time!


1 Answer

If I understand correctly, you want to get the text inside an a element that has an href attribute. To get an element's text, you can use its .text attribute.

>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed('<a href="http://something.com">this is some text</a>')
>>> soup.findAll('a', href=True)[0]['href']
u'http://something.com'
>>> soup.findAll('a', href=True)[0].text
u'this is some text'
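
The session above uses the Beautiful Soup 3 API (the BeautifulSoup module and soup.feed), which is no longer maintained. A rough equivalent with Beautiful Soup 4 might look like the sketch below. It assumes the bs4 package is installed and uses the built-in "html.parser" backend; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = '<a href="http://something.com">this is some text</a>'
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute
link = soup.find_all("a", href=True)[0]
href = link["href"]        # attribute access works like a dict lookup
text = link.get_text()     # all text contained in the element

print(href)
print(text)
```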

Edit

This finds every text node that contains "identified" (case-insensitively):

>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed(yourhtml)
>>> [txt for txt in soup.findAll(text=True) if 'identified' in txt.lower()]
[u'\n   This should be identified \n\n   Identify me 1 \n\n   Identify me 2 \n   ', u' identified ']

The returned objects are of type BeautifulSoup.NavigableString. To check whether the parent of a text node is an a element, test txt.parent.name == 'a'.
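
Note that txt.parent.name == 'a' only inspects the direct parent, so text nested deeper inside a link (for example inside a b element within an a element) would not be caught. A minimal sketch of collecting the text outside links with Beautiful Soup 4 follows. It assumes bs4 is installed; text_outside_links is a hypothetical helper name and the HTML is a shortened version of the sample from the question.

```python
from bs4 import BeautifulSoup

html = """<div><a href="www.test1.com/identify">test1</a></div>
<div>Identify me 1 <p>This paragraph should be<b> identified </b>.</p></div>"""


def text_outside_links(soup):
    """Return the stripped text nodes that have no <a> ancestor."""
    result = []
    for node in soup.find_all(string=True):
        # find_parent walks up through all ancestors, not just the direct parent
        inside_link = node.find_parent("a") is not None
        if not inside_link and node.strip():
            result.append(node.strip())
    return result


soup = BeautifulSoup(html, "html.parser")
texts = text_outside_links(soup)
print(texts)
```

Whitespace-only nodes between tags are skipped here so the result contains only visible text; drop the node.strip() check if those nodes matter.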

Edit 2:

Here is an example that uses a regular expression and replacement.

import BeautifulSoup
import re

soup = BeautifulSoup.BeautifulSoup()
html = '''
<html><body>
 <div> <a href="www.test1.com/identify">test1</a> </div>
 <div><br></div>
 <div><a href="www.test2.com/identify">test2</a></div>
 <div><br></div><div><br></div>
 <div>
   This should be identified 

   Identify me 1 

   Identify me 2 
   <p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
 </div>
</body></html>
'''
soup.feed(html)
for txt in soup.findAll(text=True):
    if re.search('identi',txt,re.I) and txt.parent.name != 'a':
        newtext = re.sub(r'identi(\w+)', r'replace\1', txt.lower())
        txt.replaceWith(newtext)
print(soup)

Output:


<html><body>
<div> <a href="www.test1.com/identify">test1</a> </div>
<div><br /></div>
<div><a href="www.test2.com/identify">test2</a></div>
<div><br /></div><div><br /></div>
<div>
   this should be replacefied 

   replacefy me 1 

   replacefy me 2 
   <p id="firstpara" align="center"> This paragraph should be<b> replacefied </b>.</p>
</div>
</body></html>
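
The example above lowercases each matching text node and rewrites every word that starts with "identi", which is why the output contains "replacefied". The question asked for something narrower: replace only "Identify" with "Replaced" and leave everything else, including case, untouched. A sketch of that variant with Beautiful Soup 4 is shown below. It assumes bs4 is installed; replace_outside_links is a hypothetical helper, and the sample HTML is shortened from the question.

```python
import re
from bs4 import BeautifulSoup


def replace_outside_links(html, pattern, repl):
    """Apply re.sub to every text node that is not inside an <a> element."""
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.find_all(string=True):
        if node.find_parent("a") is not None:
            continue  # leave link text alone
        new_text = re.sub(pattern, repl, node)
        if new_text != node:
            # swap the text node in place; surrounding markup is untouched
            node.replace_with(new_text)
    return str(soup)


sample = (
    '<div><a href="www.test1.com/identify">Identify link</a></div>'
    '<div>This should be identified Identify me 1 Identify me 2</div>'
)
result = replace_outside_links(sample, r"Identify", "Replaced")
print(result)
```

The pattern is matched case-sensitively, so the lowercase "identified" is left as it is, and href values are never touched because only text nodes are visited.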

Answered 2025-04-16 by Python大师
