Replace all 'a' tags with their contents using BeautifulSoup's replaceWith

0 votes

1 answer

1392 views

Asked 2025-04-17 18:44

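Edit: Essentially, I want something like decompose(), but instead of removing a tag together with everything inside it, I want to replace the tag with its contents.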
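
What this edit describes seems to match bs4's `Tag.unwrap()`, which removes a tag but leaves its children in place. A minimal sketch, assuming an inline snippet in place of the real page (the `html` string and the `result` variable are made up for illustration):

```python
from bs4 import BeautifulSoup

# Illustrative snippet; it stands in for the downloaded page.
html = ('<p>Go <a href="index.htm">Home</a> or read the '
        '<a href="instruct.htm"><b>Instructions</b></a>.</p>')
soup = BeautifulSoup(html, 'html.parser')

# unwrap() removes each <a> tag itself and keeps its children in the tree,
# so nested markup such as <b> survives.
for link in soup.find_all('a'):
    link.unwrap()

result = str(soup)
print(result)
```

Unlike replacing the tag with a plain string, this keeps any markup nested inside the link, which may or may not be what you want before writing to CSV.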

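I want to replace every 'a' tag in an HTML document with its contents, as a string, so that I can write the HTML to a CSV file more easily. However, I'm stuck on the replacement step. I've been trying to do this with BeautifulSoup's replace_with(), but the result isn't what I expected.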

# Import modules (Python 3)
from urllib.request import urlopen

from bs4 import BeautifulSoup

# URL to soup
URL = 'http://www.barringtonhills-il.gov/foia/ordinances_12.htm'
html_content = urlopen(URL).read()
soup = BeautifulSoup(html_content, 'html.parser')

# Replaces links with link text
links = soup.find_all('a')
for link in links:
    linkText = link.contents[0]
    linkTextCln = '%s' % (linkText.string)
    if linkTextCln != 'None':
        link.replaceWith(linkTextCln)
        print(link)

This prints:

<a href="index.htm">Home</a>
<a href="instruct.htm">Instructions</a>
<a href="requests.htm">FOIA Requests</a>
<a href="kiosk.htm">FOIA Kiosk</a>
<a href="geninfo.htm">Government Profile</a>
etc etc etc

But the result I expected is:

Home
Instructions
FOIA Requests
FOIA Kiosk
Government Profile
etc etc etc

Does anyone know why replaceWith isn't working as expected? Is there a better way to do this?

string manipulation, HTML parsing, data cleaning, beautifulsoup, document processing, CSV files, web scraping, tag replacement

1 Answer

I believe that in bs4 the method is now called replace_with. That said, if you only want to output the contents of the tags, you can use the following code:

from bs4 import BeautifulSoup

s = '''
<a href="index.htm">Home</a>
<a href="instruct.htm">Instructions</a>
<a href="requests.htm">FOIA Requests</a>
<a href="kiosk.htm">FOIA Kiosk</a>
<a href="geninfo.htm">Government Profile</a>
'''
soup = BeautifulSoup(s, 'html.parser')

# Print the text of each link
for tag in soup.find_all('a'):
    print(tag.string)

Output:

Home
Instructions
FOIA Requests
FOIA Kiosk
Government Profile
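
As for why `replaceWith` appeared to do nothing: as far as I can tell, the replacement does take effect in `soup`, but the loop variable `link` still refers to the `<a>` element that was just detached from the tree, so printing it shows the old tag. A small sketch to check this, using a shortened version of the snippet above (the `detached`, `remaining_links` and `text` names are only for illustration):

```python
from bs4 import BeautifulSoup

s = '''
<a href="index.htm">Home</a>
<a href="instruct.htm">Instructions</a>
'''
soup = BeautifulSoup(s, 'html.parser')

detached = []
for link in soup.find_all('a'):
    # get_text() also covers links whose text sits inside nested tags.
    link.replace_with(link.get_text())
    # The element removed from the tree is still an <a> tag object.
    detached.append(str(link))

remaining_links = soup.find_all('a')
text = str(soup)

print(detached)              # the old, detached <a> tags
print(len(remaining_links))  # number of <a> tags left in the soup
print(text)                  # the modified document
```

So printing `soup` (or writing `str(soup)` to the CSV) after the loop, rather than printing `link` inside it, should show the link text without the tags.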

Answered 2025-04-17 by Python大师
