Parsing HTML with BeautifulSoup and removing tags

3 votes

1 answer

1805 views

Asked 2025-04-16 03:23

I'm new to Python and I'm using BeautifulSoup to parse a website and then extract data from it. I have the following code:

for line in raw_data: #raw_data is the parsed html separated into smaller blocks
    d = {}
    d['name'] = line.find('div', {'class':'torrentname'}).find('a')
    print(d['name'])

<a href="/ubuntu-9-10-desktop-i386-t3144211.html">
<strong class="red">Ubuntu</strong> 9.10 desktop (i386)</a>

Normally I could extract 'Ubuntu 9.10 desktop (i386)' by writing:

d['name'] = line.find('div', {'class':'torrentname'}).find('a').string

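But in this case it returns None because of the <strong> HTML tags. Is there a way to strip the <strong> tags and then use .string, or is there a better approach? I tried BeautifulSoup's extract() function, but couldn't get it to work.

Edit: I just realized that my solution doesn't work when there are two sets of <strong> tags, because the space between the words is lost. Is there a way to fix this?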

Tags: data extraction, HTML parsing, whitespace handling, beautifulsoup, web scraping, tag removal, extract function, strong tags

1 Answer

Use the ".text" attribute:

d['name'] = line.find('div', {'class':'torrentname'}).find('a').text
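To see why .string comes back as None here while .text does not, the following is a minimal sketch. It assumes Beautiful Soup 4 (the bs4 package) with the built-in "html.parser" backend, which may differ from the version used in the question; the html string is a hand-made stand-in for one block of raw_data.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one block of raw_data from the question.
html = (
    '<div class="torrentname">'
    '<a href="/ubuntu-9-10-desktop-i386-t3144211.html">'
    '<strong class="red">Ubuntu</strong> 9.10 desktop (i386)</a>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
anchor = soup.find("div", {"class": "torrentname"}).find("a")

# <a> has two children (the <strong> tag and a trailing text node),
# so .string cannot pick a single string and returns None.
string_value = anchor.string

# .text concatenates every text node under <a>, ignoring the markup.
text_value = anchor.text

print(string_value)  # None
print(text_value)    # Ubuntu 9.10 desktop (i386)
```

In bs4, .text is shorthand for get_text(), so either spelling should behave the same way here.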

Or join the results of findAll(text=True):

anchor = line.find('div', {'class':'torrentname'}).find('a')
d['name'] = ''.join(anchor.findAll(text=True))
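For the case raised in the edit to the question (two sets of <strong> tags), here is a rough sketch of the join approach plus one way to normalize the spacing. It assumes Beautiful Soup 4, where find_all(string=True) is the current spelling of findAll(text=True); the html string is invented for illustration and does not come from the original site.

```python
from bs4 import BeautifulSoup

# Hypothetical anchor with two <strong> elements separated by a space.
html = '<a href="#"><strong>Ubuntu</strong> <strong>Linux</strong> 9.10 desktop</a>'
anchor = BeautifulSoup(html, "html.parser").find("a")

# Collect every text node under <a> and join them as-is.
pieces = anchor.find_all(string=True)
joined = "".join(pieces)

# Alternative: put a space between text nodes, then collapse runs of
# whitespace, so words stay separated even if the markup has no spaces.
normalized = " ".join(anchor.get_text(" ").split())

print(normalized)  # Ubuntu Linux 9.10 desktop
```

The normalized form trades exact whitespace fidelity for predictable single spaces, which is usually what is wanted for a display name.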

Answered 2025-04-16 by Python大师
