How do I remove whitespace between HTML tags using Python's BeautifulSoup?

1 vote

3 answers

2574 views

Asked 2025-04-16 16:07

I've run into a problem: when there is whitespace between HTML tags, my code doesn't output the text I want.

I expected this output:

year|salary|bonus
2005|100,000|50,000
2006|120,000|80,000

But what I actually get is:

 |salary|bonus
2005|100,000|50,000
2006|120,000|80,000

The text "year" is not output.

Here is my code:

from BeautifulSoup import BeautifulSoup
import re


html = '<html><body><table><tr><td> <p>year</p></td><td><p>salary</p></td><td>bonus</td></tr><tr><td>2005</td><td>100,000</td><td>50,000</td></tr><tr><td>2006</td><td>120,000</td><td>80,000</td></tr></table></html>'
soup = BeautifulSoup(html)
table = soup.find('table')
rows = table.findAll('tr')

store=[]

for tr in rows:
    cols = tr.findAll('td')
    row = []
    for td in cols:
        try:
            row.append(''.join(td.find(text=True)))
        except Exception:
            row.append('')
    store.append('|'.join(filter(None, row)))
print '\n'.join(store)

The problem is the space here:

"<td> <p>year</p></td>"

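Is there a way to strip this whitespace when extracting HTML from the web?

Tags: string processing, web scraping, data cleaning, HTML, beautifulsoup, whitespace removal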

3 Answers

Use

html = re.sub(r'\s\s+', '', html)
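One caveat worth checking: \s\s+ only matches runs of two or more whitespace characters, so the single space in "<td> <p>" from the question would be left in place. Below is a minimal sketch of a variant that removes whitespace only where it sits between a closing ">" and an opening "<". The helper name strip_inter_tag_whitespace is made up for illustration, and this is a regex shortcut rather than real HTML parsing, so it would also collapse meaningful whitespace in places such as <pre> blocks.

```python
import re


def strip_inter_tag_whitespace(html):
    """Remove whitespace-only gaps between adjacent tags (hypothetical helper)."""
    # r'>\s+<' matches one or more whitespace characters between two tags.
    return re.sub(r'>\s+<', '><', html)


html = '<tr><td> <p>year</p></td>\n  <td><p>salary</p></td></tr>'

# The pattern from the answer needs at least two whitespace characters,
# so the single space after <td> survives.
print(re.sub(r'\s\s+', '', html))

# The variant also removes the single space; text inside tags is untouched.
print(strip_inter_tag_whitespace(html))
```

Because the pattern is anchored on the surrounding angle brackets, spaces inside cell text (for example "<p>a b</p>") are not affected.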

Answered 2025-04-16 by Python大师


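As @Herman suggested, you should use Tag.text to get the text of the tag you are parsing.

To elaborate on why Tag.find() didn't do what you expected: BeautifulSoup's Tag.find() and Tag.findAll() are very similar. In fact, Tag.find() is implemented by calling Tag.findAll() with the keyword argument limit set to 1. Tag.findAll() then walks recursively down the tag tree and returns as soon as it finds text that satisfies the text argument. Because you set text to True, the string u' ' technically satisfies that condition, so that is what Tag.find() returns.

Indeed, if you print td.findAll(text=True, limit=2), you will see that "year" is returned as well. You can also set text to a regular expression that skips whitespace-only strings, so you could search with td.find(text=re.compile('[\S\w]')).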
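To make the ordering of text nodes concrete, here is a small sketch that lists the text nodes of the problem cell in document order. It deliberately uses the standard library's html.parser instead of BeautifulSoup, so it only mirrors the behaviour described above rather than calling Tag.find() itself. The names TextNodeCollector and text_nodes are made up for illustration. It also uses the shorter pattern \S, which matches the same strings as [\S\w] because every word character is already a non-whitespace character.

```python
import re
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect every text node in document order (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        # Called once for each run of text between tags, including
        # runs that consist of whitespace only.
        self.nodes.append(data)


def text_nodes(fragment):
    collector = TextNodeCollector()
    collector.feed(fragment)
    collector.close()
    return collector.nodes


nodes = text_nodes('<td> <p>year</p></td>')
print(nodes)

# Roughly what text=True with limit=1 picks: the first node, blank or not.
first_any = nodes[0]

# Roughly what a regex filter picks: the first node with a non-space character.
non_blank = re.compile(r'\S')
first_real = next(n for n in nodes if non_blank.search(n))

print(repr(first_any), repr(first_real))
```

The whitespace-only node comes first, which is why asking for just the first text node yields a blank string instead of "year".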

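I also noticed that you are using store.append('|'.join(filter(None, row))). I think you should use the csv module instead, specifically csv.writer. It takes care of any pipe-related problems you might hit in the HTML you are parsing (for example, a cell that itself contains a "|"), and it keeps your code tidier.

Here is an example: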

import csv
import re
from cStringIO import StringIO

from BeautifulSoup import BeautifulSoup


html = ('<html><body><table><tr><td> <p>year</p></td><td><p>salary</p></td>'
        '<td>bonus</td></tr><tr><td>2005</td><td>100,000</td><td>50,000</td>'
        '</tr><tr><td>2006</td><td>120,000</td><td>80,000</td></tr></table>'
        '</html>')
soup = BeautifulSoup(html)
table = soup.find('table')
rows = table.findAll('tr')

output = StringIO()
writer = csv.writer(output, delimiter='|')

for tr in rows:
    cols = tr.findAll('td')
    row = []
    for td in cols:
        row.append(td.text)

    writer.writerow(filter(None, row))

print output.getvalue()

The output is:

year|salary|bonus
2005|100,000|50,000
2006|120,000|80,000
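The point about pipes can be illustrated with a short standard-library sketch, written for Python 3 (io.StringIO rather than cStringIO). The cell value '50,000|deferred' is invented purely to show what happens when the data itself contains the delimiter.

```python
import csv
import io

rows = [
    ['year', 'salary', 'bonus'],
    ['2005', '100,000', '50,000|deferred'],  # hypothetical cell containing a pipe
]

# Naive joining: the embedded pipe is indistinguishable from a column separator.
naive = '|'.join(rows[1])
print(naive)

# csv.writer quotes any field that contains the delimiter.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='|', lineterminator='\n')
writer.writerows(rows)
written = buffer.getvalue()
print(written)

# Reading the text back with the same delimiter recovers the original columns.
parsed = list(csv.reader(io.StringIO(written), delimiter='|'))
print(parsed == rows)
```

With plain '|'.join() the second row appears to have four columns, whereas the csv round trip keeps it at three.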

Answered 2025-04-16 by Python大师


Instead of row.append(''.join(td.find(text=True))), use the following (td.text is already a string, so no join is needed):

row.append(td.text)

Output:

year|salary|bonus
2005|100,000|50,000
2006|120,000|80,000

Answered 2025-04-16 by Python大师
