BeautifulSoup does not return Unicode

4 votes

2 answers

4179 views

Asked 2025-04-16 00:55

I am using Beautiful Soup to scrape data. The BS documentation says that BS should always return Unicode, but I can't seem to get Unicode out of it. Here is a code snippet:

import urllib2
from libs.BeautifulSoup import BeautifulSoup

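# Fetch and parse the data
url = 'http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents2007?skin=print.pattern'

# read() returns the raw response body as a byte string (str in Python 2)
data = urllib2.urlopen(url).read()
print 'Type of fetched HTML : %s' % type(data)

# originalEncoding reports the encoding BS detected for the input document
soup = BeautifulSoup(data)
print 'Encoding of souped up HTML : %s' % soup.originalEncoding

# renderContents() serializes the tag's contents back into a string
table = soup.table
print type(table.renderContents())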

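The raw data returned from the page is a string. BS reports the original encoding as ISO-8859-1. I thought BS automatically converted everything to Unicode, so why, when I do this: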

table = soup.table
print type(table.renderContents())

...does it give me a string object rather than Unicode?

How do I get Unicode objects out of BS?

I'm really confused. Any help? Thanks in advance!

2 Answers

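originalEncoding is exactly that: the encoding of the original document. The value does not change even though BS (Beautiful Soup) stores everything internally as Unicode. When you traverse the tree, all text nodes are Unicode, all tags are Unicode, and so on, unless you convert them to something else (for example with print, str, prettify, or renderContents).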

Try something like this:

soup = BeautifulSoup(data)
print type(soup.contents[0])

Unfortunately, everything else you have done so far used one of the very few methods in BS that convert to a string.
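To make the distinction concrete, here is a minimal sketch of the same decode-once, encode-on-output flow using only the standard library. It does not call Beautiful Soup at all: the sample bytes and the variable names are made up for illustration, and source_encoding merely plays the role of soup.originalEncoding. It should run on both Python 2 and Python 3.

```python
from __future__ import print_function

# Made-up sample: bytes as a server might send them, encoded as ISO-8859-1.
raw = u'<td>caf\xe9</td>'.encode('iso-8859-1')

# On input, a parser decodes the bytes once and keeps text internally.
source_encoding = 'iso-8859-1'  # stand-in for soup.originalEncoding
text = raw.decode(source_encoding)

# On output, a rendering method encodes the text again (UTF-8 here),
# so the result is a byte string even though the internal text is Unicode.
rendered = text.encode('utf-8')

print(type(raw), type(text), type(rendered))
print(source_encoding)  # unchanged: it only describes the input
```

The recorded input encoding and the type of the rendered output are independent of each other, which is why originalEncoding can say ISO-8859-1 while renderContents still hands back a byte string.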

Answered 2025-04-16 by Python大师


As you may have noticed, renderContents returns a UTF-8 encoded string by default. If you really want a Unicode string representing the whole document, you can also use unicode(soup), or decode the output of renderContents/prettify with unicode(soup.prettify(), "utf-8").
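The decoding step can be sketched without Beautiful Soup. The byte string below is hypothetical sample data standing in for what renderContents or prettify would return, not real library output. In Python 2, unicode(rendered, "utf-8") is equivalent to rendered.decode("utf-8"), and the latter spelling also works on Python 3.

```python
from __future__ import print_function

# Hypothetical stand-in for the UTF-8 byte string that renderContents()
# or prettify() returns by default; not real Beautiful Soup output.
rendered = u'<td>caf\xe9</td>'.encode('utf-8')

# Same result as unicode(rendered, "utf-8") in Python 2.
decoded = rendered.decode('utf-8')

# Decoding with the page's original encoding would be a mistake here:
# the rendered bytes are UTF-8, so ISO-8859-1 yields mojibake.
wrong = rendered.decode('iso-8859-1')

print(decoded == u'<td>caf\xe9</td>')  # True
print(wrong == decoded)                # False
```

The point of the second decode is that the codec must match the encoding of the rendered output (UTF-8 by default), not the value reported by originalEncoding.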

Related

How to render the contents of a tag as Unicode in BeautifulSoup?

Answered 2025-04-16 by Python大师

