urlopen, BeautifulSoup and UTF-8 issue

2 votes

2 answers

10452 views

Asked 2025-04-15 13:49

I am just trying to retrieve a web page, but somehow a foreign character is embedded in the HTML file. The character is not visible when I use "View Source".

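import urllib2
# Assumed import: BeautifulSoup 3, matching the import used in the second answer.
from BeautifulSoup import BeautifulSoup

isbn = 9780141187983
url = "http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ean=%s" % isbn
opener = urllib2.build_opener()
url_opener = opener.open(url)
page = url_opener.read()  # raw bytes of the response
html = BeautifulSoup(page)
html  # This line causes the error.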

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 21555: ordinal not in range(128)

I also tried...

html = BeautifulSoup(page.encode('utf-8'))

How can I read this web page into BeautifulSoup without getting this error?

2 Answers

You can try the following:

import codecs
from bs4 import BeautifulSoup

# filename is the path to a locally saved copy of the page
with codecs.open(filename, 'r', 'utf-8') as f:
    soup = BeautifulSoup(f.read(), "html.parser")

I ran into a similar problem with bs4.
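
To illustrate why reading with an explicit encoding helps, here is a minimal sketch that uses only the standard library. It is written so that it should also run on Python 3, unlike the Python 2 code in the question, and it swaps codecs.open for io.open, which accepts an encoding argument in the same way. The file name sample.html, its contents, and the helper read_utf8 are made up for illustration; BeautifulSoup is left out so the snippet runs without third-party packages.

```python
import io
import os
import tempfile


def read_utf8(path):
    """Read a file as text, decoding its bytes as UTF-8."""
    with io.open(path, 'r', encoding='utf-8') as f:
        return f.read()


# Hypothetical sample file; it stands in for a saved copy of the page.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.html')
with open(sample_path, 'wb') as f:
    # Write raw UTF-8 bytes that contain the character from the traceback.
    f.write(u'<p>price:\xa0$10</p>'.encode('utf-8'))

text = read_utf8(sample_path)
print(repr(text))
```

The decoded text could then be handed to BeautifulSoup(text, "html.parser"), as in the snippet above.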

Answered 2025-04-15 by Python大师


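This error most likely occurs when you try to print the contents of the BeautifulSoup object. If I am right, this happens automatically in the interactive console, because the console echoes the value of any bare expression you type.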

# This code works fine. Note that we assign the BeautifulSoup object
# to a name, which keeps the console from printing it immediately.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(u'\xa0')

# This will probably show the error you saw
print soup

# And this would probably be fine
print soup.encode('utf-8')
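
The same failure can be reproduced without BeautifulSoup, which may help confirm that the problem lies in the output encoding and not in the parsing. The sketch below is an illustration only (the helper name find_non_ascii is made up, and it is written to run on Python 3 as well): it looks up the character from the traceback, checks that the ASCII codec rejects it while UTF-8 accepts it, and lists where non-ASCII characters sit in a string.

```python
import unicodedata


def find_non_ascii(text):
    """Return (index, character name) pairs for every non-ASCII character."""
    return [(i, unicodedata.name(ch, 'UNKNOWN'))
            for i, ch in enumerate(text) if ord(ch) > 127]


char = u'\xa0'
char_name = unicodedata.name(char)
print(char_name)

# Encoding to ASCII fails for any code point above 127.
try:
    char.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 succeeds and yields a two-byte sequence.
utf8_bytes = char.encode('utf-8')

# Locate the offending character inside a longer string.
hits = find_non_ascii(u'a\xa0b')
print(ascii_ok, repr(utf8_bytes), hits)
```

U+00A0 is a no-break space, which renders like an ordinary space. That would explain why it does not stand out in "View Source".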

Answered 2025-04-15 by Python大师
