Encoding problem with Python and BeautifulSoup

31 votes

5 answers

79479 views

Asked 2025-04-17 00:28

I'm writing a crawler with Python and BeautifulSoup, and everything was going smoothly until I ran into this site:

http://www.elnorte.ec/

I fetch the content with the requests library:

import requests

r = requests.get('http://www.elnorte.ec/')
content = r.content  # raw bytes, not yet decoded

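At this point, if I print the content variable, all the Spanish special characters look fine. But as soon as I pass the content variable to BeautifulSoup, they get garbled: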

soup = BeautifulSoup(content)
print(soup)
...
<a class="blogCalendarToday" href="/component/blog_calendar/?year=2011&amp;month=08&amp;day=27&amp;modid=203" title="1009 artÃculos en este dÃa">
...

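Evidently all the Spanish special characters (accents and so on) get mangled. I tried content.decode('utf-8') and content.decode('latin-1'), and I also tried BeautifulSoup's fromEncoding parameter, setting fromEncoding='utf-8' and fromEncoding='latin-1', but nothing worked.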
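For readers trying to recognize the symptom: garbling of the "artÃculos" kind is what typically appears when UTF-8 bytes are decoded with a single-byte codec such as windows-1252. The following is only a minimal sketch of that mechanism using a sample word. It does not fetch the site, and the variable names are illustrative.

```python
# Sample word with an accented character (illustrative, not taken from the live page)
original = 'artículos'

# UTF-8 stores 'í' as the two bytes 0xC3 0xAD
utf8_bytes = original.encode('utf-8')

# Decoding those bytes as windows-1252 turns 0xC3 into 'Ã' and 0xAD into an
# invisible soft hyphen, which is why the text shows up as "artÃculos"
garbled = utf8_bytes.decode('cp1252')

# Reversing the wrong step recovers the text, as long as no bytes were lost
repaired = garbled.encode('cp1252').decode('utf-8')

print(repr(garbled))
print(repaired)
```

If the round trip restores the text, the bytes themselves were probably fine and only the decoding step picked the wrong codec.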

Any suggestions would be much appreciated.

Tags: utf-8, data parsing, beautifulsoup, special character encoding issues, latin-1, requests, web scraping

5 Answers

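You can try this approach. It prefers the encoding declared inside the HTML and falls back to the one named in the HTTP header, which should cover most pages.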

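import requests
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector

url = 'http://www.elnorte.ec/'  # the page from the question
USERAGENT = 'Mozilla/5.0'  # placeholder value; use any browser-like string
headers = {"User-Agent": USERAGENT}
resp = requests.get(url, headers=headers)
# Trust the HTTP header only if it explicitly names a charset
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
# Look for an encoding declared inside the document itself (e.g. a <meta> tag)
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, 'lxml', from_encoding=encoding)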

Answered 2025-04-17 by Python大师


In your case the page contains invalid utf-8 data, which confuses BeautifulSoup into thinking the page is encoded as windows-1252. You can try this:

soup = BeautifulSoup.BeautifulSoup(content.decode('utf-8','ignore'))

This strips the invalid bytes from the page source, after which BeautifulSoup can guess the encoding correctly.

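You can replace 'ignore' with 'replace'. Each invalid byte is then decoded as the replacement character U+FFFD instead of being dropped, so you can search the text for that character to see what was discarded.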
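To make the difference between the two error handlers concrete, here is a small sketch on a made-up byte string (not taken from the page in question). It relies only on the standard behaviour of bytes.decode.

```python
# "día" encoded as latin-1: the byte 0xED is not valid UTF-8 in this position
raw = b'd\xeda'

# 'ignore' silently drops the offending byte
dropped = raw.decode('utf-8', 'ignore')

# 'replace' substitutes U+FFFD, so the damage stays visible
marked = raw.decode('utf-8', 'replace')

# Counting U+FFFD tells you how many undecodable sequences were hit
lost = marked.count('\ufffd')

print(repr(dropped), repr(marked), lost)
```

A non-zero count is a hint that the bytes were not UTF-8 to begin with, or that the page mixes encodings.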

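In fact, writing a crawler that guesses the page encoding correctly 100% of the time is very hard (modern browsers are quite good at it). You can use a module like 'chardet', but in your case, for example, it guesses ISO-8859-2, which is also wrong.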

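If you really need to handle whatever page encoding a user might give you, you should build a multi-level detection function (try utf-8 first, then latin1, and so on, as we did in our project), or use detection code from firefox or chromium as a C module.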
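A multi-level detection function of the kind described could look roughly like the sketch below. The function name decode_with_fallback and the candidate order are assumptions made for illustration, not the code from the answerer's project.

```python
def decode_with_fallback(raw, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Try each candidate encoding strictly, in order.

    Returns a (text, encoding_name) tuple for the first candidate that
    decodes without error. The name and candidate list are hypothetical.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Only reached if the candidates omit a catch-all such as latin-1,
    # which maps every possible byte and therefore never fails
    return raw.decode('utf-8', 'replace'), 'utf-8 (lossy)'


text, used = decode_with_fallback('día'.encode('latin-1'))
print(text, used)
```

Strict UTF-8 goes first because random non-UTF-8 bytes rarely form valid UTF-8, whereas latin-1 accepts anything and so only makes sense as the last resort. A successful decode does not prove the guess is right, only that it is not obviously wrong.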

Answered 2025-04-17 by Python大师


You could try:

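# Python 2 with BeautifulSoup 3
import urllib
import BeautifulSoup

r = urllib.urlopen('http://www.elnorte.ec/')
x = BeautifulSoup.BeautifulSoup(r.read())  # call read(); passing r.read would hand over the method itself
r.close()

print x.prettify('latin-1')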

I got the correct output. Oh, and in this particular case you could also use x.__str__(encoding='latin1').

I think this is because the content is encoded as ISO-8859-1 (or ISO-8859-15), while the page's meta tag incorrectly declares "UTF-8".

Can you confirm?
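One way to check a suspicion like this without a browser is to compare the charset the page declares with whether the bytes actually decode under it. The sketch below uses only the standard library. The helper names and the simplified regular expression are assumptions for illustration, and the sample bytes are made up rather than fetched from the site.

```python
import re

# Simplified pattern for a charset declared in a <meta> tag (an assumption:
# real pages can declare encodings in more ways than this covers)
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def declared_charset(raw):
    """Return the charset named in a <meta> tag near the top, or None."""
    match = META_CHARSET.search(raw[:4096])
    return match.group(1).decode('ascii').lower() if match else None


def declaration_is_consistent(raw):
    """True if the bytes decode strictly under the declared charset,
    False if they do not, None if nothing is declared."""
    name = declared_charset(raw)
    if name is None:
        return None
    try:
        raw.decode(name)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


# Made-up page: declares UTF-8 but contains a latin-1 encoded 'í' (0xED)
page = b'<meta charset="utf-8"><p>d\xeda</p>'
print(declared_charset(page), declaration_is_consistent(page))
```

A False result supports the "wrong meta tag" explanation. A True result is weaker evidence, since single-byte encodings such as latin-1 accept any byte sequence.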

Answered 2025-04-17 by Python大师
