I am trying to parse arbitrary web pages with the requests and BeautifulSoup libraries, using the following code:

    import bs4
    import requests

    # Excerpt from inside a function, hence the bare return.
    try:
        response = requests.get(url)
    except Exception as error:
        return False

    if response.encoding is None:
        soup = bs4.BeautifulSoup(response.text)  # This is line 809
    else:
        soup = bs4.BeautifulSoup(response.text, from_encoding=response.encoding)

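On most web pages this works fine. However, on a small fraction of arbitrary pages (<1%) it crashes; the failure is in `response.text`, at the lines marked `# This is line 809`.

For reference, here is the relevant method of the requests library: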

    @property
    def text(self):
        """Content of the response, in unicode.

        if Response.encoding is None and chardet module is available, encoding
        will be guessed.
        """

        # Try charset from content-type
        content = None
        encoding = self.encoding

        # Fallback to auto-detected encoding.
        if self.encoding is None:
            if chardet is not None:
                encoding = chardet.detect(self.content)['encoding']

        # Decode unicode from given encoding.
        try:
            content = str(self.content, encoding, errors='replace')  # This is line 809
        except LookupError:
            # A LookupError is raised if the encoding was not found which could
            # indicate a misspelling or similar mistake.
            #
            # So we try blindly encoding.
            content = str(self.content, errors='replace')

        return content

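As can be seen, I am not passing an encoding when this error is thrown. How am I using the library incorrectly, and how can I prevent this error? This is on Python 3.2.3, but I get the same result on Python 2.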
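
A minimal sketch of one way the marked line can fail, assuming Python 3 and assuming the detection step produced no encoding at all (`None`). The helper name `decode_like_requests` is hypothetical and only imitates the decode step quoted above; it is not the library's code. The `except LookupError` fallback does not cover this case, because `str()` raises `TypeError`, not `LookupError`, when the encoding argument is `None`.

```python
def decode_like_requests(content, encoding):
    """Imitate the quoted decode step (illustrative only, not library code)."""
    try:
        return str(content, encoding, errors='replace')
    except LookupError:
        # Reached only for an unknown codec name, e.g. a misspelled charset.
        return str(content, errors='replace')


# Assumption: no charset header and no usable guess, so encoding is None.
try:
    decode_like_requests(b'<html></html>', None)
    outcome = 'decoded'
except TypeError:
    outcome = 'TypeError'

print(outcome)  # TypeError
```

An unknown codec name, by contrast, does take the `LookupError` branch and falls back to the default decoding.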
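
This means the server did not send an encoding for the content in its headers, and the chardet library could not determine the encoding of the content either. In fact, you deliberately test for the missing encoding; why try to get the decoded text when no encoding is available?

You can leave the decoding to the BeautifulSoup parser instead, by handing it the undecoded bytes in `response.content` rather than `response.text`.

There is also no need to pass the encoding to BeautifulSoup: if `.text` does not fail, you are already working with Unicode, and BeautifulSoup will ignore the encoding argument anyway.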
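
The advice in this answer might be sketched as follows. This is an illustration under stated assumptions, not the answer's original code: the helper names `pick_markup` and `make_soup` are hypothetical, the explicit `'html.parser'` argument is an added assumption (the question's code names no parser), and the stand-in objects only imitate the three response attributes used here.

```python
from types import SimpleNamespace


def pick_markup(response):
    """Return raw bytes when no encoding is known, decoded text otherwise."""
    if response.encoding is None:
        # No charset was reported: avoid response.text entirely and let
        # BeautifulSoup work out the encoding from the bytes.
        return response.content
    # Encoding known: response.text is already Unicode, so no
    # from_encoding argument is needed.
    return response.text


def make_soup(response):
    """Build a soup from a requests-style response (hypothetical helper)."""
    import bs4  # third-party; imported lazily so pick_markup runs without it
    return bs4.BeautifulSoup(pick_markup(response), 'html.parser')


# Offline stand-ins for a response object, so no network access is needed.
no_charset = SimpleNamespace(encoding=None, content=b'<p>hi</p>', text=None)
with_charset = SimpleNamespace(encoding='utf-8', content=b'<p>hi</p>',
                               text='<p>hi</p>')

print(type(pick_markup(no_charset)).__name__)    # bytes
print(type(pick_markup(with_charset)).__name__)  # str
```

With a real response, something like `soup = make_soup(requests.get(url))` would take the place of the `if`/`else` in the question.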