Character set problems when scraping web pages with Python and Requests

1 vote

1 answer

576 views

Asked 2025-04-17 18:38

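When I try to download a Chinese page (gb2312-encoded according to its meta tag), I get garbage symbols such as ê×××(ò) where the Chinese characters should be. I see this after running the code below and opening the resulting file in gEdit as gb2312.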

Here is the source of the page in question: https://gist.github.com/anonymous/27663069655db7fd7a19 (the site itself is only accessible to educational institutions).

My code:

r = requests.post("http://example.com", data=payload, cookies=cookies)
f = open('myfile.txt', 'w')
f.write(r.text.encode('gb2312',errors="ignore"))
f.close()

The response headers for the page:

{'content-length': '6164', 'x-powered-by': 'ASP.NET', 'date': 'Mon, 11 Mar 2013 05:11:24 GMT', 'cache-control': 'private', 'content-type': 'text/html', 'server': 'Microsoft-IIS/6.0'}

If I try to decode instead of encode, Python raises this error:

f.write(r.text.decode('gb2312',errors="ignore"))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2017-2018: ordinal not in range(128)

Tags: character set, web scraping, encoding, Requests, decode error, mojibake, gb2312, HTTP headers

1 Answer

djc@enrai http $ python
Python 2.7.3 (default, Jun 18 2012, 09:39:59)
[GCC 4.5.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> rsp = urllib.urlopen('https://gist.github.com/anonymous/27663069655db7fd7a19/raw/836a5c55d0f87a2fa5edcc9a14097c945452f520/chinese.html').read()
>>> import chardet
>>> chardet.detect(rsp)
{'confidence': 0.99, 'encoding': 'utf-8'}
>>> rsp.decode('utf-8')
u'\n<HTML><HEAD>(snip)</BODY></HTML>\n'

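So I'd say: don't put too much trust in the declared charset, right? The meta tag claims gb2312, but chardet identifies the bytes in the gist as UTF-8 with 0.99 confidence.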
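One way to act on that is to work from the raw bytes rather than from already-decoded text. With Requests, the raw bytes are in r.content, while r.text is decoded using r.encoding, which falls back to ISO-8859-1 when a text/* Content-Type header carries no charset (as in the headers shown in the question). The sketch below is written for Python 3 and is only a heuristic: decode_body is a hypothetical helper name, and the choice of strict UTF-8 first, then the declared charset, then gb18030 (a superset of gb2312) is an assumption that suits this kind of page, not a general-purpose detector.

```python
def decode_body(raw, declared=None):
    """Decode raw response bytes and return (text, encoding_used).

    Heuristic order (an assumption, not a standard):
      1. strict UTF-8, because non-UTF-8 text rarely passes strict validation
      2. the charset declared by the page or header, if one was given
      3. gb18030, which is a superset of gb2312 and GBK
    Strict decoding makes a wrong guess fail instead of producing mojibake.
    """
    candidates = ["utf-8"]
    if declared:
        candidates.append(declared)
    candidates.append("gb18030")

    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: bytes are not valid in this encoding.
            # LookupError: the declared charset name is unknown to Python.
            continue

    # Last resort: latin-1 maps every byte to a character, so it never fails.
    return raw.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    # Stand-in for r.content from a Requests response.
    sample = "中文".encode("gb2312")
    text, used = decode_body(sample, declared="gb2312")
    print(used, text)
```

With a real response this would be called as decode_body(r.content, declared="gb2312"), and the resulting text written out with open('myfile.txt', 'w', encoding='utf-8') so the file's encoding is explicit instead of depending on what the editor guesses.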

Answered 2025-04-17 by Python大师
