Fixing double-encoded UTF-8 returned by Requests

Asked 2025-04-17 21:12

I'm using Requests to fetch an Atom response, but I've run into an encoding problem.

When I fetch it with curl, it displays correctly and I can see the Ä:

<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:zapi="http://zotero.org/ns/api">
<title>The power broker : Robert Moses and the fall of New York</title>
(snip)
<content zapi:type="citation" type="xhtml">
    <span xmlns="http://www.w3.org/1999/xhtml">(Robert Ä. Caro 1974)</span>
</content>
</entry>

But when I fetch it with requests 2.2.1 on Python 2.7.4, I get this Unicode response:

import requests
r = requests.get(url)
r.text
u'<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:zapi="http://zotero.org/ns/api">
<title>The power broker : Robert Moses and the fall of New York</title>
(snip)
<content zapi:type="citation" type="xhtml">
    <span xmlns="http://www.w3.org/1999/xhtml">(Robert \u0102\x84. Caro 1974)</span>
</content>
</entry>'

Of course, encoding it as utf-8 doesn't give me my Ä back. What can I do?

Tags: data processing, unicode, utf-8, encoding, requests, atom

2 Answers

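Are you sure curl isn't substituting a "known" letter for the one that shows up as "Ă"? Searching for the author's name suggests that the "A" should be a plain "A" (Robert Allan Caro). On its own, \x84 is a C1 control character in Unicode (see http://www.fileformat.info/info/unicode/category/Cc/list.htm), while in Windows-1252 the byte 0x84 is a low double quotation mark („). So this could be a recognition error from scanning "Robert 'A.' Caro" somewhere, in which case what the server holds is exactly what you see in Python.

Try curl with the --raw option to check the actual content.

(I experimented with this string a bit, and I think this hypothesis is more likely than double encoding.)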
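The two hypotheses can also be checked offline, without contacting the server. The sketch below takes the UTF-8 bytes of the Ä that curl shows and decodes them with a few single-byte charsets, looking for one that reproduces the u'\u0102\x84' seen in Python. The function name `matching_charsets` and the candidate list are assumptions made for illustration; the sample values come from the question.

```python
# -*- coding: utf-8 -*-
# Offline check: which single-byte charset turns UTF-8 "Ä" into u'\u0102\x84'?

def matching_charsets(raw, observed, candidates):
    """Return the charsets that decode `raw` into exactly `observed`."""
    matches = []
    for name in candidates:
        try:
            decoded = raw.decode(name)
        except UnicodeDecodeError:
            # This charset cannot decode the bytes at all; skip it.
            continue
        if decoded == observed:
            matches.append(name)
    return matches

expected = u"\u00c4"                   # the Ä shown by curl
utf8_bytes = expected.encode("utf-8")  # b'\xc3\x84'
observed = u"\u0102\x84"               # what r.text contained

# Candidate charsets are a guess; extend the list as needed.
candidates = ["iso-8859-1", "cp1252", "iso-8859-2"]
print(matching_charsets(utf8_bytes, observed, candidates))
```

If a charset is reported, the string in Python is consistent with UTF-8 bytes that were decoded using that charset, which would point to a decoding mix-up rather than a scanning error. It does not prove what the server actually sent; the raw bytes still have to be inspected for that.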

Answered 2025-04-17 by Python大师

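Since you didn't include the response headers returned by the server, I can't be sure what is going on. My guess is that the server returns a UTF-8 encoded string but declares the wrong charset: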

Content-Type: text/html; charset=iso-8859-1

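In that case, Requests receives the body as a byte string (called str in Python 2) and decodes it into a unicode string using that wrong charset. Re-encoding the unicode string as latin1 and then decoding it as utf8 gives back the original string: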

r.text.encode('iso-8859-1').decode('utf8')
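The round trip can be wrapped in a small helper. This is only a sketch: `undo_wrong_decode` is a hypothetical name, and the sample string is copied from the question. One caveat is that U+0102 (Ă) has no ISO-8859-1 byte, so for this particular string the latin1 guess raises UnicodeEncodeError; the sketch then falls back to ISO-8859-2, which is inferred from the output in the question and is not confirmed by any response header.

```python
# -*- coding: utf-8 -*-

def undo_wrong_decode(text, wrong_charset):
    """Re-encode with the charset that was wrongly applied, then decode as UTF-8."""
    return text.encode(wrong_charset).decode("utf-8")

# Garbled fragment as it appears in r.text in the question.
garbled = u"(Robert \u0102\x84. Caro 1974)"

try:
    # The charset guessed in the answer above.
    fixed = undo_wrong_decode(garbled, "iso-8859-1")
except UnicodeEncodeError:
    # U+0102 cannot be encoded as ISO-8859-1, so try ISO-8859-2 instead
    # (an assumption based on the characters seen, not on the headers).
    fixed = undo_wrong_decode(garbled, "iso-8859-2")

print(fixed)
```

With a real response object, `garbled` would be `r.text`. The charset passed in has to be the one that was actually used for the wrong decode, otherwise the re-encode either fails or produces different bytes.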

Alternatively, r.content gives you a str (the raw bytes), which you can decode as utf8 yourself, applying the correct encoding by hand.
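A minimal sketch of the r.content route follows. To keep it runnable without a network call, it uses a stand-in class, `FakeResponse`, which is hypothetical and mimics only the `content` and `encoding` attributes of a Requests response; `body_as_utf8` is likewise a name chosen for illustration. It assumes the body really is UTF-8, as the XML declaration in the question claims.

```python
# -*- coding: utf-8 -*-

class FakeResponse(object):
    """Hypothetical stand-in exposing two attributes of requests.Response."""

    def __init__(self, content, encoding):
        self.content = content    # raw body bytes, as r.content would return
        self.encoding = encoding  # charset taken from the headers (possibly wrong)

def body_as_utf8(response):
    """Ignore the declared charset and decode the raw bytes as UTF-8."""
    return response.content.decode("utf-8")

# Simulate a server that sends UTF-8 bytes but declares ISO-8859-1.
r = FakeResponse(u"(Robert \u00c4. Caro 1974)".encode("utf-8"), "iso-8859-1")

text = body_as_utf8(r)
print(text)
```

With an actual Requests response, the same effect can be had by assigning `r.encoding = 'utf-8'` before reading `r.text`, since Requests uses `r.encoding` when it builds the text.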

Answered 2025-04-17 by Python大师
