A good way to get the charset/encoding of an HTTP response in Python

Asked 2025-04-17 14:07

I'm looking for a simple way to get the charset/encoding of an HTTP response using Python's urllib2 or any other library.

>>> url = 'http://some.url.value'
>>> request = urllib2.Request(url)
>>> conn = urllib2.urlopen(request)
>>> response_encoding = ?

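I know this information sometimes appears in the 'Content-Type' header, but that header carries other information too, and the charset is embedded in a string that I would have to parse. For example, the Content-Type header returned by Google is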

>>> conn.headers.getheader('content-type')
'text/html; charset=utf-8'

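I can work with that, but I'm not sure the format is always the same. I'm fairly sure the charset can be missing entirely, so I'd have to handle that edge case as well. Pulling 'utf-8' out with some kind of string split doesn't feel like the right way to do this.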

>>> content_type_header = conn.headers.getheader('content-type')
>>> if '=' in content_type_header:
...     charset = content_type_header.split('=')[1]

This kind of code feels like too much work, and I'm not sure it works in every case. Does anyone have a better way to do this?

6 Answers

The requests library makes this easy:

>>> import requests
>>> r = requests.get('http://some.url.value')
>>> r.encoding
'utf-8' # e.g.

Answered 2025-04-17 by Python大师

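If you're somewhat familiar with web development tools like Flask or Werkzeug, you'll be happy to know that the Werkzeug library has a function for exactly this kind of HTTP header parsing. It also accounts for the case where the content type isn't specified at all, which is what you wanted.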

 >>> from werkzeug.http import parse_options_header
 >>> import requests
 >>> url = 'http://some.url.value'
 >>> resp = requests.get(url)
 >>> if resp.status_code == requests.codes.ok:  # compare by value, not identity
 ...     content_type_header = resp.headers.get('content-type')
 ...     print(content_type_header)
 ...
 text/html; charset=utf-8
 >>> parse_options_header(content_type_header)
 ('text/html', {'charset': 'utf-8'})

So you can take the options dict from the parsed result and look up the charset:

 >>> parse_options_header(content_type_header)[1].get('charset')
 'utf-8'

Note that if no charset is provided, the result is:

 >>> parse_options_header('text/html')
 ('text/html', {})

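It even works if you pass only an empty string or an empty dict (at least in the Werkzeug version this answer was written against):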

 >>> parse_options_header({})
 ('', {})
 >>> parse_options_header('')
 ('', {})

So this looks like exactly what you were looking for! If you look at the source code, you'll see it was designed with this use case in mind: https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/http.py#L320-329

def parse_options_header(value):
    """Parse a ``Content-Type`` like header into a tuple with the content
    type and the options:
    >>> parse_options_header('text/html; charset=utf8')
    ('text/html', {'charset': 'utf8'})
    This should not be used to parse ``Cache-Control`` like headers that use
    a slightly different format.  For these headers use the
    :func:`parse_dict_header` function.
    ...

Hope this helps someone someday! :)

Answered 2025-04-17 by Python大师

To parse an HTTP header, you can use cgi.parse_header():

import cgi

_, params = cgi.parse_header('text/html; charset=utf-8')
print(params['charset'])  # -> utf-8

Or you can use the response object:

import urllib2

response = urllib2.urlopen('http://example.com')
response_encoding = response.headers.getparam('charset')
# or in Python 3: response.headers.get_content_charset(default)
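
On recent Python 3 versions the cgi module is no longer a good fit (it was deprecated in 3.11 and removed in 3.13), so here is a minimal sketch of the same header parsing using only email.message.Message from the standard library. The helper name charset_from_content_type is made up for this illustration and is not part of any library.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type header value.

    Hypothetical helper for illustration: returns `default` when the
    header is missing, empty, or carries no charset parameter.
    """
    if not header_value:
        return default
    msg = Message()
    msg['Content-Type'] = header_value
    # get_content_charset() handles quoting and lower-cases the result
    return msg.get_content_charset(default)


print(charset_from_content_type('text/html; charset=UTF-8'))
print(charset_from_content_type('text/html', default='iso-8859-1'))
```

This avoids hand-written string splitting, and it treats a missing charset as an ordinary case rather than an error.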

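In general, the server may lie about the encoding or not report it at all (the default depends on the content type). Sometimes the encoding is specified inside the response body, e.g. in a <meta> tag in an HTML document or in the XML declaration of an XML document. As a last resort, you can guess the encoding from the content itself.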
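
The fallback chain described above could be sketched roughly like this, using only the standard library. The function name guess_encoding, the order of the checks (header, then byte-order mark, then in-document declarations), the 1024-byte window, and the 'utf-8' default are all assumptions made for this example. The regular expressions are a rough approximation, not a real HTML or XML parser.

```python
import codecs
import re
from email.message import Message

# UTF-32 BOMs come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

# Rough patterns for <?xml ... encoding="..."?> and <meta ... charset=...>
_XML_RE = re.compile(br'^\s*<\?xml[^>]*encoding\s*=\s*["\']([\w.-]+)["\']', re.I)
_META_RE = re.compile(br'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.I)


def guess_encoding(content_type, body, default='utf-8'):
    """Hypothetical helper: pick an encoding name for a response body (bytes)."""
    # 1. charset parameter of the Content-Type header, if any
    if content_type:
        msg = Message()
        msg['Content-Type'] = content_type
        charset = msg.get_content_charset()
        if charset:
            return charset
    # 2. byte-order mark at the start of the body
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    # 3. declaration near the start of the document
    head = body[:1024]
    for pattern in (_XML_RE, _META_RE):
        match = pattern.search(head)
        if match:
            return match.group(1).decode('ascii').lower()
    # 4. nothing found: fall back to the caller's default
    return default


print(guess_encoding('text/html', b'<html><head><meta charset="gbk"></head>'))
```

Because the server may lie, some callers prefer to trust the document's own declaration over the header; swapping steps 1 and 3 would give that behavior. For statistical guessing from the bytes themselves, a detector such as chardet would replace step 4.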

You can use the requests library to get Unicode text:

import requests # pip install requests

r = requests.get(url)
unicode_str = r.text # may use `chardet` to auto-detect encoding

Or use BeautifulSoup to parse the HTML (converting it to Unicode along the way):

import urllib2
from bs4 import BeautifulSoup # pip install beautifulsoup4

soup = BeautifulSoup(urllib2.urlopen(url)) # may use `cchardet` for speed
# ...

Or use bs4.UnicodeDammit directly for arbitrary content (not necessarily HTML):

from bs4 import UnicodeDammit

dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
# -> Sacré bleu!
print(dammit.original_encoding)
# -> utf-8

Answered 2025-04-17 by Python大师
