How do I get `http-equiv` in Python?


数据工程师

Asked 2025-04-16 08:01

I'm using urllib2.urlopen to fetch a URL and read some header information, such as 'charset' and 'content-length'.

However, some pages declare their charset like this instead:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

urllib2 does not parse this for me.

Is there a built-in tool I can use to get the http-equiv information?

Edit:

This is what I use to parse the charset out of a page:

    import lxml.html

    def charset_from_meta(page_source):
        # Parse the page and select the content attribute of the
        # Content-Type meta tag, matching http-equiv case-insensitively.
        elem = lxml.html.fromstring(page_source)
        content_type = elem.xpath(
            ".//meta[translate(@http-equiv, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')='content-type']/@content")
        if content_type:
            content_type = content_type[0]
            for frag in content_type.split(';'):
                frag = frag.strip().lower()
                i = frag.find('charset=')
                if i > -1:
                    return frag[i+8:]  # 8 == len('charset=')

        return None

How can I improve this? Can I precompile the XPath query?

4 Answers

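Writing your own HTML parser is much harder than you might think, and the earlier answers rightly suggest using a library for the job. Rather than BeautifulSoup or lxml, though, I recommend html5lib. It is the parser that most closely mimics how browsers parse pages, particularly when it comes to encodings: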

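Parsed trees are always Unicode. However, a wide variety of input encodings is supported. The encoding of the document is determined as follows:

The encoding may be specified explicitly by passing the name of the encoding as the encoding parameter to HTMLParser.parse.

If no encoding is specified, the parser will attempt to detect the encoding from a `<meta>` element in the first 512 bytes of the document (this is only a partial implementation of the current HTML 5 specification).

If no encoding can be found and the chardet library is available, the parser will attempt to guess the encoding from the byte pattern.

If all else fails, the parser will use the default encoding (usually Windows-1252).

Source: http://code.google.com/p/html5lib/wiki/UserDocumentation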
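The detection order quoted above can be sketched with the standard library alone. This is only a rough illustration, not html5lib's actual implementation: `sniff_encoding`, `META_CHARSET_RE`, and the `override` and `default` parameters are hypothetical names, the regex is a simplification of what a real parser does, and the chardet step is left out. It assumes Python 3 and raw bytes as input.

```python
import codecs
import re

# Simplified pattern: finds charset=... inside a <meta ...> tag.
# It covers both <meta charset="..."> and the http-equiv form.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def sniff_encoding(raw, override=None, default="windows-1252"):
    """Pick an encoding for raw HTML bytes, following the quoted order."""
    # 1. An explicitly supplied encoding wins.
    if override:
        return override
    # 2. Look for a <meta> declaration in the first 512 bytes only.
    match = META_CHARSET_RE.search(raw[:512])
    if match:
        label = match.group(1).decode("ascii")
        try:
            return codecs.lookup(label).name  # normalised codec name
        except LookupError:
            pass  # unknown label: fall through to the default
    # 3. (A chardet-based guess would go here; omitted in this sketch.)
    # 4. Fall back to the default encoding.
    return default
```

Calling `sniff_encoding(page_bytes)` on a page that contains the `<meta>` tag from the question would pick up the declared charset, while a page without any declaration falls back to the default.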

Answered 2025-04-16 by Python大师


I needed to parse this (along with a few other things) for my online HTTP fetcher. I use lxml to parse the page and get the meta equiv headers, roughly like this:

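    from lxml.html import parse

    doc = parse(url)  # url: address of the page to fetch
    nodes = doc.findall(".//meta")
    for node in nodes:
        name = node.attrib.get('name')
        meta_id = node.attrib.get('id')  # renamed to avoid shadowing id()
        equiv = node.attrib.get('http-equiv')
        # Most meta tags have no http-equiv attribute, so guard against None
        if equiv and equiv.lower() == 'content-type':
            pass  # ... do your thing ...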

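You could write a more elaborate query that fetches only the relevant tag directly (by specifying name= in the query), but in my case I parse all the meta tags. I'll leave that as an exercise for you; see the relevant lxml documentation.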
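As one possible take on that exercise, the query can be narrowed to the `Content-Type` tag and compiled once for reuse, which also covers the question about precompiling XPath. This is a sketch that assumes lxml is installed; `FIND_CONTENT_TYPE` and `meta_content_type` are hypothetical names introduced here, while `lxml.etree.XPath` is lxml's class for precompiled expressions.

```python
import lxml.html
from lxml import etree

# Compiled once at import time; the resulting object is callable and
# can be applied to any number of parsed documents.
# XPath 1.0 has no lower-case(), so translate() does the case folding.
FIND_CONTENT_TYPE = etree.XPath(
    ".//meta[translate(@http-equiv, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
    "'abcdefghijklmnopqrstuvwxyz')='content-type']/@content")


def meta_content_type(page_source):
    """Return the content attribute of the Content-Type meta tag, or None."""
    root = lxml.html.fromstring(page_source)
    values = FIND_CONTENT_TYPE(root)
    return str(values[0]) if values else None
```

The compiled expression only pays off when many pages are processed; for a single page, calling `root.xpath(...)` directly is likely just as good.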

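BeautifulSoup 3 (the `BeautifulSoup` package) is considered somewhat outdated and is no longer actively developed; development continues in its successor, `bs4`.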

Answered 2025-04-16 by Python大师


Finding 'http-equiv' with BeautifulSoup:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    f = urllib2.urlopen("http://example.com")
    soup = BeautifulSoup(f)  # trust BeautifulSoup to parse the encoding
    for meta in soup.findAll('meta', attrs={
            'http-equiv': lambda x: x and x.lower() == 'content-type'}):
        print("content-type: %r" % meta['content'])
        break
    else:
        print('no content-type found')

    # NOTE: strings in the soup are Unicode, but we can ask about the charset
    #       declared in the HTML
    print("encoding: %s" % (soup.declaredHTMLEncoding,))
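If a third-party dependency is not wanted, roughly the same lookup can be done with the standard library's `html.parser`. This is a sketch assuming Python 3 and an already decoded string; `MetaEquivParser` and `http_equiv_headers` are hypothetical names introduced here, not part of any library.

```python
from html.parser import HTMLParser


class MetaEquivParser(HTMLParser):
    """Collect <meta http-equiv="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.headers = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before this call.
        if tag != "meta":
            return
        attributes = dict(attrs)
        equiv = attributes.get("http-equiv")
        content = attributes.get("content")
        if equiv and content is not None:
            # Keep the first declaration if a header is repeated.
            self.headers.setdefault(equiv.lower(), content)


def http_equiv_headers(html_text):
    """Return a dict mapping lower-cased http-equiv names to content."""
    parser = MetaEquivParser()
    parser.feed(html_text)
    parser.close()
    return parser.headers
```

`http_equiv_headers(page_text).get("content-type")` then gives the same string that `meta['content']` yields in the BeautifulSoup version, or `None` when the page declares nothing.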

Answered 2025-04-16 by Python大师
