HTML encoding and lxml parsing

Question

I'm trying to resolve some encoding issues that come up when scraping HTML with lxml. Here are three sample HTML documents I've encountered:

1.

<!DOCTYPE html>
<html lang='en'>
<head>
   <title>Unicode Chars: 은 —’</title>
   <meta charset='utf-8'>
</head>
<body></body>
</html>

2.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR">
<head>
    <title>Unicode Chars: 은 —’</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body></body>
</html>

3.

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Unicode Chars: 은 —’</title>
</head>
<body></body>
</html>

Here is my basic script:

from lxml.html import fromstring
...

doc = fromstring(raw_html)
title = doc.xpath('//title/text()')[0]
print(title)

The results are:

Unicode Chars: ì ââ
Unicode Chars: 은 —’
Unicode Chars: 은 —’

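Clearly, the problem with sample 1 is the missing <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> tag. The solution from this link correctly recognizes sample 1 as utf-8, so it is functionally equivalent to my original code.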
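
As a minimal illustration (plain Python 3, no lxml involved), the garbled title from sample 1 is consistent with UTF-8 bytes being decoded as a single-byte encoding. The choice of Latin-1 here is an assumption about the parser's fallback, not something the output proves:

```python
# The non-ASCII part of the title used in all three samples.
title = '은 —’'

# Encode as UTF-8 (what the file contains), then decode as Latin-1
# (ASSUMPTION: the fallback the parser used when it found no charset).
utf8_bytes = title.encode('utf-8')
garbled = utf8_bytes.decode('latin-1')

# Each UTF-8 lead byte becomes a visible character (ì, â); the
# continuation bytes become invisible C1 control characters.
print(repr(garbled))

# The damage is reversible as long as no bytes were dropped.
restored = garbled.encode('latin-1').decode('utf-8')
print(restored)
```

With the control characters hidden by the terminal, this would display as the "ì ââ" seen above.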

The lxml docs seem to contradict themselves:

In this link, the example seems to suggest we should use UnicodeDammit to decode the markup to unicode.

from BeautifulSoup import UnicodeDammit

def decode_html(html_string):
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode

root = lxml.html.fromstring(decode_html(tag_soup))

Yet in this link, it says:

[Y]ou will get errors when you try [to parse] HTML data in a unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.

If I try to follow the first suggestion in the lxml docs, my code is now:

from lxml.html import fromstring
from bs4 import UnicodeDammit
...
dammit = UnicodeDammit(raw_html)
doc = fromstring(dammit.unicode_markup)
title = doc.xpath('//title/text()')[0]
print(title)

The results I now get are:

Unicode Chars: 은 —’
Unicode Chars: 은 —’
ValueError: Unicode strings with encoding declaration are not supported.

Sample 1 now works correctly, but sample 3 raises an error because of the <?xml version="1.0" encoding="utf-8"?> declaration.
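
One way to read this error: lxml rejects an already-decoded string that still carries an encoding declaration. A minimal sketch, assuming the XML declaration is the only thing that triggers the ValueError (samples 1 and 2 suggest the meta tags alone do not), is to drop the declaration from the decoded text before parsing. `strip_xml_declaration` is a hypothetical helper, not part of lxml or bs4:

```python
import re

# Matches a leading XML declaration such as
# <?xml version="1.0" encoding="utf-8"?> (optionally preceded by whitespace).
_XML_DECLARATION = re.compile(r'^\s*<\?xml[^>]*\?>', re.IGNORECASE)


def strip_xml_declaration(markup):
    """Return decoded markup without its leading XML declaration.

    Hypothetical helper for illustration; markup without a
    declaration is returned unchanged.
    """
    return _XML_DECLARATION.sub('', markup, count=1)


sample_3_start = ('<?xml version="1.0" encoding="utf-8"?>\n'
                  '<!DOCTYPE html>\n<html></html>')
cleaned = strip_xml_declaration(sample_3_start)
print(cleaned)
```

The cleaned string could then be passed to fromstring in place of dammit.unicode_markup. Whether this is preferable to parsing the raw bytes is a judgment call.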

Is there a correct way to handle all of these cases? Is there a better solution than the following?

dammit = UnicodeDammit(raw_html)
try:
    doc = fromstring(dammit.unicode_markup)
except ValueError:
    doc = fromstring(raw_html)
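
An alternative sketch that avoids the try/except, in line with the lxml advice quoted above not to decode before parsing: hand the parser the raw bytes and state the encoding explicitly. `parse_html_bytes` is a hypothetical helper, and the `utf-8` default is an assumption. The encoding might instead come from a detector, for example the `original_encoding` attribute of bs4's UnicodeDammit. Written for Python 3:

```python
from lxml import html


def parse_html_bytes(raw_html, encoding='utf-8'):
    """Parse undecoded HTML bytes with an explicitly stated encoding.

    Hypothetical helper; the default encoding is an assumption and
    should be replaced by a detected or known value where possible.
    """
    parser = html.HTMLParser(encoding=encoding)
    return html.fromstring(raw_html, parser=parser)


# Sample 1 from the question, as UTF-8 bytes (the case that was garbled).
sample_1 = (b"<!DOCTYPE html><html lang='en'><head>"
            b"<title>Unicode Chars: \xec\x9d\x80 \xe2\x80\x94\xe2\x80\x99</title>"
            b"<meta charset='utf-8'></head><body></body></html>")

doc = parse_html_bytes(sample_1)
title = doc.xpath('//title/text()')[0]
print(title)
```

Because the input stays as bytes, an XML declaration like the one in sample 3 should not trigger the ValueError, which applies to unicode strings.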
