HTML编码和lxml解析

<!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR"> <head> <title>Unicode Chars: 은 —’</title> <meta http-equiv="content-type" content="text/html; charset=utf-8" /> </head> <body></body> </html>

<?xml version="1.0" encoding="utf-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title>Unicode Chars: 은 —’</title> </head> <body></body> </html>

from BeautifulSoup import UnicodeDammit def decode_html(html_string): converted = UnicodeDammit(html_string, isHTML=True) if not converted.unicode: raise UnicodeDecodeError( "Failed to detect encoding, tried [%s]", ', '.join(converted.triedEncodings)) # print converted.originalEncoding return converted.unicode root = lxml.html.fromstring(decode_html(tag_soup))

2条回答

网友

1楼 · 编辑于 2024-04-20 08:43:25

这个问题可能源于这样一个事实，即<meta charset>是一个相对较新的标准（如果我没有弄错的话，那就是HTML5，或者它之前并没有真正使用过）

在更新lxml.html库以反映它之前，您需要特别处理该情况。

如果您只关心ISO-8859-*和UTF-8，并且能够舍弃非ASCII兼容的编码（例如UTF-16或东亚传统字符集），那么您可以对字节字符串执行正则表达式替换，将较新的<meta charset>替换为较旧的http-equiv格式。

否则，如果您需要一个正确的解决方案，您最好自己修补库（并在修复过程中提供修补程序）。您可能需要询问lxml开发人员是否有针对此特定错误的半生不熟的代码，或者他们是否首先在其错误跟踪系统上跟踪该错误。

网友

2楼 · 编辑于 2024-04-20 08:43:25

lxml与处理Unicode相关的several issues。最好（目前）在显式指定字符编码时使用字节：

#!/usr/bin/env python
import glob
from lxml import html
from bs4 import UnicodeDammit

for filename in glob.glob('*.html'):
    with open(filename, 'rb') as file:
        content = file.read()
        doc = UnicodeDammit(content, is_html=True)

    parser = html.HTMLParser(encoding=doc.original_encoding)
    root = html.document_fromstring(content, parser=parser)
    title = root.find('.//title').text_content()
    print(title)

输出

Unicode Chars: 은 —’
Unicode Chars: 은 —’
Unicode Chars: 은 —’

输出

相关问题更多 >

编程相关推荐

热门问题

热门文章