What happens if a UTF-8 encoded HTML file contains non-UTF-8 characters?


Asked 2025-04-17 10:32

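I am trying to use BeautifulSoup to parse an HTML file encoded as UTF-8. Unfortunately, the file contains some characters that are not valid UTF-8, so they display incorrectly. That is fine with me, because I can simply skip those characters.

The problem is that even when I specify the encoding as utf-8 explicitly: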

from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3, which provides fromEncoding
soup = BeautifulSoup(html, fromEncoding='utf-8')

soup.originalEncoding still ends up set to windows-1252, the last-resort default:

print soup.originalEncoding
windows-1252

I checked the BeautifulSoup documentation, which says:

Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:

 - An encoding you pass in as the fromEncoding argument to the soup
   constructor.
 - An encoding discovered in the document itself
 - An encoding sniffed by looking at the first few bytes of the file. If
   an encoding is detected at this stage, it will be one of the UTF-*
   encodings, EBCDIC, or ASCII.
 - An encoding sniffed by the chardet library, if you have it installed.
 - UTF-8
 - Windows-1252

So it looks like it should use the fromEncoding I passed in, not the last encoding in the list.

Here is the raw HTML I am parsing, for reference.


2 Answers

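The page you mention appears to be encoded in UTF-8, but it contains some byte sequences that should never occur in UTF-8. They were probably introduced by a faulty transcoding step, or by data in another encoding being inserted. Either way, they only affect the content data.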
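Those invalid sequences would also explain the windows-1252 result. The following is only a sketch of a "try each candidate strictly, fall through on failure" chain, written for Python 3; it is not Beautiful Soup's actual code, and `first_strict_decode` and the sample bytes are hypothetical names and data made up for illustration.

```python
def first_strict_decode(raw, candidates=("utf-8", "windows-1252")):
    """Return (text, encoding) for the first candidate that decodes strictly."""
    for encoding in candidates:
        try:
            # Strict decoding: a single invalid byte raises UnicodeDecodeError.
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # candidate rejected, try the next one
    raise ValueError("no candidate encoding could decode the input")


clean = "<p>caf\u00e9</p>".encode("utf-8")
# 0xE9 followed by a space is not a valid UTF-8 sequence.
dirty = clean + b"<p>\xe9 stray byte</p>"

print(first_strict_decode(clean)[1])
print(first_strict_decode(dirty)[1])
```

Under this assumption, one stray byte anywhere in the document is enough for the requested UTF-8 candidate to be rejected, and a single-byte encoding that accepts almost any byte wins instead.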

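UTF-8 is "self-synchronizing", so if you skip the bad bytes, everything else should decode fine. And once you reach the HTML markup, everything is in the ASCII range: the characters that matter in markup always appear as single bytes below 0x80.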
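A small Python 3 check of that claim, as a sketch only: the sample document and the injected bytes are made up for illustration, and the "skipping" is done with the standard `errors="ignore"` handler of `bytes.decode`.

```python
original = "<p>caf\u00e9 na\u00efve</p>"
good = original.encode("utf-8")

# Case 1: inject a byte that can never occur in UTF-8 (0xFF) into the text.
injected = good[:5] + b"\xff" + good[5:]

# Case 2: a multi-byte character cut off right before a closing tag.
truncated = b"<p>caf\xc3</p>"

recovered = injected.decode("utf-8", errors="ignore")
salvaged = truncated.decode("utf-8", errors="ignore")

print(recovered)  # the surrounding characters decode normally
print(salvaged)   # the tag after the broken character is intact
```

In both cases only the offending byte is dropped; the decoder resynchronizes at the next valid byte, so the `<` of the following tag is never swallowed.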

Answered 2025-04-17 by Python大师


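If you know what the file's encoding is, I suggest decoding the string before passing it to BeautifulSoup, explicitly ignoring the characters that are not valid UTF-8.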

# Decode first, silently dropping any bytes that are not valid UTF-8
unicode_html = myfile.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(unicode_html)
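For readers on Python 3, the same idea might look like the sketch below. It uses only the standard library; `read_html_lenient` and the sample bytes are hypothetical, and the stream is assumed to be opened in binary mode.

```python
import io


def read_html_lenient(stream, errors="ignore"):
    """Read bytes from a binary stream and decode them as UTF-8.

    errors="ignore" drops invalid bytes; errors="replace" substitutes
    U+FFFD so the damaged spots stay visible.
    """
    return stream.read().decode("utf-8", errors=errors)


# The first e-acute is valid UTF-8; the second is a stray Latin-1 byte.
raw = b"<title>r\xc3\xa9sum\xe9</title>"

dropped = read_html_lenient(io.BytesIO(raw))
marked = read_html_lenient(io.BytesIO(raw), errors="replace")

print(dropped)
print(marked)
```

The resulting `str` can then be handed to the parser in place of the raw bytes, so no encoding detection is needed. Using `errors="replace"` instead of `"ignore"` is a reasonable choice when you want to see where data was lost.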

Answered 2025-04-17 by Python大师
