Right-to-left mark \u200f causes problems in Python

0 votes

1 answer

1427 views

Asked 2025-04-18 15:51

I'm using urllib to read a web page. The page is UTF-8 encoded and contains the right-to-left mark (U+200F): http://www.charbase.com/200f-unicode-right-to-left-mark

But when I try to write the content to a UTF-8 text file:

with codecs.open("results.html","w","utf-8") as outFile:
    outFile.write(htmlResults) 
    outFile.close()

I get this error: "UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 264: ordinal not in range(128)" ...

How can I fix this?

unicode character encoding utf-8 urllib text file handling encoding issues unicode error right-to-left text

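1 Answer

If htmlResults is a byte string (str), you need to find out which encoding it uses so that you can decode it to Unicode (only Unicode can be encoded). For example, if htmlResults is encoded as iso-8859-1 (i.e. latin-1), then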

tmp = htmlResults.decode('iso-8859-1')

creates a Unicode string in tmp, which you can then write to a file:

import codecs

with codecs.open("results.html", "w", "utf-8") as outFile:
    tmp = htmlResults.decode('iso-8859-1')  # byte string -> Unicode
    outFile.write(tmp)  # the codecs writer encodes it as UTF-8
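The decode-then-write step above can also be sketched for Python 3, where byte strings and text are separate types. This is only an illustration under assumptions: the helper name save_as_utf8, the sample bytes, and the temporary output path are made up for the example, and the source encoding still has to be known in advance.

```python
import io
import os
import tempfile


def save_as_utf8(raw_bytes, source_encoding, path):
    """Decode raw_bytes from source_encoding and write the text to path as UTF-8."""
    text = raw_bytes.decode(source_encoding)  # bytes -> text
    with io.open(path, "w", encoding="utf-8") as out_file:
        out_file.write(text)  # text -> UTF-8 bytes on disk
    return text


# Hypothetical sample: Latin-1 bytes that include 0xfe (the letter thorn).
sample = u"caf\u00e9 \u00fe".encode("iso-8859-1")
out_path = os.path.join(tempfile.mkdtemp(), "results.html")
decoded = save_as_utf8(sample, "iso-8859-1", out_path)

# Read the file back as raw bytes to see what was actually stored.
with open(out_path, "rb") as check:
    written = check.read()
```

Here the single byte 0xfe in the input ends up as the two-byte UTF-8 sequence 0xc3 0xbe in the file, which is the re-encoding that the codecs writer performs in the original snippet.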

If htmlResults is already UTF-8 encoded, you don't need to decode or encode anything:

with open('results.html', 'wb') as fp:  # binary mode: the bytes are written unchanged
    fp.write(htmlResults)

(The with statement closes the file for you automatically, so the explicit outFile.close() call is not needed.)
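Before writing the bytes out untouched, it may be worth checking that they really are UTF-8. Below is a minimal Python 3 sketch of such a check; is_valid_utf8, the sample page, and the temporary path are hypothetical names chosen for the example. Note that the byte 0xfe from the error message can never occur in valid UTF-8, which suggests the page in the question may not be UTF-8 after all.

```python
import os
import tempfile


def is_valid_utf8(raw_bytes):
    """Return True if raw_bytes can be decoded as UTF-8, False otherwise."""
    try:
        raw_bytes.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical page content containing the right-to-left mark U+200F.
page = u"<p>abc\u200fdef</p>".encode("utf-8")

page_ok = is_valid_utf8(page)         # well-formed UTF-8
stray_ok = is_valid_utf8(b"\xfe")     # 0xfe is not allowed anywhere in UTF-8

out_path = os.path.join(tempfile.mkdtemp(), "results.html")
if page_ok:
    # Binary mode stores the bytes exactly as received.
    with open(out_path, "wb") as fp:
        fp.write(page)

with open(out_path, "rb") as check:
    stored = check.read()
```

If the check fails, the data has to be decoded with its real encoding first, as in the iso-8859-1 example earlier in this answer.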

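However, none of this affects how a browser interprets the file. The browser decides based on the Content-Type sent by the web server and on the relevant meta tags. For example, if the file is HTML5, you should have this near the top of the <head> element: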

<meta charset="UTF-8">
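The same Content-Type information can help you pick the encoding to decode with in the first place. The following Python 3 sketch pulls the charset out of a header value using the standard library's email.message.Message; the function name charset_from_content_type and the sample header strings are assumptions made for the example, not part of the original question.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the lowercased charset named in a Content-Type value, or default."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


# Hypothetical header values as a server might send them.
utf8_charset = charset_from_content_type("text/html; charset=UTF-8")
latin1_charset = charset_from_content_type("text/html; charset=ISO-8859-1")
missing_charset = charset_from_content_type("text/html")
```

With urllib in Python 3, the response headers object is itself a Message subclass, so response.headers.get_content_charset() should give the same answer for a live response. When the header names no charset, you would have to fall back to the meta tag or to a guess.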

Answered 2025-04-18 by Python大师