python:unicode问题 - 问答 - Python中文网

python:unicode问题

2024-05-15 00:32:16 发布

您现在位置：Python中文网/ 问答频道 /正文

男 | 程序猿一只，喜欢编程写python代码。

我正在尝试解码从文件中提取的字符串：

file = open ("./Downloads/lamp-post.csv", 'r')
data = file.readlines()
data[0]

'\xff\xfeK\x00e\x00y\x00w\x00o\x00r\x00d\x00\t\x00C\x00o\x00m\x00p\x00e\x00t\x00i\x00t\x00i\x00o\x00n\x00\t\x00G\x00l\x00o\x00b\x00a\x00l\x00 \x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\t\x00D\x00e\x00c\x00 \x002\x000\x001\x000\x00\t\x00N\x00o\x00v\x00 \x002\x000\x001\x000\x00\t\x00O\x00c\x00t\x00 \x002\x000\x001\x000\x00\t\x00S\x00e\x00p\x00 \x002\x000\x001\x000\x00\t\x00A\x00u\x00g\x00 \x002\x000\x001\x000\x00\t\x00J\x00u\x00l\x00 \x002\x000\x001\x000\x00\t\x00J\x00u\x00n\x00 \x002\x000\x001\x000\x00\t\x00M\x00a\x00y\x00 \x002\x000\x001\x000\x00\t\x00A\x00p\x00r\x00 \x002\x000\x001\x000\x00\t\x00M\x00a\x00r\x00 \x002\x000\x001\x000\x00\t\x00F\x00e\x00b\x00 \x002\x000\x001\x000\x00\t\x00J\x00a\x00n\x00 \x002\x000\x001\x000\x00\t\x00A\x00d\x00 \x00s\x00h\x00a\x00r\x00e\x00\t\x00S\x00e\x00a\x00r\x00c\x00h\x00 \x00s\x00h\x00a\x00r\x00e\x00\t\x00E\x00s\x00t\x00i\x00m\x00a\x00t\x00e\x00d\x00 \x00A\x00v\x00g\x00.\x00 \x00C\x00P\x00C\x00\t\x00E\x00x\x00t\x00r\x00a\x00c\x00t\x00e\x00d\x00 \x00F\x00r\x00o\x00m\x00 \x00W\x00e\x00b\x00 \x00P\x00a\x00g\x00e\x00\t\x00L\x00o\x00c\x00a\x00l\x00 \x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\n'

添加“忽略”并没有真正的帮助…：

In [69]: data[2] Out[69]: u'\u6700\u6100\u7200\u6400\u6500\u6e00\u2000\u6c00\u6100\u6d00\u7000\u2000\u7000\u6f00\u7300\u7400\u0900\u3000\u2e00\u3900\u3400\u0900\u3800\u3800\u3000\u0900\u2d00\u0900\u3300\u3200\u3000\u0900\u3300\u3900\u3000\u0900\u3300\u3900\u3000\u0900\u3400\u3800\u3000\u0900\u3500\u3900\u3000\u0900\u3500\u3900\u3000\u0900\u3700\u3200\u3000\u0900\u3700\u3200\u3000\u0900\u3300\u3900\u3000\u0900\u3300\u3200\u3000\u0900\u3200\u3600\u3000\u0900\u2d00\u0900\u2d00\u0900\ua300\u3200\u2e00\u3100\u3800\u0900\u2d00\u0900\u3400\u3800\u3000\u0a00'
In [70]: data[2].decode("utf-8", "replace") --------------------------------------------------------------------------- Traceback (most recent call last)
/Users/oleg/ in ()
/opt/local/lib/python2.5/encodings/utf_8.py in decode(input, errors) 14 15 def decode(input, errors='strict'): ---> 16 return codecs.utf_8_decode(input, errors, True) 17 18 class IncrementalEncoder(codecs.IncrementalEncoder):
: 'ascii' codec can't encode characters in position 0-87: ordinal not in range(128)
In [71]:

Tags： x00 x00t x00a x00h x00l x00r x000 x001

3条回答

网友

1楼 · 编辑于 2024-05-15 00:32:16

这个文件是一个UTF-16-LE编码的文件，带有一个初始的BOM。

import codecs

fp= codecs.open("a", "r", "utf-16")
lines= fp.readlines()

网友

2楼 · 编辑于 2024-05-15 00:32:16

编辑

既然你发布了2.7，这就是2.7解决方案：

file = open("./Downloads/lamp-post.csv", "r")
data = [line.decode("utf-16", "replace") for line in file]

忽略不可编码字符：

file = open("./Downloads/lamp-post.csv", "r")
data = [line.decode("utf-16", "ignore") for line in file]

网友

3楼 · 编辑于 2024-05-15 00:32:16

这看起来像是UTF-16数据。所以试试看

data[0].rstrip("\n").decode("utf-16")

编辑（用于更新）：尝试一次解码整个文件，即

data = open(...).read()
data.decode("utf-16")

问题是UTF-16中的换行符是“\n\x00”，但是使用readlines()将在“\n”处拆分，留下“\x00”字符作为下一行。

相关问题更多 >

编程相关推荐

热门问题

热门文章