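UnicodeDecodeError when dumping console input to JSON
I entered some Cyrillic text from the console, and when I try to convert it to JSON I get the error exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte. I can't figure out why, because it does not happen every time, even though the text is always Cyrillic.
This is the part of the code where I read the input: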
item['title'] = raw_input('Title: ')
item['description'] = raw_input('Description: ')
This is the line where I convert the dictionary to JSON:
line = json.dumps(dict(item), encoding='utf8') + "\n"
The item is not a dictionary but an object, so I have to convert it to a dictionary first. Here is the full traceback:
Traceback (most recent call last):
  File "/home/dmitry/.virtualenvs/test_scrapy/local/lib/python2.7/site-packages/scrapy/middleware.py", line 62, in _process_chain
    return process_chain(self.methods[methodname], obj, *args)
  File "/home/dmitry/.virtualenvs/test_scrapy/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 65, in process_chain
    d.callback(input)
  File "/home/dmitry/.virtualenvs/test_scrapy/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 382, in callback
    self._startRunCallbacks(result)
  File "/home/dmitry/.virtualenvs/test_scrapy/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/home/dmitry/.virtualenvs/test_scrapy/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/dmitry/Dropbox/coding/python/scrapy/videos_parser/videos_parser/pipelines.py", line 94, in process_item
    line = json.dumps(dict(item), encoding='utf8') + "\n"
  File "/usr/lib/python2.7/json/__init__.py", line 250, in dumps
    sort_keys=sort_keys, **kw).encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python2.7/json/encoder.py", line 233, in _encoder
    o = o.decode(_encoding)
  File "/home/dmitry/.virtualenvs/test_scrapy/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xd1 in position 15: invalid continuation byte
sys.getdefaultencoding() shows that I am using the ascii encoding. I tried to change it to utf8 with sys.setdefaultencoding('utf8'), but that did not help.
Update
This is the code I used to see what the strings look like before decoding:
try:
    item['title'] = raw_input('Title: ')
    item['title'] = item['title'].decode(sys.stdin.encoding)
except UnicodeDecodeError:
    print repr(item['title'])
try:
    item['description'] = raw_input('Description: ')
    item['description'] = item['description'].decode(sys.stdin.encoding)
except UnicodeDecodeError:
    print repr(item['description'])
This is the console output:
Title: На работе платят бабло, но работать надо на ней
'\xd0\x9d\xd0\xb0 \xd1\x80\xd0\xb0\xd0\xb1\xd0\xbe\xd1\x82\xd0\xd0\xb5 \xd0\xbf\xd0\xbb\xd0\xb0\xd1\x82\xd1\x8f\xd1\x82 \xd0\xb1\xd0\xb0\xd0\xb1\xd0\xbb\xd0\xbe, \xd0\xbd\xd0\xd0\xbe \xd1\x80\xd0\xb0\xd0\xb1\xd0\xbe\xd1\x82\xd0\xb0\xd1\x82\xd1\x8c \xd0\xbd\xd0\xb0\xd0\xb4\xd0\xbe \xd0\xbd\xd0\xb0 \xd0\xbd\xd0\xb5\xd0\xb9'
Description: Я не против первого, но без второго мне веселей
'\xd0\xaf \xd0\xbd\xd0\xb5 \xd0\xbf\xd1\x80\xd0\xbe\xd1\x82\xd0\xb8\xd0\xb2 \xd0\xbf\xd0\xb5\xd1\x80\xd0\xb2\xd0\xbe\xd0\xb3\xd0\xbe \xd0, \xd0\xbd\xd0\xbe \xd0\xb1\xd0\xb5\xd0\xb7 \xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbe\xd0\xb3\xd0\xbe \xd0\xbc\xd0\xbd\xd0\xb5 \xd0\xb2\xd0\xb5\xd1\x81\xd0\xb5\xd0\xbb\xd0\xb5\xd0\xb9'
2 Answers
0
If you use plain raw_input(), all you get back is a byte string:
>>> raw_input('Input: ')
Input: фыв
'\xd1\x84\xd1\x8b\xd0\xb2'
Use unicode() to convert the input string to Unicode:
>>> unicode(raw_input('Input: '), encoding='utf-8')
Input: фыв
u'\u0444\u044b\u0432'
Then you can try passing it to json.
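As a rough sketch of that last step, assuming the console really delivers valid UTF-8: decode the bytes to text first, then hand the text to json.dumps() so the encoder never has to guess an encoding. The helper name to_text and the hard-coded raw_title bytes (copied from the example above in place of a live raw_input() call) are illustrative only.

```python
import json


def to_text(raw, encoding='utf-8'):
    """Decode a byte string to text; pass text through unchanged."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


# Stand-in for the bytes raw_input() returned in the example above.
raw_title = b'\xd1\x84\xd1\x8b\xd0\xb2'

item = {'title': to_text(raw_title)}
# The value is already text, so no encoding argument is needed here.
line = json.dumps(item) + "\n"
print(line)
```

If the bytes are not valid in the chosen encoding, the decode() call raises UnicodeDecodeError right at the input step, which is easier to diagnose than a failure deep inside the JSON encoder.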
1
Your terminal seems to have trouble handling UTF-8 input; extra \xd0 bytes were inserted:
>>> import difflib
>>> given = '\xd0\x9d\xd0\xb0 \xd1\x80\xd0\xb0\xd0\xb1\xd0\xbe\xd1\x82\xd0\xd0\xb5 \xd0\xbf\xd0\xbb\xd0\xb0\xd1\x82\xd1\x8f\xd1\x82 \xd0\xb1\xd0\xb0\xd0\xb1\xd0\xbb\xd0\xbe, \xd0\xbd\xd0\xd0\xbe \xd1\x80\xd0\xb0\xd0\xb1\xd0\xbe\xd1\x82\xd0\xb0\xd1\x82\xd1\x8c \xd0\xbd\xd0\xb0\xd0\xb4\xd0\xbe \xd0\xbd\xd0\xb0 \xd0\xbd\xd0\xb5\xd0\xb9'
>>> expected = 'На работе платят бабло, но работать надо на ней' # requires UTF-8 terminal
>>> for opcode in difflib.SequenceMatcher(a=expected, b=given).get_opcodes():
... print "%6s a[%d:%d] b[%d:%d]" % opcode
... if opcode[0] == 'insert': print 'Inserted:', repr(given[opcode[3]:opcode[4]])
...
equal a[0:15] b[0:15]
insert a[15:15] b[15:16]
Inserted: '\xd0'
equal a[15:45] b[16:46]
insert a[45:45] b[46:47]
Inserted: '\xd0'
equal a[45:85] b[47:87]
>>> expected[14:17]
'\x82\xd0\xb5'
>>> given[14:18]
'\x82\xd0\xd0\xb5'
>>> expected[44:47]
'\xbd\xd0\xbe'
>>> given[45:49]
'\xbd\xd0\xd0\xbe'
>>> given = '\xd0\xaf \xd0\xbd\xd0\xb5 \xd0\xbf\xd1\x80\xd0\xbe\xd1\x82\xd0\xb8\xd0\xb2 \xd0\xbf\xd0\xb5\xd1\x80\xd0\xb2\xd0\xbe\xd0\xb3\xd0\xbe \xd0, \xd0\xbd\xd0\xbe \xd0\xb1\xd0\xb5\xd0\xb7 \xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbe\xd0\xb3\xd0\xbe \xd0\xbc\xd0\xbd\xd0\xb5 \xd0\xb2\xd0\xb5\xd1\x81\xd0\xb5\xd0\xbb\xd0\xb5\xd0\xb9'
>>> expected = 'Я не против первого, но без второго мне веселей' # requires UTF-8 terminal
>>> for opcode in difflib.SequenceMatcher(a=expected, b=given).get_opcodes():
... print "%6s a[%d:%d] b[%d:%d]" % opcode
... if opcode[0] == 'insert': print 'Inserted:', repr(given[opcode[3]:opcode[4]])
...
equal a[0:35] b[0:35]
insert a[35:35] b[35:37]
Inserted: ' \xd0'
equal a[35:85] b[37:87]
>>> expected[34:38]
'\xbe, \xd0'
>>> given[34:40]
'\xbe \xd0, \xd0'
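In the title, two extra \xd0 bytes were inserted, each right where there already was a \xd0 byte. In the description, a space and a \xd0 byte were inserted just before the comma, which is itself followed by a space and \xd0 sequence.
This is not a Python problem; your terminal is misbehaving. Why it does so is not clear.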
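If you want to check captured input for this kind of damage without eyeballing a diff, a small helper can walk the byte string and report every offset where UTF-8 decoding fails. This is only a sketch: invalid_utf8_offsets is a hypothetical name, and the title bytes are the ones from the question's output.

```python
def invalid_utf8_offsets(data):
    """Return the offsets of bytes that stop `data` from decoding as UTF-8."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode('utf-8')
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start is relative to the slice, so shift it by pos.
            offsets.append(pos + exc.start)
            pos += exc.start + 1  # resume just past the offending byte
    return offsets


# The title bytes exactly as printed in the question.
title = (b'\xd0\x9d\xd0\xb0 \xd1\x80\xd0\xb0\xd0\xb1\xd0\xbe\xd1\x82\xd0\xd0\xb5 '
         b'\xd0\xbf\xd0\xbb\xd0\xb0\xd1\x82\xd1\x8f\xd1\x82 '
         b'\xd0\xb1\xd0\xb0\xd0\xb1\xd0\xbb\xd0\xbe, \xd0\xbd\xd0\xd0\xbe '
         b'\xd1\x80\xd0\xb0\xd0\xb1\xd0\xbe\xd1\x82\xd0\xb0\xd1\x82\xd1\x8c '
         b'\xd0\xbd\xd0\xb0\xd0\xb4\xd0\xbe \xd0\xbd\xd0\xb0 \xd0\xbd\xd0\xb5\xd0\xb9')

print(invalid_utf8_offsets(title))
```

For the title this reports offsets 15 and 46, the same two positions the difflib output flags as inserted. It only locates the damage; it does not tell you which bytes the user actually typed, so it is a diagnostic rather than a repair.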