Unicode quotes being evaluated automatically in Python
I'm currently working with JSON strings that contain Unicode-escaped quotes, in the following format:
'{"test":"\u0022"}'
When this is handled as a string, the result is:
'{"test":"""}'
This causes a ValueError on load:
>>> json.loads('{"test":"\u0022"}')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.3/json/__init__.py", line 319, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.3/json/decoder.py", line 352, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.3/json/decoder.py", line 368, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting ',' delimiter: line 1 column 11 (char 10)
>>>
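I can work around this by treating the input as a byte string before it is interpreted as UTF-8 and doing a find-and-replace. However, that is not possible for my real input, which is returned by a library that queries an API, and that API returns UTF-8 encoded strings.
Is there a way to stop Python from automatically converting these Unicode escapes?
4 Answers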
You should be getting a bytes object that contains the JSON string. You need to decode it before you can pass it to json.loads. With Python 3, this works fine:
>>> url = "http://api.tumblr.com/v2/blog/distant-traveller.tumblr.com/posts?api_key=IkJtqSbg6Nd3OBnUdaGl9YWE3ocupygJcnPebHRou8eFbd4RUv&id=79086448801"
>>> import json, urllib.request
>>> jdata = urllib.request.urlopen(url).read()
>>> json.loads(jdata.decode())
{'meta': {'msg': 'OK', 'status': 200}, 'response': {'total_posts': 1, 'blog': {'is_nsfw': False, 'ask': True, 'ask_page_title': 'Ask me anything', 'posts': 5152, 'url': 'http://distant-traveller.tumblr.com/', 'name': 'distant-traveller', 'likes': 44022, 'description': '"The surface of the Earth is the shore of the cosmic ocean... Recently, we\'ve managed to wade a little way out, and the water seems inviting." - Carl Sagan', 'share_likes': True, 'updated': 1395784772, 'title': 'Voyage into Space', 'ask_anon': True}, 'posts': [{'source_url': 'http://wonderous-world.com/post/77780009786/starry-sky-and-jupiter-by-timo-braun', 'image_permalink': 'http://distant-traveller.tumblr.com/image/79086448801', 'link_url': 'http://wonderous-world.tumblr.com', 'source_title': 'wonderous-world', 'photos': [{'caption': '', 'alt_sizes': [{'height': 750, 'width': 500, 'url': 'http://31.media.tumblr.com/159388efacbfa78e281fd5aa0476864f/tumblr_n1jddlcPjW1r787hmo1_500.jpg'}, {'height': 600, 'width': 400, 'url': 'http://25.media.tumblr.com/159388efacbfa78e281fd5aa0476864f/tumblr_n1jddlcPjW1r787hmo1_400.jpg'}, {'height': 375, 'width': 250, 'url': 'http://31.media.tumblr.com/159388efacbfa78e281fd5aa0476864f/tumblr_n1jddlcPjW1r787hmo1_250.jpg'}, {'height': 150, 'width': 100, 'url': 'http://25.media.tumblr.com/159388efacbfa78e281fd5aa0476864f/tumblr_n1jddlcPjW1r787hmo1_100.jpg'}, {'height': 75, 'width': 75, 'url': 'http://24.media.tumblr.com/159388efacbfa78e281fd5aa0476864f/tumblr_n1jddlcPjW1r787hmo1_75sq.jpg'}], 'original_size': {'height': 750, 'width': 500, 'url': 'http://31.media.tumblr.com/159388efacbfa78e281fd5aa0476864f/tumblr_n1jddlcPjW1r787hmo1_500.jpg'}}], 'id': 79086448801, 'state': 'published', 'tags': [], 'date': '2014-03-09 20:01:37 GMT', 'timestamp': 1394395297, 'note_count': 7503, 'reblog_key': 'IFKcbmbd', 'short_url': 'http://tmblr.co/ZbkMUw19fwt2X', 'blog_name': 'distant-traveller', 'post_url': 
'http://distant-traveller.tumblr.com/post/79086448801/wonderous-world-starry-sky-and-jupiter-by-timo', 'slug': 'wonderous-world-starry-sky-and-jupiter-by-timo', 'type': 'photo', 'caption': '<p><a class="tumblr_blog" href="http://wonderous-world.com/post/77780009786/starry-sky-and-jupiter-by-timo-braun">wonderous-world</a>:</p>\n<blockquote>\n<p><a href="http://www.flickr.com/photos/timobraunphotos/12695374254/">Starry Sky and Jupiter</a> by\xa0<a class="owner-name truncate" href="http://www.flickr.com/photos/timobraunphotos/" title="Go to Timo Braun\'s photostream" data-track="attributionNameClick">Timo Braun</a></p>\n</blockquote>', 'format': 'html', 'highlighted': []}]}}
Pretty-printed version:
>>> import pprint
>>> pprint.pprint(json.loads(jdata.decode()))
{'meta': {'msg': 'OK', 'status': 200},
'response': {'blog': {'ask': True,
'ask_anon': True,
'ask_page_title': 'Ask me anything',
'description': '"The surface of the Earth is the shore of the cosmic ocean... Recently, we\'ve managed to wade a little way out, and the water seems inviting." - Carl Sagan',
'is_nsfw': False,
'likes': 44022,
'name': 'distant-traveller',
'posts': 5152,
'share_likes': True,
'title': 'Voyage into Space',
'updated': 1395784772,
'url': 'http://distant-traveller.tumblr.com/'},
'posts': [{'blog_name': 'distant-traveller',
'caption': '<p><a class="tumblr_blog" href="http://wonderous-world.com/post/77780009786/starry-sky-and-jupiter-by-timo-braun">wonderous-world</a>:</p>\n<blockquote>\n<p><a href="http://www.flickr.com/photos/timobraunphotos/12695374254/">Starry Sky and Jupiter</a> by\xa0<a class="owner-name truncate" href="http://www.flickr.com/photos/timobraunphotos/" title="Go to Timo Braun\'s photostream" data-track="attributionNameClick">Timo Braun</a></p>\n</blockquote>',
'date': '2014-03-09 20:01:37 GMT',
'format': 'html',
'highlighted': [],
'id': 79086448801,
'image_permalink': 'http://distant-traveller.tumblr.com/image/79086448801',
'link_url': 'http://wonderous-world.tumblr.com',
'note_count': 7503,
'photos': [{'alt_sizes': [{'height': 750,
'url': 'http://31.media.tumblr.com/159388efacbfa78e281fd5aa0476864f/tumblr_n1jddlcPjW1r787hmo1_500.jpg',
'width': 500},
{'height': 600,
'url': 'http://25.media.tumblr.com/159388efacbfa78e281fd5aa0476864f/tumblr_n1jddlcPjW1r787hmo1_400.jpg',
'width': 400},
{'height': 375,
'url': 'http://31.media.tumblr.com/159388efacbfa78e281fd5aa0476864f/tumblr_n1jddlcPjW1r787hmo1_250.jpg',
'width': 250},
{'height': 150,
'url': 'http://25.media.tumblr.com/159388efacbfa78e281fd5aa0476864f/tumblr_n1jddlcPjW1r787hmo1_100.jpg',
'width': 100},
{'height': 75,
'url': 'http://24.media.tumblr.com/159388efacbfa78e281fd5aa0476864f/tumblr_n1jddlcPjW1r787hmo1_75sq.jpg',
'width': 75}],
'caption': '',
'original_size': {'height': 750,
'url': 'http://31.media.tumblr.com/159388efacbfa78e281fd5aa0476864f/tumblr_n1jddlcPjW1r787hmo1_500.jpg',
'width': 500}}],
'post_url': 'http://distant-traveller.tumblr.com/post/79086448801/wonderous-world-starry-sky-and-jupiter-by-timo',
'reblog_key': 'IFKcbmbd',
'short_url': 'http://tmblr.co/ZbkMUw19fwt2X',
'slug': 'wonderous-world-starry-sky-and-jupiter-by-timo',
'source_title': 'wonderous-world',
'source_url': 'http://wonderous-world.com/post/77780009786/starry-sky-and-jupiter-by-timo-braun',
'state': 'published',
'tags': [],
'timestamp': 1394395297,
'type': 'photo'}],
'total_posts': 1}}
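The same decode-then-parse step can be tried without touching the network. This is a minimal sketch, assuming the API body arrives as UTF-8 bytes; the `body` value is a made-up stand-in for `urlopen(url).read()`, not real Tumblr output:

```python
import json

# Hypothetical stand-in for urllib.request.urlopen(url).read():
# the bytes contain a real backslash followed by "u0022".
body = b'{"test":"\\u0022"}'

text = body.decode("utf-8")   # bytes -> str; the JSON escape is left untouched
data = json.loads(text)       # the JSON parser turns \u0022 into a quote

print(text)   # {"test":"\u0022"}
print(data)   # {'test': '"'}
```

On Python 3.6 and later, json.loads also accepts bytes directly, so the explicit decode() matters mainly on older versions such as the 3.3 shown in the traceback.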
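Your problem seems to be that you are copy-pasting a string into Python without taking care of the special characters. It is Python, not the json module, that turns \u0022 into a quote, and that processing only runs on string literals or on text passed to eval. If you fetch the data the proper way, you won't have this problem: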
>>> import json, requests
>>> resp = requests.get("http://api.tumblr.com/v2/blog/distant-traveller.tumblr.com/posts?api_key=IkJtqSbg6Nd3OBnUdaGl9YWE3ocupygJcnPebHRou8eFbd4RUv&id=79086448801")
>>> json.loads(resp.text)
# Gives data, not an error
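If you really do want to paste it into your source file, use a raw string. That disables Python's \u... processing for that literal, so the string holds the raw characters rather than the single decoded character: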
>>> json.loads(r'{"test":"\u0022"}')
{'test': '"'}
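To see the difference between the two literals, compare their contents directly. This is a small sketch using only the standard library; the names `cooked` and `raw` are illustrative:

```python
import json

cooked = '{"test":"\u0022"}'   # Python expands \u0022 while compiling the literal
raw = r'{"test":"\u0022"}'     # raw literal: backslash, u, 0, 0, 2, 2 stay as six characters

print(len(cooked), len(raw))   # 12 17

try:
    json.loads(cooked)         # the parser sees {"test":"""}, which is not valid JSON
except ValueError as exc:
    print("cooked literal fails:", exc)

print(json.loads(raw))         # {'test': '"'}
```

The five-character difference in length is the escape sequence collapsing from six characters to one.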
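The problem is the string literal in your example: in a str literal, Python itself expands \u0022 into a quote. In bytes the escape is left alone, so you can either ask for unicode or decode the bytes as in this example: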
txt = br'{"test":"\u0022"}'  # raw bytes literal: same bytes, without the invalid-escape warning on newer Pythons
json.loads(txt.decode())
Out[10]: {'test': '"'}
It may be clearer if you look at what the unicode literal should look like:
txt.decode()
Out[12]: '{"test":"\\u0022"}'
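The doubled backslash in that output comes from how the interpreter displays strings (their repr); the decoded string itself holds only one backslash. A short sketch to confirm this, with the bytes value written out by hand as an assumption of what the API sends:

```python
# Hypothetical payload: one real backslash followed by "u0022".
text = b'{"test":"\\u0022"}'.decode()

print(repr(text))         # '{"test":"\\u0022"}'  (repr escapes the backslash)
print(text)               # {"test":"\u0022"}     (what the string really contains)
print(text.count("\\"))   # 1
```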
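If you are getting the strings from an API query, they have already been handled correctly. For example, when you write
'{"test":"\u0022"}'
in a source file, Python takes \u0022 to mean that the string should contain a literal ". A string obtained from correctly written API code will instead contain a literal backslash, a u, and some digits, which is equivalent to writing this in a source file: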
'{"test":"\\u0022"}'
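If your code fails on the actual data returned by the API query, then either the API itself is broken (possible, but uncommon) or you are doing something wrong when processing the data, most likely parsing the escape sequences twice.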
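One way such double parsing can happen is running the response text through Python's unicode_escape codec before handing it to json.loads. The sketch below uses a made-up response string (`api_text` is an assumption, not real API output) and shows that the extra pass recreates the broken input from the question:

```python
import json

# Hypothetical API text: contains a literal backslash, "u", and four digits.
api_text = '{"test":"\\u0022"}'

# Correct: let the JSON parser handle the escape exactly once.
print(json.loads(api_text))            # {'test': '"'}

# Mistake: unescaping first turns \u0022 into a bare quote ...
unescaped = api_text.encode("ascii").decode("unicode_escape")
print(unescaped)                       # {"test":"""}

# ... so the JSON parser is now given malformed JSON.
try:
    json.loads(unescaped)
except ValueError as exc:
    print("double parsing fails:", exc)
```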