Decoding JSON that contains invalid characters

Asked 2025-04-18 08:02

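I have a service that receives data from an external service (through a redis list used as a queue). The data is a simple JSON-formatted dictionary; for example, it might look like this: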

{
  "type": "visit",
  "referer": "http://www.google.com/",
  "session_referer": "http://www.google.com/\x0e",
  "uuid": "48e8ea41-420d-021c-be16-7ac5b7c6fb97",
  "user_ip": "1.2.3.4",
  "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36",
  "user_locale": "en_US",
}

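The problem is that, as the example above shows, the referer or session_referer data is sometimes invalid (that is, it cannot be decoded with the encodings I expect, such as UTF-8 or ISO-8859-1).

My trouble is that I then cannot access any of the other data. I can accept that the referer data is bad, but I still need the other fields. Is there a way to do a "raw" decode, without converting the data to any particular encoding, and then let me handle it myself?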

2 Answers

1

You could try passing strict=False to json.loads, which allows control characters inside strings.

https://docs.python.org/2/library/json.html
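As a minimal sketch of this suggestion (written for Python 3; the payload below is made up for illustration and, unlike the one in the question, has no trailing comma, which strict=False would not fix):

```python
import json

# Hypothetical payload: a raw 0x0E control character inside a string value.
raw = '{"session_referer": "http://www.google.com/\x0e", "user_ip": "1.2.3.4"}'

# Default (strict) parsing rejects raw control characters inside strings.
try:
    json.loads(raw)
    strict_ok = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    strict_ok = False

# With strict=False the control character is kept in the decoded value,
# and every other field stays accessible.
data = json.loads(raw, strict=False)
print(strict_ok)
print(data["user_ip"])
print(repr(data["session_referer"]))
```

The bad byte is still present in session_referer after decoding, so it can be inspected or stripped afterwards as needed.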

2

Suppose you have a text file containing a JSON-like "string" that has:

  1. a hex 0E byte inside the "session_referer" value, and
  2. an extraneous comma after the last key/value pair:

[Image: npp.png]

The following Python code strips out those troublesome values...

# -*- coding: iso-8859-1 -*-
import json
import re

# retrieve the JSON data into a string
with open(r'C:\Users\Gord\Desktop\jsonData.txt', 'r') as f:
    s = f.read()
print '~> raw JSON string'
print s
print

# remove "characters" below \x20 except \n
s = re.sub(r'[\000-\011\013-\037]', '', s)
# remove (extraneous) last comma
s = re.sub(',\n}$', '\n}', s)
print '~> tweaked JSON string'
print s
print

# decode tweaked JSON string
j = json.loads(s)

# see what we got
print '~> decoded result "pretty printed"'
print json.dumps(j, sort_keys=True, indent=4, separators=(',', ': '))
print

# extract just one element
print '~> print just j["user_ip"]'
print j["user_ip"]

...and produces the following output in Python's IDLE:

Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>> 
~> raw JSON string
{
  "type": "visit",
  "referer": "http://www.google.com/",
  "session_referer": "http://www.google.com/♫",
  "uuid": "48e8ea41-420d-021c-be16-7ac5b7c6fb97",
  "user_ip": "1.2.3.4",
  "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36",
  "user_locale": "en_US",
}

~> tweaked JSON string
{
  "type": "visit",
  "referer": "http://www.google.com/",
  "session_referer": "http://www.google.com/",
  "uuid": "48e8ea41-420d-021c-be16-7ac5b7c6fb97",
  "user_ip": "1.2.3.4",
  "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36",
  "user_locale": "en_US"
}

~> decoded result "pretty printed"
{
    "referer": "http://www.google.com/",
    "session_referer": "http://www.google.com/",
    "type": "visit",
    "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36",
    "user_ip": "1.2.3.4",
    "user_locale": "en_US",
    "uuid": "48e8ea41-420d-021c-be16-7ac5b7c6fb97"
}

~> print just j["user_ip"]
1.2.3.4
>>> 
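For a queue consumer that gets the payload as a string rather than from a file, the same two clean-up steps could be wrapped in a small function. This is only a sketch in Python 3 syntax: loads_lenient and the sample payload are hypothetical names chosen for illustration, and the trailing-comma pattern is a blunt heuristic that would also rewrite a string value containing a comma directly before a closing brace or bracket.

```python
import json
import re

# Everything below 0x20 except newline (same character class as the
# octal ranges used in the script above, written in hex).
CONTROL_CHARS = re.compile(r'[\x00-\x09\x0b-\x1f]')
# A comma followed only by whitespace before a closing brace or bracket.
TRAILING_COMMA = re.compile(r',\s*([}\]])')


def loads_lenient(text):
    """Hypothetical helper: strip known defects, then parse as JSON."""
    text = CONTROL_CHARS.sub('', text)
    text = TRAILING_COMMA.sub(r'\1', text)
    return json.loads(text)


# Made-up payload with both defects: a 0x0E byte and a trailing comma.
raw = (
    '{\n'
    '  "session_referer": "http://www.google.com/\x0e",\n'
    '  "user_ip": "1.2.3.4",\n'
    '}'
)
data = loads_lenient(raw)
print(data["user_ip"])
```

Stripping the control characters changes the referer value, which matches the asker's position that the referer may be bad as long as the other fields survive.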
