How to get string objects instead of Unicode from JSON

302 votes

21 answers

393635 views

数据工程师

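Asked 2025-04-15 12:04

I'm using Python 2 to parse JSON read from ASCII-encoded text files.

When I load these files with either json or simplejson, all my string values are converted to Unicode objects instead of string objects. The problem is that I have to use the data with some libraries that only accept string objects. I can't change those libraries, and I can't update them.

Is there a way to get string objects instead of Unicode objects?

Example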

>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b']  # I want these to be of type `str`, not `unicode`

(A simple and clean solution is to use a recent version of Python, that is, Python 3 or later.)

unicode json ascii string manipulation data parsing simplejson string encoding python 2

21 Answers

146

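There is no built-in option that makes the json module functions return byte strings instead of Unicode strings. However, this short recursive function converts any decoded JSON object from Unicode strings to UTF-8-encoded byte strings: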

def byteify(input):
    if isinstance(input, dict):
        return {byteify(key): byteify(value)
                for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

Just call this on the output you get from json.load or json.loads.
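As a rough usage sketch (not part of the original answer), the call can be chained directly onto json.loads. The text_type shim and the sample JSON are assumptions added here so the snippet also runs on Python 3, where the results are bytes objects; on Python 2 they are plain str. The dict(...) form is used so the sketch also works without dict comprehensions.

```python
import json

try:
    text_type = unicode  # Python 2: the type json.loads returns for strings
except NameError:
    text_type = str      # Python 3 shim (assumption, so the sketch runs there too)


def byteify(value):
    """Recursively encode every text string in a decoded JSON value as UTF-8."""
    if isinstance(value, dict):
        return dict((byteify(k), byteify(v)) for k, v in value.items())
    if isinstance(value, list):
        return [byteify(item) for item in value]
    if isinstance(value, text_type):
        return value.encode('utf-8')
    return value  # numbers, booleans and None pass through untouched


# Sample input (hypothetical); "\u00e9" is a non-ASCII character
decoded = json.loads('{"name": "caf\\u00e9", "tags": ["a", "b"], "count": 2}')
converted = byteify(decoded)
```

Note that non-ASCII text comes back as its UTF-8 byte sequence, so the receiving library has to be able to cope with UTF-8 input.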

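A couple of notes:

To support Python 2.6 or earlier, replace return {byteify(key): byteify(value) for key, value in input.iteritems()} with return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()]), since dict comprehensions were not supported before Python 2.7.
Because this answer recurses through the entire decoded object, it has a couple of undesirable performance characteristics. These can be avoided with very careful use of the object_hook or object_pairs_hook parameters. Mirec Miskuf's answer is so far the only one that pulls this off correctly, although as a consequence it is considerably more complicated than my approach.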

Answered 2025-04-15 by Python大师

188

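While there are some good answers here, I ended up using PyYAML to parse my JSON files, since it gives the keys and values as str objects instead of unicode objects. Because JSON is essentially a subset of YAML, this works nicely: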

>>> import json
>>> import yaml
>>> list_org = ['a', 'b']
>>> list_dump = json.dumps(list_org)
>>> list_dump
'["a", "b"]'
>>> json.loads(list_dump)
[u'a', u'b']
>>> yaml.safe_load(list_dump)
['a', 'b']

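Notes

A few things to keep in mind, though:

I get string objects because all my entries are ASCII-only. If an entry contained non-ASCII characters, I would get it back as a unicode object; there is no conversion!
You should (probably always) use PyYAML's safe_load function; if you use it to load JSON files, you don't need the "additional power" of the load function anyway.
If you want a YAML parser with better support for version 1.2 of the spec (and one that correctly parses very small numbers), try Ruamel YAML: pip install ruamel.yaml and import ruamel.yaml as yaml was all I needed in my tests.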

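Conversion

As stated, there is no conversion here! If you can't be sure you will only deal with ASCII values (and most of the time you can't), you are better off using a conversion function:

I have used the one from Mark Amery a couple of times now; it works well and is very easy to use. You can also use a similar function as an object_hook instead, which may give you a performance boost on large files. See the slightly more involved answer from Mirec Miskuf for that.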

Answered 2025-04-15 by Python大师

116

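Solution with `object_hook`

This works with both Python 2.7 and 3.x. On Python 3, where decoded strings are already str, they are returned unchanged.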

import json

def json_load_byteified(file_handle):
    return _byteify(
        json.load(file_handle, object_hook=_byteify),
        ignore_dicts=True
    )

def json_loads_byteified(json_text):
    return _byteify(
        json.loads(json_text, object_hook=_byteify),
        ignore_dicts=True
    )

def _byteify(data, ignore_dicts=False):
    if isinstance(data, str):
        return data

    # If this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [_byteify(item, ignore_dicts=True) for item in data]
    # If this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict) and not ignore_dicts:
        return {
            _byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
            for key, value in data.items() # changed to .items() for Python 2.7/3
        }

    # Python 2/3-compatible check that avoids naming the `unicode` type directly
    # If this is a Python 2 unicode string, return it encoded as UTF-8 bytes
    if str(type(data)) == "<type 'unicode'>":
        return data.encode('utf-8')

    # If it's anything else, return it in its original form
    return data

Example usage:

>>> json_loads_byteified('{"Hello": "World"}')
{'Hello': 'World'}
>>> json_loads_byteified('"I am a top-level string"')
'I am a top-level string'
>>> json_loads_byteified('7')
7
>>> json_loads_byteified('["I am inside a list"]')
['I am inside a list']
>>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
[[[[[[[['I am inside a big nest of lists']]]]]]]]
>>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
{'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
>>> json_load_byteified(open('somefile.json'))
{'more json': 'from a file'}

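How does this work and why would I use it?

Mark Amery's function is shorter and clearer than these, so what's the point of them? Why would you want to use them?

Purely for performance. Mark's answer first decodes the JSON text fully with Unicode strings, then recurses through the entire decoded value to convert all strings to byte strings. This has two undesirable effects:

A copy of the entire decoded structure gets created in memory
If your JSON object is really deeply nested (500 levels or more), you will hit Python's maximum recursion depth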
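The second point can be reproduced with a small sketch. The byteify function below is a recursive post-processing pass in the style of Mark Amery's answer; the text_type shim and the nested test structure are assumptions added here so the snippet runs on both Python 2 and Python 3.

```python
import sys

try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3 shim (assumption, so the sketch runs there too)


def byteify(value):
    """Recursive post-processing pass in the style of Mark Amery's answer."""
    if isinstance(value, dict):
        return dict((byteify(k), byteify(v)) for k, v in value.items())
    if isinstance(value, list):
        return [byteify(item) for item in value]
    if isinstance(value, text_type):
        return value.encode('utf-8')
    return value


# Build a structure nested deeper than the interpreter's recursion limit.
# It is built in a loop rather than parsed from JSON text so that the only
# recursion involved is byteify's own.
depth = sys.getrecursionlimit() + 100
nested = u'leaf'
for _ in range(depth):
    nested = {u'child': nested}

try:
    byteify(nested)
    hit_limit = False
except RuntimeError:  # RecursionError on Python 3 is a subclass of RuntimeError
    hit_limit = True
```

Each nesting level costs at least one Python stack frame, so any structure nested deeper than the recursion limit makes the post-processing pass fail.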

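This answer mitigates both of those performance issues by using the object_hook parameter of json.load and json.loads. From the docs:

object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders.

Since dictionaries nested many levels deep in other dictionaries get passed to object_hook as they are decoded, we can byteify any strings or lists inside them at that point and avoid the need for deep recursion later.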
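A small sketch makes the call order visible. The record hook and the sample JSON are hypothetical names introduced for illustration; the hook only logs the keys of each dict it receives and returns the dict unchanged.

```python
import json

seen = []


def record(obj):
    """Hypothetical hook: log each decoded dict's keys, return it unchanged."""
    seen.append(sorted(obj))
    return obj


json.loads('{"outer": {"inner": {"leaf": 1}}, "n": 2}', object_hook=record)
# seen is now [['leaf'], ['inner'], ['n', 'outer']]
```

The innermost dict reaches the hook first, and by the time an enclosing dict arrives its nested dicts have already been replaced by the hook's return values. That is why the hook never needs to descend into dicts itself.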

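Mark's answer isn't suitable for use as an object_hook as it stands, because it recurses into nested dictionaries. We prevent that recursion in this answer with the ignore_dicts parameter of _byteify, which is passed as True in every call except the one where object_hook hands _byteify a new dict to convert. The ignore_dicts flag tells _byteify to skip dicts, since they have already been byteified.

Finally, our implementations of json_load_byteified and json_loads_byteified call _byteify (with ignore_dicts=True) on the result returned by json.load or json.loads, to handle the case where the JSON text being decoded doesn't have a dict at the top level.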
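To see why that final pass is needed, here is a minimal sketch; the hook function and the sample inputs are hypothetical and only count how often the hook fires.

```python
import json

calls = []


def hook(obj):
    """Hypothetical hook that records every dict it is handed."""
    calls.append(obj)
    return obj


# No dict anywhere in the document: the hook is never invoked
result = json.loads('["a", ["b"], "c"]', object_hook=hook)
calls_without_dict = len(calls)

# A dict inside a top-level list: the hook sees the dict, but not "a"
mixed = json.loads('["a", {"k": "v"}]', object_hook=hook)
calls_with_dict = len(calls)
```

Strings that sit outside any dict, such as "a" above, never pass through the hook, so something has to visit them after decoding finishes.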

Answered 2025-04-15 by Python大师
