How do I decode a byte string that was encoded three times?

Asked 2025-04-14 15:22

I'm processing data with pandas, and one column holds values encoded as bytes. I first decoded it with .decode('utf-8'), and most of the data then displayed correctly, but some strings appear to have been encoded more than once. For example, one string looks like this: b'b\'[{"charcName":"\\\\u0420\\\\u0438\\\\u0441\\\\u0443\\\\u043d\\\\u043e\\\\u043a","charcValues":["\\\\u043c\\\\u0438\\\\u043b\\\\u0438\\\\u0442\\\\u0430\\\\u0440\\\\u0438 \\\\u043a\\\\u0430\\\\u043c\\\\u0443\\\\u0444\\\\u043b\\\\u044f\\\\u0436"]}]\'''

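I tried decoding again and again (re-encoding in between to avoid the "'str' object has no attribute 'decode'" error), but it didn't seem to help. How can I fully decode these strings? In what order should the utf-8 and unicode_escape decodings be applied?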

data-processing utf-8 data-cleaning pandas byte-decoding unicode_escape string-encoding multiple-decoding

1 Answer

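The original string is malformed, so I removed one layer of the broken bytes decoration and focused on decoding the rest. This won't work as-is on the other entries, because I stripped the bad part of the invalid string by hand. Tell the upstream developers to fix the problem at the source.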

import ast
import json

s = b'b\'[{"charcName": "\\\\u0420\\\\u0438\\\\u0441\\\\u0443\\\\u043d\\\\u043e\\\\u043a", "charcValues": ["\\\\u043c\\\\u0438\\\\u043b\\\\u0438\\\\u0442\\\\u0430\\\\u0440\\\\u0438 \\\\u043a\\\\u0430\\\\u043c\\\\u0443\\\\u0444\\\\u043b\\\\u044f\\\\u0436"]}]\''
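# Layer 1: the bytes hold the repr of another bytes object; evaluate it back into bytes.
s = ast.literal_eval(s.decode())
# Layer 2: the bytes now hold a list literal with \uXXXX escapes; evaluate it into Python objects.
s = ast.literal_eval(s.decode())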

print('# Original object:')
print(s)
print('\n# Properly encoded in JSON (tell the hacks of the original data how to do it):')
print(json.dumps(s))
print('\n# Or this, but make sure to write this to a UTF-8-encoded database or file.')
print(json.dumps(s, ensure_ascii=False))

Output:

# Original object:
[{'charcName': 'Рисунок', 'charcValues': ['милитари камуфляж']}]

# Properly encoded in JSON (tell the hacks of the original data how to do it):
[{"charcName": "\u0420\u0438\u0441\u0443\u043d\u043e\u043a", "charcValues": ["\u043c\u0438\u043b\u0438\u0442\u0430\u0440\u0438 \u043a\u0430\u043c\u0443\u0444\u043b\u044f\u0436"]}]

# Or this, but make sure to write this to a UTF-8-encoded database or file.
[{"charcName": "Рисунок", "charcValues": ["милитари камуфляж"]}]
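
If the number of wrapping layers varies from row to row, the two fixed literal_eval calls above could be generalized into a loop. The following is only a sketch under assumptions: peel_layers and max_layers are hypothetical names, it assumes every extra layer is a well-formed b'...' repr (so it would still fail on the malformed string from the question), and it assumes the innermost payload is JSON, which is why the last step uses json.loads instead of ast.literal_eval.

```python
import ast
import json


def peel_layers(value, max_layers=5):
    """Hypothetical helper: strip nested b'...' reprs, then parse the JSON inside."""
    for _ in range(max_layers):
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        text = value.strip()
        # A leftover bytes repr starts with b' or b"; evaluate it safely into bytes.
        if text[:2] in ("b'", 'b"'):
            value = ast.literal_eval(text)
        else:
            break
    # Assumption: the innermost layer is JSON text; json.loads resolves the \uXXXX escapes.
    return json.loads(value)


# Shortened sample with the same layering as the string in the answer above.
raw = b'b\'[{"charcName": "\\\\u0420\\\\u0438\\\\u0441\\\\u0443\\\\u043d\\\\u043e\\\\u043a"}]\''
decoded = peel_layers(raw)
print(decoded)
```

With pandas, something like df["col"].map(peel_layers) should apply this to a whole column, assuming "col" (a placeholder name) holds values of this shape. Rows that are malformed will raise, which is probably what you want until the upstream data is fixed.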

Answered 2025-04-14 by Python大师
