Reading a large JSON file in Python

Asked 2025-04-18 01:38

I have a large JSON file, about 5 GB, but it is not a single JSON document. It is several JSON documents concatenated together.

{"created_at":"Mon Jan 13 20:01:57 +0000 2014","id":422820833807970304,"id_str":"422820833807970304"}
{"created_at":"Mon Jan 13 20:01:57 +0000     2014","id":422820837545500672,"id_str":"422820837545500672"}.....

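There are no newlines between the JSON objects; they sit directly next to each other.

I tried using sed to replace the braces with newlines, and then read the file like this: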

data = []
for line in open(filename, 'r').readline():
    data.append(json.loads(line))

But this did not work.

How can I read this file reasonably quickly?

Any help is much appreciated!


1 Answer

This is really a hack. It does not load the whole file into memory. I sincerely hope you are using Python 3.

DecodeLargeJSON.py

from DecodeLargeJSON import *
import io
import json

# create a file with two jsons
f = io.StringIO()
json.dump({1:[]}, f)
json.dump({2:"hallo"}, f)
print(repr(f.getvalue()))
f.seek(0) 

# decode the file f. f could be any file from here on. f.read(...) should return str
o1, idx1 = json.loads(FileString(f), cls = BigJSONDecoder)
print(o1) # this is the loaded object
# idx1 is the index that the second object begins with
o2, idx2 = json.loads(FileString(f, idx1), cls = BigJSONDecoder)
print(o2)

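If you find objects that cannot be decoded, let me know and we can work out a solution together.

Disclaimer: This is neither an efficient solution nor the best one. It is only a hack that shows how this can be done.

Discussion: Because it does not load the whole file into memory, ordinary regular expressions cannot be used. It also relies on the pure-Python implementation of the decoder rather than the C implementation, which may make it slower. I really dislike how complicated this simple task has become. Hopefully someone else can suggest a different solution.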
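
One possible alternative that stays within the standard library is `json.JSONDecoder.raw_decode`, which parses one document from a string and returns the index where it ended. The following is a minimal sketch of that idea, not the method used by DecodeLargeJSON.py. The names `iter_concatenated_json` and `chunk_size` are made up for illustration, and the sketch assumes that every top-level value is an object or an array (as in the question) and that each one is small compared with the chunk size.

```python
import io
import json


def iter_concatenated_json(fp, chunk_size=1 << 16):
    """Yield each top-level JSON value from a text stream of back-to-back documents.

    Assumption: top-level values are objects or arrays. A bare number split
    across two chunks could be decoded too early, so this sketch does not
    support bare scalars at the top level.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = fp.read(chunk_size)  # returns "" at end of file
        buffer += chunk
        pos = 0
        while True:
            # raw_decode does not skip leading whitespace, so do it here.
            while pos < len(buffer) and buffer[pos].isspace():
                pos += 1
            if pos >= len(buffer):
                break
            try:
                obj, pos = decoder.raw_decode(buffer, pos)
            except json.JSONDecodeError:
                # Most likely the document is cut off at the chunk boundary;
                # keep the remainder and read more data.
                break
            yield obj
        buffer = buffer[pos:]  # drop everything that was already decoded
        if not chunk:
            if buffer.strip():
                raise ValueError("trailing data is not valid JSON")
            return


if __name__ == "__main__":
    # Same kind of test input as in the answer: two documents, no separator.
    f = io.StringIO()
    json.dump({1: []}, f)
    json.dump({2: "hallo"}, f)
    f.seek(0)
    for obj in iter_concatenated_json(f):
        print(obj)
```

For a real file, pass an object opened in text mode, for example `open(filename, encoding="utf-8")`, and process each yielded object inside the loop instead of collecting all of them in a list, so that memory use stays bounded. A document that is cut off is parsed again from its start after each further read, so very large individual documents would make this approach slow.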

Answered 2025-04-18 by Python大师
