tidy-json-to-csv

Convert JSON into a set of tidy CSV files
Converts a subset of JSON into a set of tidy CSVs. Supports streaming of both the input JSON and the output CSVs, and so is suitable for large files in memory-constrained environments.
What problem does this solve?
Most JSON-to-CSV converters do not produce data suitable for immediate analysis. They typically output a single CSV and, to do so, emit some combination of:
- JSON embedded inside CSV fields
- values from lists spread across columns
- data repeated across multiple rows / rows whose position in the CSV determines their context
These outputs usually require subsequent manual, and therefore error-prone, manipulation. The aim of this library is to do all of that transformation up front, so you end up with a set of tidy tables, which is usually a good place to start an analysis.
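As a sketch of the first problem above, a naive single-CSV converter (hypothetical, for illustration only) might serialise nested structures as JSON strings inside a field, which then need manual parsing before analysis:

```python
import csv
import io
import json

# One song from the example input, with a nested "categories" array
song = {
    "id": "1",
    "title": "Walk through the fire",
    "categories": [{"id": "1", "name": "musicals"}],
}

# A naive converter: the nested array becomes a JSON string in one field
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "title", "categories"])
writer.writerow([song["id"], song["title"], json.dumps(song["categories"])])

# The "categories" field now holds embedded JSON, not tidy columns
print(buf.getvalue())
```

Recovering the nested data from such a CSV means re-parsing JSON out of every affected field, which is exactly the manual step this library avoids.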
Example input and output
JSON
```json
{
    "songs": [
        {
            "id": "1",
            "title": "Walk through the fire",
            "categories": [
                {"id": "1", "name": "musicals"},
                {"id": "2", "name": "television-shows"}
            ],
            "comments": [
                {"content": "I love it"},
                {"content": "I've heard better"}
            ],
            "artist": {"name": "Slayer"}
        },
        {
            "id": "2",
            "title": "I could have danced all night",
            "categories": [
                {"id": "1", "name": "musicals"},
                {"id": "3", "name": "films"}
            ],
            "comments": [
                {"content": "I also could have danced all night"}
            ],
            "artist": {"name": "Doolitle"}
        }
    ]
}
```
maps to four files:
songs.csv
```csv
"id","title","artist__name"
"1","Walk through the fire","Slayer"
"2","I could have danced all night","Doolitle"
```
songs__categories__id.csv
```csv
"songs__id","categories__id"
"1","1"
"1","2"
"2","1"
"2","3"
```
songs__comments.csv
```csv
"songs__id","content"
"1","I love it"
"1","I've heard better"
"2","I also could have danced all night"
```
categories.csv
```csv
"id","name"
"1","musicals"
"2","television-shows"
"3","films"
```
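Because the output is a set of normalised tables, the original nesting can be recovered with ordinary joins. A minimal pure-Python sketch, with the two CSVs above inlined as strings to keep the example self-contained:

```python
import csv
import io

# The link table and the categories table from the example output
link_csv = '"songs__id","categories__id"\n"1","1"\n"1","2"\n"2","1"\n"2","3"\n'
categories_csv = '"id","name"\n"1","musicals"\n"2","television-shows"\n"3","films"\n'

# Index categories by primary key
categories = {
    row["id"]: row["name"]
    for row in csv.DictReader(io.StringIO(categories_csv))
}

# Join the link table against categories to list category names per song
songs_to_names = {}
for row in csv.DictReader(io.StringIO(link_csv)):
    songs_to_names.setdefault(row["songs__id"], []).append(
        categories[row["categories__id"]]
    )

print(songs_to_names)
# {'1': ['musicals', 'television-shows'], '2': ['musicals', 'films']}
```

The same join works unchanged in pandas or SQL once the CSVs are loaded.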
Installation
pip install tidy-json-to-csv
Usage: convert JSON to multiple CSV files (command line)
cat songs.json | tidy_json_to_csv
Usage: convert JSON to multiple CSV files (Python)
```python
from tidy_json_to_csv import to_csvs

# A save function, called by to_csvs for each CSV file to be generated.
# Will be run in a separate thread, started by to_csvs
def save_csv_bytes(path, chunks):
    with open(f'{path}.csv', 'wb') as f:
        for chunk in chunks:
            f.write(chunk)

def json_bytes():
    with open('file.json', 'rb') as f:
        chunk = f.read(65536)
        while chunk:
            yield chunk
            chunk = f.read(65536)

to_csvs(json_bytes(), save_csv_bytes, null='#NA', output_chunk_size=65536)
```
Usage: convert JSON to multiple Pandas DataFrames (Python)
```python
import io
import queue

import pandas as pd
from tidy_json_to_csv import to_csvs

def json_to_pandas(json_filename):
    q = queue.Queue()

    # Adapts an iterable of bytes chunks into a readable file-like object,
    # so pandas can consume the streamed CSV without buffering it all
    class StreamedIterable(io.RawIOBase):
        def __init__(self, iterable):
            self.iterable = iterable
            self.remainder = b''

        def readable(self):
            return True

        def readinto(self, b):
            buffer_size = len(b)
            while len(self.remainder) < buffer_size:
                try:
                    self.remainder = self.remainder + next(self.iterable)
                except StopIteration:
                    if self.remainder:
                        break
                    return 0
            chunk, self.remainder = self.remainder[:buffer_size], self.remainder[buffer_size:]
            b[:len(chunk)] = chunk
            return len(chunk)

    def save_csv_bytes(path, chunks):
        q.put((path, pd.read_csv(
            io.BufferedReader(StreamedIterable(chunks), buffer_size=65536),
            na_values=['#NA'],
        )))

    def json_bytes():
        with open(json_filename, 'rb') as f:
            chunk = f.read(65536)
            while chunk:
                yield chunk
                chunk = f.read(65536)

    to_csvs(json_bytes(), save_csv_bytes, null='#NA')

    dfs = {}
    while not q.empty():
        path, df = q.get()
        dfs[path] = df
    return dfs

dfs = json_to_pandas('songs.json')
for path, df in dfs.items():
    print(path)
    print(df)
```
Limitations
The input JSON is assumed to be denormalised, and the output is normalised. If a nested object has an id field, it is assumed to be the primary key of a top-level table. Every object that contains a nested object or array must have an id field to act as its primary key in the final output. If present, id must be the first key of a map. All arrays must be arrays of objects rather than of primitives.
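These constraints can be checked before conversion. The following is a rough, illustrative validator, not part of the library; `check_constraints` and its exact rules (including exempting the outermost wrapping object, which has no id in the example input) are assumptions made for this sketch:

```python
import json

def check_constraints(node, path='$'):
    """Yield constraint violations for candidate tidy-json-to-csv input.

    Illustrative only: walks a parsed document and flags objects with
    nested data but no id, ids that are not the first key, and arrays
    of primitives.
    """
    if isinstance(node, dict):
        has_nested = any(isinstance(v, (dict, list)) for v in node.values())
        keys = list(node)
        if path != '$':  # the outermost wrapping object is exempt
            if has_nested and 'id' not in keys:
                yield f'{path}: object with nested data lacks an "id" field'
            elif 'id' in keys and keys[0] != 'id':
                yield f'{path}: "id" must be the first key'
        for key, value in node.items():
            yield from check_constraints(value, f'{path}.{key}')
    elif isinstance(node, list):
        for i, item in enumerate(node):
            if isinstance(item, dict):
                yield from check_constraints(item, f'{path}[{i}]')
            else:
                yield f'{path}[{i}]: arrays must contain objects, not primitives'

# A document that violates both rules: no id, and an array of strings
doc = json.loads('{"songs": [{"title": "No id", "tags": ["a", "b"]}]}')
for problem in check_constraints(doc):
    print(problem)
```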
Although processing is mostly streaming, in order to support denormalised input JSON and avoid repeating identical rows in the normalised CSVs, an internal record of output ids is maintained during processing.
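That deduplication can be pictured as a set of already-emitted primary keys per table. A simplified sketch of the idea (the library's actual bookkeeping may differ):

```python
import csv
import io

# "musicals" (id 1) appears under both songs in the example input JSON,
# because the input is denormalised
nested_categories = [
    {'id': '1', 'name': 'musicals'},
    {'id': '2', 'name': 'television-shows'},
    {'id': '1', 'name': 'musicals'},
    {'id': '3', 'name': 'films'},
]

# Emit each row only the first time its primary key is seen, so the
# normalised categories.csv has one row per id
seen_category_ids = set()
out = io.StringIO()
writer = csv.writer(out, quoting=csv.QUOTE_ALL)
writer.writerow(['id', 'name'])
for category in nested_categories:
    if category['id'] not in seen_category_ids:
        seen_category_ids.add(category['id'])
        writer.writerow([category['id'], category['name']])

print(out.getvalue())
```

The output matches the categories.csv shown above: four rows, despite "musicals" appearing twice in the input.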