I'm trying to use Python to parse a 12 GB JSON file containing almost 5 million lines (each line is an object) and store it in a database. I'm using ijson and multiprocessing to speed it up. Here is the code:
import json
import ijson
import numpy
import pandas as pd
from multiprocessing import Pool
# The Django models Venues, Papers and Ratings are imported from the project's app
# (the import path is not shown in the original post).

# Global dataframe collecting one row per author/venue/year; its initialisation
# was not shown in the post, so the columns are inferred from the append() call below.
mydata = pd.DataFrame(columns=['author_id', 'venue_raw', 'year', 'number_of_times'])

def parse(paper):
    global mydata
    # store the venue, with or without its type
    if 'type' not in paper["venue"]:
        venue = Venues(venue_raw = paper["venue"]["raw"])
        venue.save()
    else:
        venue = Venues(venue_raw = paper["venue"]["raw"], venue_type = paper["venue"]["type"])
        venue.save()
    paper1 = Papers(paper_id = paper["id"], paper_title = paper["title"], venue = venue)
    paper1.save()
    # re-serialise the authors and stream them back through ijson
    paper_authors = paper["authors"]
    paper_authors_json = json.dumps(paper_authors)
    obj = ijson.items(paper_authors_json, 'item')
    for author in obj:
        mydata = mydata.append({'author_id': author["id"], 'venue_raw': venue.venue_raw, 'year': paper["year"], 'number_of_times': 1}, ignore_index=True)

if __name__ == '__main__':
    p = Pool(4)
    filename = 'C:/Users/dintz/Documents/finaldata/dblp.v12.json'
    with open(filename, encoding='UTF-8') as infile:
        papers = ijson.items(infile, 'item')
        for paper in papers:
            p.apply_async(parse, (paper,))
    p.close()
    p.join()
    # aggregate the collected rows and compute a rating per author/venue
    mydata = mydata.groupby(by=['author_id', 'venue_raw', 'year'], axis=0, as_index=False).sum()
    mydata = mydata.groupby(by=['author_id', 'venue_raw'], axis=0, as_index=False, group_keys=False).apply(lambda x: sum((1 + x.year - x.year.min()) * numpy.log10(x.number_of_times + 1)))
    df = mydata.index.to_frame(index=False)
    df = pd.DataFrame({'author_id': df["author_id"], 'venue_raw': df["venue_raw"], 'rating': mydata.values[:, 2]})
    for index, row in df.iterrows():
        author_id = row['author_id']
        venue = Venues.objects.get(venue_raw = row['venue_raw'])
        rating = Ratings(author_id = author_id, venue = venue, rating = row['rating'])
        rating.save()
Can anyone help me?
I had to do some extrapolation and make some assumptions, but it looks like you're working with Django models here.

Populating your SQL database can be done quite neatly along the lines of the sketch below. Compared with the code in the question:

- it uses the tqdm package, so you get a progress indication;
- it uses a PaperAuthor model and a Venue model, with get_or_create and create;
- it can also be run without the database models (or indeed without Django), just with the dataset you're using;
- on my machine it consumes practically no memory, since the records are (or would be) dumped into the SQL database rather than into an ever-growing, fragmenting dataframe in memory.
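The code listing this answer refers to did not survive in this copy; a minimal sketch of the approach it describes might look like the following, assuming Django models named Venue, Paper and PaperAuthor with the fields shown (those names, and the app import path, are my assumptions rather than the answer's):

import ijson
from tqdm import tqdm
from myapp.models import Venue, Paper, PaperAuthor  # hypothetical app and models

def load(filename):
    # meant to run inside a Django context, e.g. a management command
    with open(filename, encoding='UTF-8') as infile:
        # ijson yields one paper dict at a time, so the 12 GB file never has to fit in memory
        for paper in tqdm(ijson.items(infile, 'item')):
            venue_data = paper.get('venue', {})
            # get_or_create avoids inserting the same venue twice
            venue, _ = Venue.objects.get_or_create(
                venue_raw=venue_data.get('raw', ''),
                venue_type=venue_data.get('type'),
            )
            db_paper = Paper.objects.create(
                paper_id=paper['id'],
                paper_title=paper['title'],
                venue=venue,
            )
            # one PaperAuthor row per author, written straight to the database
            # instead of being appended to an in-memory dataframe
            for author in paper.get('authors', []):
                PaperAuthor.objects.create(
                    paper=db_paper,
                    author_id=author['id'],
                    year=paper.get('year'),
                )

load('C:/Users/dintz/Documents/finaldata/dblp.v12.json')

The memory behaviour described above comes from writing each record to the database as soon as it is parsed, rather than accumulating rows in a dataframe.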
The pandas processing is left as an exercise for the reader ;-), but I'd imagine reading this preprocessed data back out of the database would involve pd.read_sql().
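As a rough illustration of that step (the database path, table name and column names here are assumptions for illustration only), reading the rows back into pandas could look like:

import sqlite3
import pandas as pd

# database file and table name are assumptions, adjust to your project
con = sqlite3.connect('db.sqlite3')
mydata = pd.read_sql(
    'SELECT author_id, venue_id, year, COUNT(*) AS number_of_times '
    'FROM myapp_paperauthor GROUP BY author_id, venue_id, year',
    con,
)
# ...then apply the rating computation from the question to `mydata`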