Reading multiple large zipped (ZIP) CSV files in Python and extracting data

Published 2024-05-15 22:18:44


I downloaded world trade (export and import) data, by country and year, from a trade database as ZIP files covering 1989 through 2020. Each ZIP file holds one year of data and decompresses into a very large CSV file (5 GB+). I want to read all the ZIP files and extract certain information for a specific country. However, my code below takes an extremely long time to run. For a single year it works fine, but across all the years it is very slow. At the end of the process I want to merge all the DataFrames into one unified DataFrame. Any ideas on how to speed up the processing, and how to loop over all 32 DataFrames to merge them, would be appreciated.

Sample of the 2008 CSV file:

(see the link below) [1]: https://i.stack.imgur.com/V9jDd.png


import pandas as pd

reporter = 'Egypt'

filename="All 2020.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df1 = pd.concat(tp, ignore_index=True)
rows = df1[df1['Reporter'] == reporter]
#rows.to_csv('filename.csv')

filename="All 2019.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df2 = pd.concat(tp, ignore_index=True)
rows = df2[df2['Reporter'] == reporter]
#rows.to_csv('filename.csv')

filename="All 2018.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df3 = pd.concat(tp, ignore_index=True)
rows = df3[df3['Reporter'] == reporter]

# ... the same four-line block is repeated, unchanged except for the year
# and the DataFrame number (df4 ... df31), for every year down to 1990 ...

filename="All 1989.zip"
tp = pd.read_csv(filename, iterator=True, chunksize=10000)
df32 = pd.concat(tp, ignore_index=True)
rows = df32[df32['Reporter'] == reporter]

Tags: csv, true, read, index, reporter, all, zip, filename
1 Answer

#1 · posted by a forum user
import gc
from glob import glob
import pandas as pd

reporter = 'Egypt'

for file in glob('*.zip'):  # loop over all ZIP files
    data = []
    for chunk in pd.read_csv(file, chunksize=10000):
        # keep only this reporter's rows from each chunk, so the
        # full 5 GB file is never held in memory at once
        data.append(chunk[chunk['Reporter'] == reporter])
    df = pd.concat(data, ignore_index=True)
    df.to_csv(file.split('.')[-2] + '.csv', index=False)
    gc.collect()  # free the chunk list before the next file (may or may not help)
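The loop above writes one filtered CSV per year, but the asker's final goal was a single unified DataFrame. A minimal sketch of that last step, assuming each ZIP contains a single CSV with a `Reporter` column and the file names match a glob pattern such as `All *.zip` (the `load_reporter` helper name is made up for illustration):

```python
from glob import glob
import pandas as pd

def load_reporter(pattern, reporter):
    """Stream every matching ZIP in chunks, keep only the given
    reporter's rows, and return one combined DataFrame."""
    frames = []
    for path in sorted(glob(pattern)):
        # pandas decompresses the ZIP transparently; chunksize keeps
        # memory flat even for 5 GB+ files
        for chunk in pd.read_csv(path, chunksize=100_000):
            match = chunk[chunk['Reporter'] == reporter]
            if not match.empty:
                frames.append(match)
    return pd.concat(frames, ignore_index=True)
```

Because the filter runs inside the chunk loop, only the selected country's rows (a small fraction of each file) ever accumulate in memory, which is where the original per-year version lost most of its time and RAM.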
