How to stream and manipulate large data files in Python

2 Answers

User

Answer 1 · edited 2024-06-16 10:42:03

You can use dask, which is syntactically similar to pandas but performs its operations out of core, so memory should not be a problem:

import dask.dataframe as dd

df = dd.read_csv('my_file.csv')
df = df.groupby('Geography')['Count'].sum().to_frame()
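# single_file=True writes one CSV; by default dask writes one file per partition.
df.to_csv('my_output.csv', single_file=True)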

Alternatively, if pandas is a requirement, you can read the file in chunks, as @chrisaycock mentioned. You may want to experiment with the chunksize parameter.

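import pandas as pd

# Operate on chunks: reduce each chunk to one row per Geography.
data = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
    data.append(chunk)

# Combine the chunked data. The same Geography can appear in several
# chunks, so the partial sums have to be aggregated once more.
df = pd.concat(data, ignore_index=True)
df = df.groupby('Geography')['Count'].sum().to_frame()
df.to_csv('my_output.csv')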

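User

Answer 2 · edited 2024-06-16 10:42:03

I do like @root's solution, but I would go further in optimizing memory usage: keep only the aggregated DF in memory, and read only the columns you really need: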

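import pandas as pd

filename = 'my_file.csv'   # assumed input path; the original snippet leaves `filename` undefined
cols = ['Geography','Count']
df = pd.DataFrame()

chunksize = 2   # tiny value for the demo; adjust it, for example to 10**5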
for chunk in (pd.read_csv(filename,
                          usecols=cols,
                          chunksize=chunksize)
             ):
    # merge previously aggregated DF with a new portion of data and aggregate it again
    df = (pd.concat([df,
                     chunk.groupby('Geography')['Count'].sum().to_frame()])
            .groupby(level=0)['Count']
            .sum()
            .to_frame()
         )

df.reset_index().to_csv('c:/temp/result.csv', index=False)

Test data:

Geography,AgeGroup,Gender,Race,Count
County1,1,M,1,12
County2,2,M,1,3
County3,2,M,2,0
County1,1,M,1,12
County2,2,M,1,33
County3,2,M,2,11
County1,1,M,1,12
County2,2,M,1,111
County3,2,M,2,1111
County5,1,M,1,12
County6,2,M,1,33
County7,2,M,2,11
County5,1,M,1,12
County8,2,M,1,111
County9,2,M,2,1111

output.csv:

Geography,Count
County1,36
County2,147
County3,1122
County5,24
County6,33
County7,11
County8,111
County9,1111
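To check the chunked aggregation against the test data above without touching the file system, here is a minimal sketch. The names `TEST_CSV` and `aggregate_in_chunks` are illustrative and not part of the original answer, and the running total is kept as a Series rather than a DataFrame, which is a small variation on the code above.

```python
import io

import pandas as pd

# The test data from the answer, inlined so the example is self-contained.
TEST_CSV = """Geography,AgeGroup,Gender,Race,Count
County1,1,M,1,12
County2,2,M,1,3
County3,2,M,2,0
County1,1,M,1,12
County2,2,M,1,33
County3,2,M,2,11
County1,1,M,1,12
County2,2,M,1,111
County3,2,M,2,1111
County5,1,M,1,12
County6,2,M,1,33
County7,2,M,2,11
County5,1,M,1,12
County8,2,M,1,111
County9,2,M,2,1111
"""


def aggregate_in_chunks(source, chunksize):
    """Sum Count per Geography, holding one chunk plus the running totals."""
    totals = None
    for chunk in pd.read_csv(source, usecols=["Geography", "Count"],
                             chunksize=chunksize):
        partial = chunk.groupby("Geography")["Count"].sum()
        # Align on Geography; a key missing on either side counts as 0.
        totals = partial if totals is None else totals.add(partial, fill_value=0)
    # Alignment with fill_value can produce floats, so cast back to integers.
    return totals.astype("int64").sort_index()


result = aggregate_in_chunks(io.StringIO(TEST_CSV), chunksize=2)
print(result.to_string())
```

The totals should match output.csv above regardless of the chunksize chosen, since each chunk only contributes partial sums.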

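PS: using this approach you can process huge files.

PPS: the chunked approach should work unless you need to sort your data. In that case I would use classic UNIX tools such as awk, sort, etc. to sort the data first.

I would also recommend using PyTables (HDF5 storage) instead of CSV files. It is very fast and lets you read data conditionally (using the where parameter), so it is very handy, saves a lot of resources, and is usually much faster than CSV.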
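As a rough illustration of the HDF5 suggestion, the sketch below goes through pandas' HDFStore interface, which needs the PyTables package installed. The helper names `csv_to_hdf` and `read_rows`, the key "counts", and the 32-character limit for Geography are all assumptions made for this example, not something the answer specifies.

```python
import pandas as pd


def csv_to_hdf(csv_path, hdf_path, key="counts", chunksize=10**5):
    """Copy the needed CSV columns into an HDF5 table, one chunk at a time."""
    with pd.HDFStore(hdf_path, mode="w") as store:
        for chunk in pd.read_csv(csv_path, usecols=["Geography", "Count"],
                                 chunksize=chunksize):
            store.append(
                key,
                chunk,
                # Columns listed here can be used in `where` conditions.
                data_columns=["Geography", "Count"],
                # Reserve room for strings longer than those in the first chunk
                # (32 is an assumed upper bound).
                min_itemsize={"Geography": 32},
            )


def read_rows(hdf_path, condition, key="counts"):
    """Load only the rows matching `condition` from the HDF5 table."""
    return pd.read_hdf(hdf_path, key=key, where=condition)
```

For example, after `csv_to_hdf('my_file.csv', 'my_file.h5')`, a call such as `read_rows('my_file.h5', "Geography == 'County1'")` would load only the County1 rows instead of the whole file.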
