read_sav and pyreadstat are too slow. If I have to use the SAV/SPSS file format, how can I speed up the processing of large data?

1 answer

User

#1 · Posted on 2024-05-16 20:12:42

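You can try reading the file in parallel.

As an example, I have a file "big.sav" with 294,000 rows x 666 columns. Reading it with pyreadstat.read_sav (which is what pd.read_spss uses in the background) takes 115 seconds. By parallelizing the read, I got it down to 29 seconds.

First, I create a worker.py file: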

def worker(inpt):
    import pyreadstat
    offset, chunksize, path = inpt
    df, meta = pyreadstat.read_sav(path, row_offset=offset, row_limit=chunksize)
    return df

Then, in the main script, I have this:

import multiprocessing as mp
from time import time

import pandas as pd
import pyreadstat

from worker import worker

if __name__ == "__main__":  # needed on platforms that spawn worker processes (Windows, macOS)
    # read only the metadata to get the number of rows in the file
    _, meta = pyreadstat.read_sav("big.sav", metadataonly=True)
    numrows = meta.number_rows
    # number of cores in the machine; this could also be set manually, e.g. 8
    numcores = mp.cpu_count()
    # calculate the chunksize and offsets; the last chunk may hold fewer rows
    divs = [numrows // numcores + (1 if x < numrows % numcores else 0) for x in range(numcores)]
    chunksize = divs[0]
    offsets = [indx * chunksize for indx in range(numcores)]
    # pack the data for the jobs
    jobs = [(x, chunksize, "big.sav") for x in offsets]

    # the pool is shut down automatically when the with block exits
    with mp.Pool(processes=numcores) as pool:
        # let's go!
        t0 = time()
        chunks = pool.map(worker, jobs)
        t1 = time()
    print(t1 - t0)  # this prints 29 seconds
    # chunks is a list of dataframes in the right order
    # you can concatenate all the chunks into a single big dataframe if you like
    final = pd.concat(chunks, axis=0, ignore_index=True)
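
The script above gives every job the same chunk size and relies on the last chunk simply coming back shorter. If the file could have fewer rows than there are cores, some offsets would point past the end of the file. As a minimal sketch (the helper name plan_chunks is hypothetical, not part of pyreadstat), each job can instead get its own row limit, and empty jobs can be dropped:

```python
def plan_chunks(numrows, numchunks):
    """Split numrows rows into at most numchunks (offset, limit) pairs.

    The pairs are contiguous, in order, and cover every row exactly once.
    Hypothetical helper for illustration; not part of pyreadstat.
    """
    if numrows < 0 or numchunks < 1:
        raise ValueError("numrows must be >= 0 and numchunks must be >= 1")
    base, extra = divmod(numrows, numchunks)
    plan = []
    offset = 0
    for i in range(numchunks):
        # the first `extra` chunks take one additional row each
        limit = base + (1 if i < extra else 0)
        if limit == 0:
            break  # fewer rows than chunks: skip the empty jobs
        plan.append((offset, limit))
        offset += limit
    return plan


if __name__ == "__main__":
    print(plan_chunks(10, 4))
```

The jobs list for the worker would then be built as [(offset, limit, "big.sav") for offset, limit in plan_chunks(numrows, numcores)], which matches the (offset, chunksize, path) tuple that worker unpacks.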

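Edit:

pyreadstat version 1.0.3 brought a big performance improvement of about 5x.
In addition, a new function "read_file_multiprocessing" was added, which is a wrapper around the code shared earlier in this answer. It can give a further 3x improvement, for a total of up to 15x compared with the previous version.

You can use the function like this: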

import pyreadstat

fpath = "path/to/file.sav" 
df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath)
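
If you want to control the number of worker processes or load only some columns, the call above can be wrapped as in the following sketch. It assumes that read_file_multiprocessing accepts a num_processes argument and forwards extra keyword arguments such as usecols to read_sav, as recent pyreadstat releases document; the two helper names and the column names in the usage note are hypothetical.

```python
import multiprocessing as mp


def parallel_read_kwargs(num_processes=None, columns=None):
    """Build keyword arguments for pyreadstat.read_file_multiprocessing.

    Hypothetical helper for illustration; not part of pyreadstat.
    """
    kwargs = {}
    if num_processes is not None:
        # use at least one worker and never more workers than cores
        kwargs["num_processes"] = max(1, min(num_processes, mp.cpu_count()))
    if columns:
        # assumption: usecols is passed through to read_sav,
        # so only these columns are loaded
        kwargs["usecols"] = list(columns)
    return kwargs


def read_sav_parallel(path, num_processes=None, columns=None):
    """Read a SAV file in parallel and return (DataFrame, metadata)."""
    import pyreadstat  # imported lazily so the helper above works without it

    return pyreadstat.read_file_multiprocessing(
        pyreadstat.read_sav, path, **parallel_read_kwargs(num_processes, columns)
    )
```

A call such as df, meta = read_sav_parallel("path/to/file.sav", num_processes=4, columns=["age", "income"]) should be placed under an if __name__ == "__main__": guard, because the worker processes may re-import the main script on Windows and macOS.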
