Python/Pandas: Replacing certain values in multiple columns of a large dataset

for chunk in pd.read_csv(filePrev, chunksize=10000, header=None):
    chunk[chunk[list1] >= 7] = np.nan
    chunk[chunk[list2] >= 90] = np.nan
    ...
    chunk.to_csv(newFile, mode='a', header=False, index=False)

In [11]: df = pd.read_csv(filePrev,nrows=5,usecols=[1,2,3,4,5,6,7],header=None)

In [12]: df
Out[12]:
   1  2  3  4  5  6  7
0  1  1  1  1  1  1  1
1  3  1  1  1  2  1  1
2  3  1  1  1  1  1  1
3  3  1  1  1  2  1  2
4  3  1  1  1  1  1  1

In [13]: list = [1,7]

In [14]: df[df[list] > 1] = np.nan

In [15]: df
Out[15]:
     1  2  3  4  5  6    7
0    1  1  1  1  1  1    1
1  NaN  1  1  1  2  1    1
2  NaN  1  1  1  1  1    1
3  NaN  1  1  1  2  1  NaN
4  NaN  1  1  1  1  1    1

2 Answers

User

Answer 1 · edited 2024-04-25 13:38:22

You can improve this by keeping the output file open instead of reopening it in append mode on every iteration:

with open(newFile, 'a') as f:
    for chunk in pd.read_csv(filePrev,chunksize=10000,header=None):
        chunk[chunk[list1] >= 7] = np.nan
        chunk[chunk[list2] >= 90] = np.nan
        chunk.to_csv(f, header=False, index=False)

Someone recently reported this behaviour here, and this change gave them a 98.3% performance gain on Windows (I only saw about 25% on OS X).
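A self-contained sketch of the same pattern, for trying it out without real files. The function name `clean_in_chunks`, its parameters, and the sample data are illustrative assumptions, and in-memory buffers stand in for `filePrev` and `newFile`. It also swaps the boolean-frame assignment for `DataFrame.mask`, which produces the same result here and avoids writing NaN in place into integer columns.

```python
import io

import pandas as pd


def clean_in_chunks(src, dst, cols_a, cols_b, chunksize=10000):
    """Stream CSV rows from src, blank out large values, and write to dst.

    dst is an already-open handle, so it is opened and closed only once.
    """
    for chunk in pd.read_csv(src, chunksize=chunksize, header=None):
        # DataFrame.mask replaces cells where the condition is True with NaN.
        chunk[cols_a] = chunk[cols_a].mask(chunk[cols_a] >= 7)
        chunk[cols_b] = chunk[cols_b].mask(chunk[cols_b] >= 90)
        chunk.to_csv(dst, header=False, index=False)


# In-memory buffers stand in for filePrev and newFile (illustrative data).
src = io.StringIO("1,10,5\n8,95,6\n3,20,7\n")
dst = io.StringIO()
clean_in_chunks(src, dst, cols_a=[0, 2], cols_b=[1], chunksize=2)

# Read the cleaned output back; empty CSV fields come back as NaN.
result = pd.read_csv(io.StringIO(dst.getvalue()), header=None)
print(result)
```

With real files, you would pass the input path as `src` and a handle from `open(newFile, 'a')` as `dst`, as in the snippet above.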

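If you run your Python code under the profile module or IPython's %prun, you can see which calls take the most time and which functions are called most often. In the question I was referring to above, most of the time was spent in Python's close function: unless the file is kept open, it is closed after every call to to_csv.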
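As a rough illustration of that kind of profiling, here is a minimal sketch using the standard library's `cProfile` and `pstats`. The helper names `append_reopening` and `profile_report` are made up for this example, and the plain `open`/`write` loop is only a stand-in for calling `to_csv(path, mode='a')` once per chunk.

```python
import cProfile
import io
import os
import pstats
import tempfile


def append_reopening(path, n_writes):
    # Hypothetical stand-in for appending one chunk per iteration:
    # the file is opened and closed on every pass through the loop.
    for i in range(n_writes):
        with open(path, "a") as f:
            f.write(f"{i}\n")


def profile_report(func, *args, top=10):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


with tempfile.TemporaryDirectory() as tmp:
    report = profile_report(append_reopening, os.path.join(tmp, "out.txt"), 200)
print(report)
```

The report lists call counts and timings per function, so repeated open and close calls stand out if they dominate. In IPython, `%prun` followed by the statement to profile gives a similar table.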

Note: the logic looks fine; you are not assigning to a copy. As you can see in your small example, the code works!

User

Answer 2 · edited 2024-04-25 13:38:22

The problem was in the code that processed certain individual columns. There were lines like this:

chunk[chunk[393] > 50] = np.nan

instead of

chunk[chunk[[393]] > 50] = np.nan

If, for some row N,

chunk[393][N] > 50

then the whole row was set to NaN, not just the value in column 393.
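The difference between indexing with a single column label and with a list of labels can be seen on a tiny frame. This is a sketch under assumptions: the data and the second column, 394, are invented for illustration, and it assumes the intended form selects the column with a list so that the mask is a DataFrame instead of a Series.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({393: [1.0, 60.0, 3.0], 394: [10.0, 20.0, 30.0]})

# Series mask: selects whole rows, so every column in a matching row becomes NaN.
rows = df.copy()
rows[rows[393] > 50] = np.nan

# DataFrame mask (note the double brackets): only the matching cells become NaN.
cells = df.copy()
cells[cells[[393]] > 50] = np.nan

print(rows)
print(cells)
```

In `rows`, the second row is NaN in both columns; in `cells`, only the value in column 393 is replaced and column 394 keeps its 20.0.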

Thanks, everyone, for the help.
