How to split files by a column using dask groupby

| instrument | time | code     | val           |
|------------|------|----------|---------------|
| 10         | t1   | c1_at_t1 | v_of_c1_at_t1 |
| 10         | t1   | c2_at_t1 | v_of_c2_at_t1 |
| 10         | t2   | c1_at_t2 | v_of_c1_at_t2 |
| 10         | t2   | c3_at_t2 | v_of_c3_at_t2 |
| 11         | t1   | c4_at_t1 | v_of_c4_at_t1 |
| 11         | t1   | c5_at_t1 | v_of_c5_at_t1 |
| 12         | t2   | c6_at_t2 | v_of_c6_at_t2 |
| 13         | t3   | c9_at_t3 | v_of_c9_at_t3 |

| time | code     | val           |
|------|----------|---------------|
| t1   | c1_at_t1 | v_of_c1_at_t1 |
| t1   | c2_at_t1 | v_of_c2_at_t1 |
| t2   | c1_at_t2 | v_of_c1_at_t2 |
| t2   | c3_at_t2 | v_of_c3_at_t2 |
| t7   | c4_at_t7 | v_of_c4_at_t7 |
| t9   | c5_at_t9 | v_of_c5_at_t9 |

2 Answers

User

#1 · Edited 2024-04-23 06:20:55

If each file fits in memory, you can try the following:

import dask.dataframe as dd
import pandas as pd
import numpy as np
import os

Generate dummy files

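The names `fldr_in`, `fldr_out`, and `files` used in the later steps are not defined in the code shown here. Below is a minimal sketch of a dummy-file generator that defines them. The folder names `data_in` and `data_out`, the `file_N.csv` naming, the helper `make_dummy_files`, and the random values are all assumptions made for illustration.

```python
import os

import numpy as np
import pandas as pd


def make_dummy_files(fldr_in, n_files=3, n_rows=1000, seed=0):
    """Write n_files CSVs with columns instrument, time, code, val."""
    rng = np.random.default_rng(seed)
    os.makedirs(fldr_in, exist_ok=True)
    names = []
    for i in range(n_files):
        df = pd.DataFrame({
            # Instruments 10..13, mirroring the example table
            "instrument": rng.integers(10, 14, size=n_rows),
            "time": rng.integers(0, 100, size=n_rows),
            "code": rng.integers(0, 50, size=n_rows),
            "val": rng.random(n_rows),
        })
        name = f"file_{i}.csv"
        df.to_csv(os.path.join(fldr_in, name), index=False)
        names.append(name)
    return names


# Hypothetical folder names; adjust to your own layout.
fldr_in = "data_in"
fldr_out = "data_out"
files = make_dummy_files(fldr_in)
```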

Define the function

For each instrument, the following function saves its rows to parquet at the path fldr_out/instrument=i/fileN.parquet

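def fun(x, fn, fldr_out):
    # x holds the rows of a single instrument taken from the CSV file fn
    inst = x.instrument.unique()[0]
    filename = os.path.basename(fn)
    # Build fldr_out/instrument=<inst>/<name>.parquet
    fn_out = f"{fldr_out}/instrument={inst}/{filename}"
    fn_out = fn_out.replace(".csv", ".parquet")
    os.makedirs(os.path.dirname(fn_out), exist_ok=True)
    # The instrument is already encoded in the path, so drop the column
    x.drop(columns="instrument").to_parquet(fn_out, index=False)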

You can then call it once per group

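for f in files:
    fn = f"{fldr_in}/{f}"
    df = pd.read_csv(fn)
    # Iterate over the groups explicitly: fun is called only for its side
    # effect, and each group keeps its instrument column this way
    for _, group in df.groupby("instrument"):
        fun(group, fn, fldr_out)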

Perform the analysis with dask

Now you can use dask to read the results and perform your analysis

df = dd.read_parquet(fldr_out)

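User

#2 · Edited 2024-04-23 06:20:55

I'm not entirely sure what you need to achieve, but I don't think you need any groupby to solve your problem. It looks to me like a simple filtering problem.

You can just loop over all your files, create a new file per instrument, and append to it.

Also, I don't have sample files to experiment with, but I think you can also use pandas with chunksize to read large CSV files.

Example: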

import pandas as pd
import glob
import os

# rows per chunk; maybe play around to get better performance
chunksize = 1000000

files = glob.glob('./file_*.csv')
for f in files:

    for chunk in pd.read_csv(f, chunksize=chunksize):
        u_inst = chunk['instrument'].unique()

        for inst in u_inst:
            # filter instrument data
            inst_df = chunk[chunk.instrument == inst]
            # filter columns
            inst_df = inst_df[['time', 'code', 'val']]
            # append to instrument file
            # only write header if not exist yet
            inst_file = f'./instrument_{inst}.csv'
            file_exist = os.path.isfile(inst_file)
            # index=False keeps the row index out of the output
            inst_df.to_csv(inst_file, mode='a', header=not file_exist, index=False)
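One way to try this approach on a small input is to wrap the loop in a function. The sketch below is a variant, not the answer's exact code: it iterates over `chunk.groupby("instrument")` instead of filtering once per unique value, and the names `split_by_instrument` and `out_dir` are hypothetical.

```python
import os

import pandas as pd


def split_by_instrument(csv_files, out_dir, chunksize=1_000_000):
    """Append each instrument's rows to out_dir/instrument_<inst>.csv."""
    os.makedirs(out_dir, exist_ok=True)
    for f in csv_files:
        # Read in chunks so a large input never has to fit in memory at once
        for chunk in pd.read_csv(f, chunksize=chunksize):
            for inst, inst_df in chunk.groupby("instrument"):
                inst_file = os.path.join(out_dir, f"instrument_{inst}.csv")
                # Write the header only when the file is first created
                write_header = not os.path.isfile(inst_file)
                inst_df[["time", "code", "val"]].to_csv(
                    inst_file, mode="a", header=write_header, index=False
                )
```

Since the output files are opened in append mode, running this twice duplicates the rows. Clear `out_dir` before re-running.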
