How to parallelize a DataFrame's apply() method

import pandas as pd
import time

def enrich_str(text):
    val1 = f'{text}_1'
    val2 = f'{text}_2'
    val3 = f'{text}_3'
    time.sleep(3)  # simulate a slow enrichment step
    return val1, val2, val3

def enrich_row(passed_row):
    col_name = str(passed_row['colName'])
    my_string = str(passed_row[col_name])

    val1, val2, val3 = enrich_str(my_string)

    passed_row['enriched1'] = val1
    passed_row['enriched2'] = val2
    passed_row['enriched3'] = val3

    return passed_row

df = pd.DataFrame({'numbers': [1, 2, 3, 4, 5], 'colors': ['red', 'white', 'blue', 'orange', 'red']},
                  columns=['numbers', 'colors'])

df['colName'] = 'colors'

tic = time.perf_counter()
# enrich_row reads the column name from the row itself, so no extra keyword argument is passed
enriched_df = df.apply(enrich_row, axis=1)
toc = time.perf_counter()

print(f"{df.shape[0]} rows enriched in {toc - tic:0.4f} seconds")
print(enriched_df)

import multiprocessing as mp

tic = time.perf_counter()
pool = mp.Pool(5)
result = pool.imap(enrich_row, df.itertuples(), chunksize=1)
pool.close()
pool.join()
toc = time.perf_counter()

print(f"{df.shape[0]} rows enriched in {toc - tic:0.4f} seconds")
result

2 Answers

User

#1 · Edited 2024-05-16 10:07:40

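I accepted @albert's answer because it works on Linux. That said, I found that the Dask dataframe's apply() method was a real step forward. As I mentioned in an earlier comment, the operation initially did not run in parallel on a 120-row dataset. I later found that those 120 rows occupied only a single partition of the Dask dataframe, so repartitioning was enough to get the parallelism I wanted. Here is a code example using Dask (it raises some strange warnings...)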

User

#2 · Edited 2024-05-16 10:07:40

I suggest using the pathos fork of multiprocessing, since it handles pickling of dataframes better. imap returns an iterator, not a DataFrame, so you have to convert it back:

# Reuses pd, time and enrich_str from the question's code.
from pathos.multiprocessing import Pool

def enrich_row(row_tuple):
    # df.iterrows() yields (index, row) tuples; the row is the second item
    passed_row = row_tuple[1]
    col_name = str(passed_row['colName'])
    my_string = str(passed_row[col_name])

    val1, val2, val3 = enrich_str(my_string)

    passed_row['enriched1'] = val1
    passed_row['enriched2'] = val2
    passed_row['enriched3'] = val3

    return passed_row

df = pd.DataFrame({'numbers': [1, 2, 3, 4, 5], 'colors': ['red', 'white', 'blue', 'orange', 'red']},
                  columns=['numbers', 'colors'])

df['colName'] = 'colors'

tic = time.perf_counter()
pool = Pool(8)
result = pool.imap(enrich_row, df.iterrows(), chunksize=1)
df = pd.DataFrame(result)  # consumes the lazy iterator, one Series per row
pool.close()
pool.join()
toc = time.perf_counter()

print(f"{df.shape[0]} rows enriched in {toc - tic:0.4f} seconds")
print(df)

Note that I am using df.iterrows(), which returns an iterator of (row_number, row) tuples, so I modified enrich_row to handle this format.
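To see why the function has to unpack its argument, here is a small sketch that swaps the pool for the built-in map, so the data flow is visible without any worker processes. The function tag_row and the three-row frame are made up for illustration and are not part of the answer.

```python
import pandas as pd


def tag_row(row_tuple):
    # df.iterrows() yields (index_label, row_as_Series) pairs
    index_label, row = row_tuple
    row = row.copy()  # avoid mutating the Series handed out by iterrows
    row["tagged"] = f"{row['colors']}_1"
    return row


df = pd.DataFrame({"numbers": [1, 2, 3], "colors": ["red", "white", "blue"]})

# Like Pool.imap, map returns a lazy iterator rather than a DataFrame
lazy_result = map(tag_row, df.iterrows())

# Each returned Series becomes one row; its name supplies the index label
enriched = pd.DataFrame(list(lazy_result))
print(enriched)
```

Replacing map with pool.imap should leave the rest unchanged, since both hand back the rows lazily and in input order.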
