Multiprocessing for large datasets in Python

Code:

file1 = open('./R.csv', 'r').readlines()
file2 = open('./N.csv', 'r').readlines()

Define the dictionaries:

Dict1 = {}
Dict2 = {}

Store the first column of file1 as dictionary values:

for k1 in range(0, len(file1)):
    d1 = file1[k1].split(',')[0]
    Dict1[k1] = d1
    #print(Dict1[1])

Store the first column of file2 as dictionary values:

for k2 in range(0, len(file2)):
    d2 = file2[k2].split(',')[0]
    Dict2[k2] = d2
    #print(Dict2[0])

Check, row by row, whether an element of Dict1 matches an element of Dict2, and if so print the matching rows from file1 and file2:

for i in range(0, len(file1)):
    for j in range(0, len(file2)):
        if Dict1[i] in Dict2[j]:
            print(Dict1[i] + "," + file1[i].split(',')[1].strip() + "," + file2[j].split(',')[1].strip())

This code works, but because both files are huge datasets, it takes a very long time to finish. I would like to use all 64 CPUs of the server workstation, but I don't know how.

I tried following the links below, but somehow got stuck.

https://stackoverflow.com/questions/914821/producer-consumer-problem-with-python-multiprocessing
https://www.youtube.com/watch?v=sp7EhjLkFY4
https://www.youtube.com/watch?v=aysceqdGFw8

Thank you very much for your help.

Many thanks. Cheers.

1 answer

User

Answer 1 · posted 2024-04-20 13:23:11

First, I would try this with pandas:

import pandas as pd

df_r = pd.read_csv('./R.csv', header=None)   # check if standard delimiter ',' works...
df_n = pd.read_csv('./N.csv', header=None)   # ... otherwise add e.g. sep=r'\s+,\s+'

# rows of R.csv whose first column also appears in the first column of N.csv
print(df_r[df_r[0].isin(df_n[0])])
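The filter above only shows rows from R.csv. If you also need the second column of the matching N.csv row on the same output line, as in the nested loop from the question, an inner join may be closer to what you want. This is only a minimal sketch: the column names `key`, `r_value` and `n_value` are made up for illustration, the two `io.StringIO` objects are hypothetical stand-ins for the real files, and it assumes the first columns should match exactly.

```python
import io

import pandas as pd

# Hypothetical stand-ins for ./R.csv and ./N.csv (key in column 0, value in column 1).
r_csv = io.StringIO("a,1\nb,2\nc,3\n")
n_csv = io.StringIO("b,20\nc,30\nd,40\n")

# The column names are illustrative; the real files have no header row.
df_r = pd.read_csv(r_csv, header=None, names=["key", "r_value"])
df_n = pd.read_csv(n_csv, header=None, names=["key", "n_value"])

# An inner join keeps only the keys that occur in both files.
matches = df_r.merge(df_n, on="key", how="inner")

# Print as "key,r_value,n_value", one line per matching pair of rows.
print(matches.to_csv(index=False, header=False), end="")
```

With the real data you would pass './R.csv' and './N.csv' to `pd.read_csv` instead of the in-memory samples.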

Maybe this approach already works for you.
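If you would rather stay without pandas, the nested loop from the question can probably be replaced by a dictionary lookup, which may already make multiple processes unnecessary. The following is a rough sketch, not a drop-in replacement: `match_rows` is a hypothetical helper name, the sample data is invented, and it assumes that "the same" means the first columns are exactly equal (the original `in` test also accepts substrings).

```python
import csv
import io
from collections import defaultdict


def match_rows(r_lines, n_lines):
    """Yield (key, r_value, n_value) for rows whose first columns are equal."""
    # Index N by its first column in a single pass.
    n_index = defaultdict(list)
    for row in csv.reader(n_lines):
        if len(row) >= 2:
            n_index[row[0]].append(row[1].strip())

    # Single pass over R with dictionary lookups instead of an inner loop.
    for row in csv.reader(r_lines):
        if len(row) >= 2:
            for n_value in n_index.get(row[0], ()):
                yield row[0], row[1].strip(), n_value


# Hypothetical sample data standing in for ./R.csv and ./N.csv.
r_sample = io.StringIO("a,1\nb,2\nc,3\n")
n_sample = io.StringIO("b,20\nc,30\nd,40\n")

for key, r_value, n_value in match_rows(r_sample, n_sample):
    print(f"{key},{r_value},{n_value}")
```

For the real files you could pass two open file objects, for example from `open('./R.csv', newline='')` and `open('./N.csv', newline='')`. Each file is then read once, so the work grows with the sum of the two file lengths rather than their product.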
