Converting two dataframes to numpy arrays for pairwise comparison

Posted 2024-04-27 17:58:18


I have two very large dataframes, df1 and df2. Their dimensions are as follows:

print(df1.shape) #444500 x 3062
print(df2.shape) #254232 x 3062

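I know that every value of df2 appears in df1. What I want to do is build a third dataframe that is the difference of the two, that is, all of the rows that appear in df1 but do not appear in df2.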

I have tried the following approach from this question:

df3 = (pd.merge(df2,df1, indicator=True, how='outer')
            .query('_merge=="left_only"').drop('_merge', axis=1))

But this keeps failing with a MemoryError.

So I am now trying to do the following:

  1. Loop over every row of df1
  2. Check whether that row appears in df2
  3. If it does, skip it
  4. If it does not, add it to a list
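The steps above can also be sketched without an explicit Python loop. This is only a rough sketch, not the asker's code: it swaps the row-by-row check for a hash-based membership test using `pd.util.hash_pandas_object`, which reduces each row to one 64-bit integer. The helper name `rows_not_in` is made up for illustration, and the sketch assumes both frames have the same columns, in the same order, with the same dtypes. A hash collision could in principle hide a row, although that is very unlikely.

```python
import pandas as pd

def rows_not_in(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rows of df1 whose row hash does not occur in df2."""
    # One uint64 hash per row; index=False so only the cell values are hashed.
    h1 = pd.util.hash_pandas_object(df1, index=False)
    h2 = pd.util.hash_pandas_object(df2, index=False)
    # Rows of df1 whose hash is absent from df2 are kept (steps 2 to 4).
    keep = ~h1.isin(h2).to_numpy()
    return df1[keep]

# Small made-up sample, not the real data
df1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame([[4, 5, 6], [7, 8, 9]])
result = rows_not_in(df1, df2)
print(result)
```

Only two arrays of hashes are compared, so the extra memory should stay small relative to the frames themselves.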

As far as rows are concerned, all I care about is that the data are equal, meaning that all elements match pairwise:

[1,2,3]
[1,2,3]

is a match, whereas:

[1,2,3]
[1,3,2]

is not a match.
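To make the matching rule concrete, here is a small sketch using `np.array_equal`, which compares two arrays element by element, position by position. The names `a`, `b` and `c` are made up for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 2, 3])
c = np.array([1, 3, 2])

# Same values in the same positions: counts as a match.
same = np.array_equal(a, b)
# Same values but in a different order: does not count as a match.
reordered = np.array_equal(a, c)
print(same, reordered)
```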

This is what I am trying at the moment:

for i in notebook.tqdm(range(svm_data.shape[0])):
    real_row = np.asarray(real_data.iloc[[i]].to_numpy())
    synthetic_row = np.asarray(svm_data.iloc[[i]].to_numpy())
    if (np.array_equal(real_row, synthetic_row)):
        continue
    else:
        list_of_rows.append(list(synthetic_row))
    gc.collect()

But for some reason this is not looking up the values within the rows themselves, so I am clearly still doing something wrong.

Note that I have also tried: df3 = df1[~df1.isin(df2)].dropna(how='all')

But this produced incorrect results.
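A likely reason for the incorrect results, shown here as a small sketch on made-up data: when `DataFrame.isin` is given another DataFrame, it compares cell by cell and only where both the index label and the column label match. It does not ask whether a whole row occurs anywhere in the other frame.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# The row [7, 8, 9] exists in df1, but at index label 2 rather than 0.
df2 = pd.DataFrame([[7, 8, 9]])

# Cells are compared label against label, so nothing is flagged as present.
cellwise = df1.isin(df2)
print(cellwise)

# As a consequence the attempted difference still contains [7, 8, 9].
wrong = df1[~cellwise].dropna(how='all')
print(wrong)
```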

How can I (in a memory-efficient way) find all of the rows in one dataframe that do not appear in the other?

Data

df1:

1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2

df2:

1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,3
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,2.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,1,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,5,0,0,0,0,0.0,4

1 answer
User
Reply 1 · posted 2024-04-27 17:58:18

Let's try concat and groupby to identify the duplicated rows:

import pandas as pd

# sample data
df1 = pd.DataFrame([[1,2,3],[1,2,3],[4,5,6],[7,8,9]])
df2 = pd.DataFrame([[4,5,6],[7,8,9]])

s = (pd.concat((df1,df2), keys=(1,2))
       .groupby(list(df1.columns))
       .ngroup()
    )

# `s.loc[1]` corresponds to rows in df1
# `s.loc[2]` corresponds to rows in df2
df1_in_df2 = s.loc[1].isin(s.loc[2])

df1[df1_in_df2]

Output:

   0  1  2
2  4  5  6
3  7  8  9
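The output above lists the rows of df1 that are also in df2. Since the question asks for the opposite set, a minimal sketch of the last step is to negate the mask. The name `only_in_df1` is made up for illustration, and the sample data is repeated so the snippet runs on its own.

```python
import pandas as pd

# Same sample data as in the answer
df1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame([[4, 5, 6], [7, 8, 9]])

# Give every distinct row across both frames a group number.
s = (pd.concat((df1, df2), keys=(1, 2))
       .groupby(list(df1.columns))
       .ngroup()
    )

# True where a df1 row shares its group number with some df2 row.
df1_in_df2 = s.loc[1].isin(s.loc[2])

# Negating the mask keeps the rows that appear only in df1.
only_in_df1 = df1[~df1_in_df2]
print(only_in_df1)
```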

Update: another option is to merge against a de-duplicated df2:

df1.merge(df2.drop_duplicates(), on=list(df1.columns), indicator=True, how='left')

Output (you should be able to work out from this which rows you need):

   0  1  2     _merge
0  1  2  3  left_only
1  1  2  3  left_only
2  4  5  6       both
3  7  8  9       both
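For completeness, here is a sketch of how the needed rows could be picked out of the merged frame: keep the rows marked `left_only` and drop the indicator column. The names `merged` and `only_in_df1` are made up for illustration. Note that a column-on-column merge ignores df1's original index, so the result is numbered from 0.

```python
import pandas as pd

# Same sample data as in the answer
df1 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame([[4, 5, 6], [7, 8, 9]])

merged = df1.merge(df2.drop_duplicates(), on=list(df1.columns),
                   indicator=True, how='left')

# 'left_only' marks rows of df1 that have no matching row in df2.
only_in_df1 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(only_in_df1)
```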
