I have two very large dataframes, df1 and df2. Their shapes are as follows:
print(df1.shape) #444500 x 3062
print(df2.shape) #254232 x 3062
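I know that every row of df2 also appears in df1. What I want is to build a third dataframe that is the difference between the two, that is, all of the rows that appear in df1 but do not appear in df2.
I have tried the following approach from this question: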
df3 = (pd.merge(df2, df1, indicator=True, how='outer')
         .query('_merge=="left_only"')
         .drop('_merge', axis=1))
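But this keeps failing with a MemoryError.
So I am now trying a different approach.
As far as rows are concerned, all I care about is whether two rows are equal, meaning that all of their elements match pairwise: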
[1,2,3]
[1,2,3]
is a match, whereas:
[1,2,3]
[1,3,2]
is not a match.
This is what I am currently trying:
for i in notebook.tqdm(range(svm_data.shape[0])):
    real_row = np.asarray(real_data.iloc[[i]].to_numpy())
    synthetic_row = np.asarray(svm_data.iloc[[i]].to_numpy())
    if (np.array_equal(real_row, synthetic_row)):
        continue
    else:
        list_of_rows.append(list(synthetic_row))
    gc.collect()
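But for some reason this is not looking up the values within the rows themselves, so I am clearly still doing something wrong.
Note that I have also tried: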
df3 = df1[~df1.isin(df2)].dropna(how='all')
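but this produced incorrect results.
How can I (in a memory-efficient way) find all of the rows in one dataframe that do not appear in the other?
Data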
df1:
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,2
df2:
1,0,0.0,0,0,0,0,0,0.0,2
1,0,0.0,0,0,0,0,0,0.0,3
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,2.0,2
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,1,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,8
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,0,0,0,0,0,0.0,4
1,0,0.0,5,0,0,0,0,0.0,4
Let's try concat and groupby to identify the duplicated rows.
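The answer's original code is not reproduced here, so what follows is only a minimal sketch of one way the concat and groupby idea could look, not necessarily what the answerer wrote. The tiny frames, the helper column `_src` and the variable names are assumptions made for illustration. It also assumes there are no NaN values in the columns, since groupby drops NaN keys by default.

```python
import pandas as pd

# Small stand-ins for the real frames (hypothetical data).
df1 = pd.DataFrame({"a": [1, 1, 1, 1], "b": [2, 2, 4, 8]})
df2 = pd.DataFrame({"a": [1, 1], "b": [2, 3]})

cols = list(df1.columns)

# Stack both frames and tag every row with the frame it came from.
# "_src" is a helper column name chosen for this sketch.
stacked = pd.concat(
    [df1.assign(_src=1), df2.assign(_src=2)],
    ignore_index=True,
)

# Group identical rows together and look at the largest source tag in
# each group. A maximum of 1 means those values never occur in df2.
max_src = stacked.groupby(cols, sort=False)["_src"].transform("max")

# Keep only the rows whose values are exclusive to df1.
df3 = stacked.loc[max_src.eq(1), cols]
print(df3)
```

Grouping on all 3062 columns may still be expensive at the sizes in the question, so it is probably worth trying on a slice of the data first.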
Update: another option is to merge against a de-duplicated df2. From the output you should be able to tell which rows you need.
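As a rough sketch of the merge option (again an assumption about the approach, with made-up sample frames and variable names), a left merge of df1 against df2.drop_duplicates() with indicator=True labels each row of df1 as either "both" or "left_only":

```python
import pandas as pd

# Small stand-ins for the real frames (hypothetical data).
df1 = pd.DataFrame({"a": [1, 1, 1, 1], "b": [2, 2, 4, 8]})
df2 = pd.DataFrame({"a": [1, 1, 1], "b": [2, 2, 3]})

cols = list(df1.columns)

# De-duplicating df2 first keeps the merge from multiplying rows when
# the same row occurs several times in both frames.
merged = df1.merge(
    df2.drop_duplicates(),
    on=cols,
    how="left",
    indicator=True,
)

# "_merge" is the column pandas adds when indicator=True.
# "left_only" marks rows of df1 that have no identical row in df2.
df3 = merged.loc[merged["_merge"] == "left_only", cols]
print(merged)
print(df3)
```

Because this is a left merge against unique rows, the result has exactly as many rows as df1, which should make the memory cost easier to predict than an outer merge of the two full frames.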