Pandas: remove duplicate rows that appear reversed across two columns

9 votes
4 answers
5155 views
Asked 2025-04-18 12:47

I have a two-column dataset that looks like this...

InteractorA InteractorB
AGAP028204  AGAP005846
AGAP028204  AGAP003428
AGAP028200  AGAP011124
AGAP028200  AGAP004335
AGAP028200  AGAP011356
AGAP028194  AGAP008414

I am using Pandas and want to remove rows that appear twice with the two columns swapped. For example, go from this...

InteractorA InteractorB
AGAP002741  AGAP008026
AGAP008026  AGAP002741

to this...

InteractorA InteractorB
AGAP002741  AGAP008026

because, for my purposes, the two rows are the same thing.

Is there a built-in method that handles this?

4 Answers

0

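I was looking for a solution to a similar problem today. A.Kot's answer pointed me in the right direction. Below is a working example; I copied the data preparation from EdChum's answer.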

import io
import pandas as pd

temp = """InteractorA InteractorB
AGAP028204  AGAP005846
AGAP028204  AGAP003428
AGAP028200  AGAP011124
AGAP028200  AGAP004335
AGAP028200  AGAP011356
AGAP028194  AGAP008414
AGAP002741  AGAP008026
AGAP008026  AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')

# One-liner to drop the duplicates.
# frozenset is used because a plain set is unhashable and drop_duplicates needs hashable values.
df.loc[df.apply(lambda x: frozenset(x[['InteractorA', 'InteractorB']]), axis=1).drop_duplicates().index]
0

I think the following should work:

In [37]:
import pandas as pd
import io
temp = """InteractorA InteractorB
AGAP028204  AGAP005846
AGAP028204  AGAP003428
AGAP028200  AGAP011124
AGAP028200  AGAP004335
AGAP028200  AGAP011356
AGAP028194  AGAP008414
AGAP002741  AGAP008026
AGAP008026  AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')
df
Out[37]:
  InteractorA InteractorB
0  AGAP028204  AGAP005846
1  AGAP028204  AGAP003428
2  AGAP028200  AGAP011124
3  AGAP028200  AGAP004335
4  AGAP028200  AGAP011356
5  AGAP028194  AGAP008414
6  AGAP002741  AGAP008026
7  AGAP008026  AGAP002741

I downloaded your data and realized I had misunderstood what you meant, so the following should work now:

# first get the values that are unique
In [72]:
df1 = df[~df.InteractorA.isin(df.InteractorB)]
df1.shape
Out[72]:
(2386, 2)

Now we want to find the duplicated rows, keeping only the first value of each:

In [74]:

df2 = df[df.InteractorA.isin(df.InteractorB)]
df2 = df2.groupby('InteractorA').first().reset_index()
df2.shape
Out[74]:
(3074, 2)

Now concatenate the two data frames:

In [75]:

merged = pd.concat([df1, df2], ignore_index=True)
merged.shape
Out[75]:
(5460, 2)

I think this is correct now.
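
A result like this can be sanity-checked by counting how many rows repeat an earlier unordered pair. The sketch below is only an illustration: the helper name `count_reversed_dups` and the small sample frame are made up here and are not part of the answer above. Note that it also counts exact same-order repeats.

```python
import pandas as pd

def count_reversed_dups(frame, a='InteractorA', b='InteractorB'):
    """Count rows whose unordered (a, b) pair already appeared in an earlier row."""
    # frozenset ignores order and is hashable, so duplicated() can compare the keys
    keys = pd.Series(
        [frozenset(pair) for pair in zip(frame[a], frame[b])],
        index=frame.index,
    )
    return int(keys.duplicated().sum())

# Hypothetical sample: rows 0 and 1 are the same pair in opposite order
sample = pd.DataFrame({
    'InteractorA': ['AGAP002741', 'AGAP008026', 'AGAP028194'],
    'InteractorB': ['AGAP008026', 'AGAP002741', 'AGAP008414'],
})
n_dups = count_reversed_dups(sample)
print(n_dups)
```

A return value of 0 on a deduplicated frame would suggest that no reversed pairs remain.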

2

This is the simplest solution I found for myself.

Create a column in which each row holds a sorted list of the two values.

df['sorted_row'] = [sorted([a,b]) for a,b in zip(df.InteractorA, df.InteractorB)]

drop_duplicates cannot work on a column of lists (lists are unhashable), so convert the column to strings.

df['sorted_row'] = df['sorted_row'].astype(str)

Drop the duplicates:

df.drop_duplicates(subset=['sorted_row'], inplace=True)
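
The same idea can probably be sketched without the string conversion, since a sorted tuple is hashable and can serve as the key directly. This is only a sketch under assumptions: the name `pair_key` and the three-row sample frame are invented for illustration. It also avoids adding a helper column to `df`.

```python
import pandas as pd

# Hypothetical sample: rows 1 and 2 are the same pair in opposite order
df = pd.DataFrame({
    'InteractorA': ['AGAP028204', 'AGAP002741', 'AGAP008026'],
    'InteractorB': ['AGAP005846', 'AGAP008026', 'AGAP002741'],
})

# Order-independent, hashable key for each row
pair_key = pd.Series(
    [tuple(sorted(pair)) for pair in zip(df.InteractorA, df.InteractorB)],
    index=df.index,
)

# Keep only the first occurrence of each unordered pair
result = df[~pair_key.duplicated()]
print(result)
```
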
14

In the end I wrote a simple script that iterates over the rows, checks whether each pair or its reverse has already been seen, and drops the row if so.

import pandas as pd

checklist = set()       # concatenated pairs seen so far, in both orders
indexes_to_drop = []    # index labels of rows that repeat an earlier pair

interactions = pd.read_csv('original_interactions.txt', delimiter='\t')

for index, row in interactions.iterrows():
    check_string = row['InteractorA'] + row['InteractorB']
    check_string_rev = row['InteractorB'] + row['InteractorA']
    # Test each string separately: `(a or b) in checklist` only ever tests `a`
    if check_string in checklist or check_string_rev in checklist:
        indexes_to_drop.append(index)
    checklist.add(check_string)
    checklist.add(check_string_rev)

# iterrows() yields index labels, so drop by label
no_dups = interactions.drop(indexes_to_drop)

print(no_dups.shape)

no_dups.to_csv('no_duplicates.txt', sep='\t', index=False)

2017 update: a few years and some experience later, here is a more elegant solution for anyone looking for something similar:

In [8]: df
Out[8]:
  InteractorA InteractorB
0  AGAP028204  AGAP005846
1  AGAP028204  AGAP003428
2  AGAP028200  AGAP011124
3  AGAP028200  AGAP004335
4  AGAP028200  AGAP011356
5  AGAP028194  AGAP008414
6  AGAP002741  AGAP008026
7  AGAP008026  AGAP002741

In [18]: df['check_string'] = df.apply(lambda row: ''.join(sorted([row['InteractorA'], row['InteractorB']])), axis=1)

In [19]: df
Out[19]:
  InteractorA InteractorB          check_string
0  AGAP028204  AGAP005846  AGAP005846AGAP028204
1  AGAP028204  AGAP003428  AGAP003428AGAP028204
2  AGAP028200  AGAP011124  AGAP011124AGAP028200
3  AGAP028200  AGAP004335  AGAP004335AGAP028200
4  AGAP028200  AGAP011356  AGAP011356AGAP028200
5  AGAP028194  AGAP008414  AGAP008414AGAP028194
6  AGAP002741  AGAP008026  AGAP002741AGAP008026
7  AGAP008026  AGAP002741  AGAP002741AGAP008026

In [20]: df.drop_duplicates('check_string')
Out[20]:
  InteractorA InteractorB          check_string
0  AGAP028204  AGAP005846  AGAP005846AGAP028204
1  AGAP028204  AGAP003428  AGAP003428AGAP028204
2  AGAP028200  AGAP011124  AGAP011124AGAP028200
3  AGAP028200  AGAP004335  AGAP004335AGAP028200
4  AGAP028200  AGAP011356  AGAP011356AGAP028200
5  AGAP028194  AGAP008414  AGAP008414AGAP028194
6  AGAP002741  AGAP008026  AGAP002741AGAP008026
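
One caveat with a concatenated key: joining without a separator could, in principle, make two different pairs collide when identifiers vary in length (the IDs here are fixed-width, so this data is not affected). A minimal sketch that avoids building strings at all, assuming NumPy is available and using a small made-up frame, might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: rows 1 and 2 are the same pair in opposite order
df = pd.DataFrame({
    'InteractorA': ['AGAP028204', 'AGAP002741', 'AGAP008026'],
    'InteractorB': ['AGAP005846', 'AGAP008026', 'AGAP002741'],
})

cols = ['InteractorA', 'InteractorB']

# Sort the two values within each row so (A, B) and (B, A) become identical
sorted_pairs = pd.DataFrame(np.sort(df[cols].to_numpy(), axis=1), index=df.index)

# duplicated() flags every repeat after the first occurrence
deduped = df[~sorted_pairs.duplicated()]
print(deduped)
```

Because the key lives in a separate frame, `df` keeps its original two columns and the surviving rows keep their original column order.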
