Pandas: drop duplicate rows where the two columns are reversed
I have a dataset with two columns, like this...
InteractorA InteractorB
AGAP028204 AGAP005846
AGAP028204 AGAP003428
AGAP028200 AGAP011124
AGAP028200 AGAP004335
AGAP028200 AGAP011356
AGAP028194 AGAP008414
I am using Pandas and want to remove rows that appear twice but with the columns in reverse order. For example, going from this...
InteractorA InteractorB
AGAP002741 AGAP008026
AGAP008026 AGAP002741
to this...
InteractorA InteractorB
AGAP002741 AGAP008026
because, for my purposes, the two rows are the same.
Is there a built-in way to handle this?
4 Answers
0
I was looking for a solution to a similar problem today. A.Kot's answer pointed me in the right direction. Here is a working example. I copied the data preparation part from EdChum's answer.
import io
import pandas as pd
temp = """InteractorA InteractorB
AGAP028204 AGAP005846
AGAP028204 AGAP003428
AGAP028200 AGAP011124
AGAP028200 AGAP004335
AGAP028200 AGAP011356
AGAP028194 AGAP008414
AGAP002741 AGAP008026
AGAP008026 AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')
# One liner to drop the duplicates.
# frozenset (not set) is used because drop_duplicates needs hashable values.
df.loc[df.apply(lambda x: frozenset(x[['InteractorA', 'InteractorB']]), axis=1).drop_duplicates().index]
0
I think the following should work:
In [37]:
import pandas as pd
import io
temp = """InteractorA InteractorB
AGAP028204 AGAP005846
AGAP028204 AGAP003428
AGAP028200 AGAP011124
AGAP028200 AGAP004335
AGAP028200 AGAP011356
AGAP028194 AGAP008414
AGAP002741 AGAP008026
AGAP008026 AGAP002741"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')
df
Out[37]:
InteractorA InteractorB
0 AGAP028204 AGAP005846
1 AGAP028204 AGAP003428
2 AGAP028200 AGAP011124
3 AGAP028200 AGAP004335
4 AGAP028200 AGAP011356
5 AGAP028194 AGAP008414
6 AGAP002741 AGAP008026
7 AGAP008026 AGAP002741
Edit: I downloaded your data and realized I had misunderstood what you meant. The following now works:
# first get the values that are unique
In [72]:
df1 = df[~df.InteractorA.isin(df.InteractorB)]
df1.shape
Out[72]:
(2386, 2)
Now we want to find the duplicated rows, but keep only the first value for each:
In [74]:
df2 = df[df.InteractorA.isin(df.InteractorB)]
df2 = df2.groupby('InteractorA').first().reset_index()
df2.shape
Out[74]:
(3074, 2)
Now concatenate the two dataframes:
In [75]:
merged = pd.concat([df1, df2], ignore_index=True)
merged.shape
Out[75]:
(5460, 2)
I think this is correct now.
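One way to sanity-check a result such as `merged` is to count how many rows repeat an unordered pair that already appeared in an earlier row. The sketch below is not part of the answer above; `count_repeated_pairs` is a hypothetical helper name, and the small frame is made up from IDs in the question.

```python
import pandas as pd

def count_repeated_pairs(df, col_a="InteractorA", col_b="InteractorB"):
    """Count rows whose unordered (col_a, col_b) pair appeared in an earlier row."""
    # frozenset ignores order, so (a, b) and (b, a) map to the same key.
    keys = pd.Series(
        [frozenset(pair) for pair in zip(df[col_a], df[col_b])],
        index=df.index,
    )
    return int(keys.duplicated().sum())

sample = pd.DataFrame({
    "InteractorA": ["AGAP028204", "AGAP002741", "AGAP008026"],
    "InteractorB": ["AGAP005846", "AGAP008026", "AGAP002741"],
})
print(count_repeated_pairs(sample))
```

A count of zero means no reversed (or exactly repeated) pairs remain in the frame being checked.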
2
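This is the simplest solution I found for myself.
Create a column in which each row holds a sorted list of the two values.
df['sorted_row'] = [sorted([a,b]) for a,b in zip(df.InteractorA, df.InteractorB)]
Duplicates cannot be dropped on a column of lists (lists are not hashable), so convert the column to strings.
df['sorted_row'] = df['sorted_row'].astype(str)
Drop the duplicates.
df.drop_duplicates(subset=['sorted_row'], inplace=True)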
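The steps above can be put together as a self-contained sketch. This is a small variation rather than the exact code above: it stores each sorted pair as a tuple, which is hashable, so the string conversion is not needed, and it does not add a helper column to the frame. `dedupe_unordered_pairs` is a hypothetical name, and the sample rows are a shortened copy of the question's data.

```python
import pandas as pd

def dedupe_unordered_pairs(df, col_a="InteractorA", col_b="InteractorB"):
    """Keep the first row of each unordered (col_a, col_b) pair."""
    # A sorted tuple is identical for (a, b) and (b, a), and it is hashable.
    keys = pd.Series(
        [tuple(sorted(pair)) for pair in zip(df[col_a], df[col_b])],
        index=df.index,
    )
    return df[~keys.duplicated()]

df = pd.DataFrame({
    "InteractorA": ["AGAP028204", "AGAP002741", "AGAP008026"],
    "InteractorB": ["AGAP005846", "AGAP008026", "AGAP002741"],
})
result = dedupe_unordered_pairs(df)
print(result)
```

The original frame is left unmodified; only the first occurrence of each pair is returned.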
14
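I ended up writing a simple script that goes through the data row by row, checks whether the pair or its reverse has already been seen, and drops the row if so.
import pandas as pd
checklist = []        # concatenated pairs seen so far, in both orders
indexes_to_drop = []  # index labels of rows that repeat an earlier pair
interactions = pd.read_csv('original_interactions.txt', delimiter='\t')
for index, row in interactions.iterrows():
    check_string = row['InteractorA'] + row['InteractorB']
    check_string_rev = row['InteractorB'] + row['InteractorA']
    # Test each string separately; `(a or b) in checklist` would only test `a`.
    if check_string in checklist or check_string_rev in checklist:
        indexes_to_drop.append(index)
    checklist.append(check_string)
    checklist.append(check_string_rev)
# iterrows() yields index labels, so they can be passed to drop() directly.
no_dups = interactions.drop(indexes_to_drop)
print(no_dups.shape)
no_dups.to_csv('no_duplicates.txt', sep='\t', index=False)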
2017 update: A few years and some experience later, here is a more elegant solution for anyone looking for something similar:
In [8]: df
Out[8]:
InteractorA InteractorB
0 AGAP028204 AGAP005846
1 AGAP028204 AGAP003428
2 AGAP028200 AGAP011124
3 AGAP028200 AGAP004335
4 AGAP028200 AGAP011356
5 AGAP028194 AGAP008414
6 AGAP002741 AGAP008026
7 AGAP008026 AGAP002741
In [18]: df['check_string'] = df.apply(lambda row: ''.join(sorted([row['InteractorA'], row['InteractorB']])), axis=1)
In [19]: df
Out[19]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
7 AGAP008026 AGAP002741 AGAP002741AGAP008026
In [20]: df.drop_duplicates('check_string')
Out[20]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
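A possible vectorized variant of the same idea, offered as a sketch rather than as part of the answer above: sort the two values within each row with `np.sort` and call `duplicated` on the sorted pairs. This avoids the helper column, and it compares each pair as two separate values instead of one concatenated string. Concatenation could in principle merge distinct pairs such as `('AB', 'C')` and `('A', 'BC')` when IDs vary in length, which is not the case for the IDs shown here. The sample frame is a shortened copy of the question's data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "InteractorA": ["AGAP028204", "AGAP002741", "AGAP008026"],
    "InteractorB": ["AGAP005846", "AGAP008026", "AGAP002741"],
})

cols = ["InteractorA", "InteractorB"]
# Sort within each row so (a, b) and (b, a) produce the same ordered pair.
sorted_pairs = pd.DataFrame(
    np.sort(df[cols].to_numpy(dtype=object), axis=1),
    index=df.index,
)
# Keep only the first occurrence of each unordered pair.
deduped = df[~sorted_pairs.duplicated()]
print(deduped)
```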