Finding overlapping regions in a large Pandas DataFrame
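I'm using Pandas to process a DataFrame of genomic regions, with a chromosome, a start position, and a stop position for each region. I want to find the regions that overlap on the same chromosome and collect them together with their labels. I'm not sure my current approach is correct, and because the DataFrame is very large (3 million rows), a for loop is too slow to be practical.
Below is a sample DataFrame and the result DataFrame I expect: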
import pandas as pd

# Sample DataFrame
data = {
    'chromosome': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
    'start': [10, 15, 35, 45, 55],
    'stop': [20, 25, 55, 56, 60],
    'hg_38_locs': ['chr1:10-20', 'chr1:15-25', 'chr1:35-55', 'chr1:45-56', 'chr1:55-60'],
    'main_category': ['label1', 'label2', 'label2', 'label3', 'label1']
}
Output:
        overlapping_regions overlapping_labels
0  (chr1:10-20, chr1:15-25)    (label1, label2)
1  (chr1:10-20, chr1:35-55)    (label1, label2)
2  (chr1:15-25, chr1:35-55)    (label2, label2)
3  (chr1:35-55, chr1:45-56)    (label2, label3)
4  (chr1:45-56, chr1:55-60)    (label3, label1)
1 Answer
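I think the output you posted in your question is wrong. You only need to look at an interval tree and the start and stop values. If you work through the exercise by hand, you'll find that the output you posted doesn't match what the data actually gives. I suggest trying the approach below.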
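Before that, here is a quick way to do the by-hand check. This is only a minimal brute-force sketch on the five sample rows, not something to run on 3 million rows. The `rows` list and the `overlaps` helper are names made up for this illustration, and it assumes half-open intervals (start included, stop excluded), which is the convention intervaltree uses.

```python
from itertools import combinations

# Sample intervals from the question: (start, stop, location, label)
rows = [
    (10, 20, "chr1:10-20", "label1"),
    (15, 25, "chr1:15-25", "label2"),
    (35, 55, "chr1:35-55", "label2"),
    (45, 56, "chr1:45-56", "label3"),
    (55, 60, "chr1:55-60", "label1"),
]

def overlaps(a, b):
    # Assumption: half-open [start, stop) intervals.
    # Two intervals overlap when each one starts before the other ends.
    return a[0] < b[1] and b[0] < a[1]

# Test every unordered pair of rows (fine for 5 rows, quadratic in general)
pairs = [(a[2], b[2]) for a, b in combinations(rows, 2) if overlaps(a, b)]
for pair in pairs:
    print(pair)
```

Neither chr1:10-20 nor chr1:15-25 reaches chr1:35-55, so rows 1 and 2 of the output in the question cannot be overlaps under either a half-open or a closed reading of the intervals. With that settled, the interval-tree version: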
import pandas as pd
from intervaltree import IntervalTree

data = {
    'chromosome': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
    'start': [10, 15, 35, 45, 55],
    'stop': [20, 25, 55, 56, 60],
    'hg_38_locs': ['chr1:10-20', 'chr1:15-25', 'chr1:35-55', 'chr1:45-56', 'chr1:55-60'],
    'main_category': ['label1', 'label2', 'label2', 'label3', 'label1']
}
df = pd.DataFrame(data)

def find_overlaps(df):
    results = []
    # Build one interval tree per chromosome
    for chromosome, group in df.groupby('chromosome'):
        tree = IntervalTree()
        for _, row in group.iterrows():
            # Store the location string and the label as the interval's data
            tree[row['start']:row['stop']] = (row['hg_38_locs'], row['main_category'])
        for interval in tree:
            # overlap() returns a set of all intervals overlapping this range,
            # including the interval itself, so the order inside it is not fixed
            overlaps = tree.overlap(interval.begin, interval.end)
            if len(overlaps) > 1:
                overlapping_regions = tuple(ov.data[0] for ov in overlaps)
                overlapping_labels = tuple(ov.data[1] for ov in overlaps)
                # Skip groups that have already been recorded
                if (overlapping_regions, overlapping_labels) not in results:
                    results.append((overlapping_regions, overlapping_labels))
    return pd.DataFrame(results, columns=['overlapping_regions', 'overlapping_labels'])

output_df = find_overlaps(df)
print(output_df)
This gives:
                    overlapping_regions        overlapping_labels
0              (chr1:35-55, chr1:45-56)          (label2, label3)
1              (chr1:15-25, chr1:10-20)          (label2, label1)
2  (chr1:45-56, chr1:35-55, chr1:55-60)  (label3, label2, label1)
3              (chr1:45-56, chr1:55-60)          (label3, label1)
This approach should still work on a very large DataFrame. If it turns out to be too slow, you could try ProcessPoolExecutor from concurrent.futures.
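A rough sketch of how the ProcessPoolExecutor idea could look, splitting the work by chromosome so that each worker process handles one group. To stay dependency-free it swaps the interval tree for a plain sort-and-scan and reports overlaps as pairs; `overlaps_in_group` and `find_overlaps_parallel` are hypothetical names, the input is assumed to be plain tuples rather than a DataFrame, and intervals are assumed half-open and non-empty.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

def overlaps_in_group(rows):
    """Pairwise overlaps among (start, stop, loc, label) tuples of ONE chromosome."""
    rows = sorted(rows)  # sort by start, then stop
    out = []
    for i in range(len(rows)):
        stop_i = rows[i][1]
        j = i + 1
        # Rows are sorted by start, so once a later row starts at or after
        # stop_i, no row further down can overlap row i either.
        while j < len(rows) and rows[j][0] < stop_i:
            out.append(((rows[i][2], rows[j][2]), (rows[i][3], rows[j][3])))
            j += 1
    return out

def find_overlaps_parallel(records, max_workers=None):
    """records: iterable of (chromosome, start, stop, loc, label) tuples."""
    groups = defaultdict(list)
    for chrom, start, stop, loc, label in records:
        groups[chrom].append((start, stop, loc, label))
    results = []
    # One task per chromosome; each task runs in a separate worker process
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        for chunk in pool.map(overlaps_in_group, groups.values()):
            results.extend(chunk)
    return results
```

The records can be taken from the DataFrame with df[['chromosome', 'start', 'stop', 'hg_38_locs', 'main_category']].itertuples(index=False, name=None). Call find_overlaps_parallel from under an if __name__ == "__main__": guard, since worker processes may re-import the main module. Whether this is actually faster depends on how many chromosomes there are and how evenly the rows are spread across them, so it is worth timing against the single-process version first.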