Finding overlapping regions in a large Pandas DataFrame

2 votes
1 answer
35 views
Asked 2025-04-14 18:19

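I'm using Pandas to work with a DataFrame of genomic regions, with columns for chromosome, start position, and stop position. I want to find the regions that overlap on the same chromosome and collect them together with their labels. I'm not sure my current approach is correct, and because the DataFrame is very large (3 million rows), a for loop is too slow to be practical.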

Here is a sample DataFrame and the result I expect:

import pandas as pd

# Sample DataFrame
data = {
    'chromosome': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
    'start': [10, 15, 35, 45, 55],
    'stop': [20, 25, 55, 56, 60],
    'hg_38_locs': ['chr1:10-20', 'chr1:15-25', 'chr1:35-55', 'chr1:45-56', 'chr1:55-60'],
    'main_category': ['label1', 'label2', 'label2', 'label3', 'label1']
}

Output:

     overlapping_regions              overlapping_labels
0    (chr1:10-20, chr1:15-25)        (label1, label2)
1    (chr1:10-20, chr1:35-55)        (label1, label2)
2    (chr1:15-25, chr1:35-55)        (label2, label2)
3    (chr1:35-55, chr1:45-56)        (label2, label3)
4    (chr1:45-56, chr1:55-60)        (label3, label1)

1 Answer

0

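I think the output you posted in the question is wrong. Just compare the start and stop values, or build the interval tree: if you work through the exercise by hand, you'll see that the output you posted doesn't match the data. For example, chr1:10-20 ends before chr1:35-55 starts, so that pair cannot overlap. I suggest trying the approach below.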

import pandas as pd
from intervaltree import Interval, IntervalTree

data = {
    'chromosome': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
    'start': [10, 15, 35, 45, 55],
    'stop': [20, 25, 55, 56, 60],
    'hg_38_locs': ['chr1:10-20', 'chr1:15-25', 'chr1:35-55', 'chr1:45-56', 'chr1:55-60'],
    'main_category': ['label1', 'label2', 'label2', 'label3', 'label1']
}
df = pd.DataFrame(data)

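def find_overlaps(df):
    results = []
    # Build one interval tree per chromosome so regions on different
    # chromosomes are never compared.
    for chromosome, group in df.groupby('chromosome'):
        tree = IntervalTree()
        for _, row in group.iterrows():
            # intervaltree intervals are half-open: [start, stop)
            tree[row['start']:row['stop']] = (row['hg_38_locs'], row['main_category'])

        for interval in tree:
            # overlap() returns a set that includes the query interval itself,
            # so more than one hit means at least one real overlap.
            overlaps = tree.overlap(interval.begin, interval.end)
            if len(overlaps) > 1:
                # Order inside each tuple follows set iteration order.
                overlapping_regions = tuple(ov.data[0] for ov in overlaps)
                overlapping_labels = tuple(ov.data[1] for ov in overlaps)
                # Skip groups that were already recorded with the same ordering.
                if (overlapping_regions, overlapping_labels) not in results:
                    results.append((overlapping_regions, overlapping_labels))

    return pd.DataFrame(results, columns=['overlapping_regions', 'overlapping_labels'])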

output_df = find_overlaps(df)
print(output_df)

This produces:

                    overlapping_regions        overlapping_labels
0              (chr1:35-55, chr1:45-56)          (label2, label3)
1              (chr1:15-25, chr1:10-20)          (label2, label1)
2  (chr1:45-56, chr1:35-55, chr1:55-60)  (label3, label2, label1)
3              (chr1:45-56, chr1:55-60)          (label3, label1)
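If you need one row per overlapping pair, as in the layout the question asks for, a sort-based variant may be worth trying. The following is only a minimal sketch, not a tested solution for 3 million rows: `pairwise_overlaps` is a name made up here, and it assumes half-open intervals `[start, stop)` with `start < stop`, the same convention intervaltree uses. After sorting by `start`, the rows that overlap row `i` are exactly the later rows whose `start` is below row `i`'s `stop`, which `numpy.searchsorted` can locate without a Python loop over rows.

```python
import numpy as np
import pandas as pd


def pairwise_overlaps(df):
    """Return one row per pair of overlapping regions on the same chromosome.

    Assumption: intervals are half-open [start, stop) with start < stop.
    """
    columns = ["overlapping_regions", "overlapping_labels"]
    frames = []
    for _, group in df.groupby("chromosome", sort=False):
        g = group.sort_values("start", kind="stable").reset_index(drop=True)
        starts = g["start"].to_numpy()
        stops = g["stop"].to_numpy()
        locs = g["hg_38_locs"].to_numpy()
        labels = g["main_category"].to_numpy()
        n = len(g)

        # last[i] is the first sorted position whose start is >= stops[i];
        # rows i+1 .. last[i]-1 therefore overlap row i.
        last = np.searchsorted(starts, stops, side="left")
        counts = np.maximum(last - np.arange(n) - 1, 0)

        # Expand to explicit (left, right) index pairs.
        left = np.repeat(np.arange(n), counts)
        run_start = np.cumsum(counts) - counts
        right = left + 1 + (np.arange(counts.sum()) - np.repeat(run_start, counts))

        frames.append(pd.DataFrame({
            "overlapping_regions": list(zip(locs[left], locs[right])),
            "overlapping_labels": list(zip(labels[left], labels[right])),
        }))

    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)


df = pd.DataFrame({
    "chromosome": ["chr1"] * 5,
    "start": [10, 15, 35, 45, 55],
    "stop": [20, 25, 55, 56, 60],
    "hg_38_locs": ["chr1:10-20", "chr1:15-25", "chr1:35-55", "chr1:45-56", "chr1:55-60"],
    "main_category": ["label1", "label2", "label2", "label3", "label1"],
})
print(pairwise_overlaps(df))
```

On the sample data this yields three pairs: (chr1:10-20, chr1:15-25), (chr1:35-55, chr1:45-56) and (chr1:45-56, chr1:55-60). Keep in mind that the number of pairs can grow quadratically when many regions pile up on the same locus, so memory use depends on how dense the overlaps are.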

The interval tree approach should still work on very large DataFrames. If it turns out to be too slow, you could try ProcessPoolExecutor from concurrent.futures.
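As a rough illustration of the ProcessPoolExecutor idea, the sketch below hands each chromosome to a separate worker process. The function names `overlaps_in_group` and `find_overlaps_parallel` are hypothetical, and the worker uses a simple sorted scan over half-open `[start, stop)` intervals as a stand-in; you could swap in the interval tree logic instead. Whether this is actually faster depends on how many chromosomes you have and on the cost of shipping each group to a worker, so treat it as something to benchmark rather than a guaranteed speedup.

```python
from concurrent.futures import ProcessPoolExecutor

import pandas as pd


def overlaps_in_group(group):
    """Hypothetical worker: list overlapping pairs within one chromosome.

    Assumption: intervals are half-open [start, stop).
    """
    g = group.sort_values("start")
    starts = g["start"].tolist()
    stops = g["stop"].tolist()
    locs = g["hg_38_locs"].tolist()
    labels = g["main_category"].tolist()

    pairs = []
    for i in range(len(starts)):
        for j in range(i + 1, len(starts)):
            # Rows are sorted by start, so once a row starts at or after
            # stops[i], no later row can overlap row i either.
            if starts[j] >= stops[i]:
                break
            pairs.append(((locs[i], locs[j]), (labels[i], labels[j])))
    return pairs


def find_overlaps_parallel(df, max_workers=None):
    """Run overlaps_in_group on each chromosome in its own process."""
    groups = [group for _, group in df.groupby("chromosome")]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = [
            pair
            for pairs in pool.map(overlaps_in_group, groups)
            for pair in pairs
        ]
    return pd.DataFrame(
        results, columns=["overlapping_regions", "overlapping_labels"]
    )
```

Call `find_overlaps_parallel(df)` from inside an `if __name__ == "__main__":` block, since platforms that start worker processes by spawning re-import the main module. With a single chromosome, as in the sample data, there is only one task, so there is nothing to parallelize.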
