查找重叠和非重叠区域

1 投票
1 回答
2092 浏览
提问于 2025-04-18 00:59

我有一个BLAST的表格输出,里面有数百万个结果。Con是我的序列,P是一个蛋白质的结果。我想把下面这三种情况的结果区分开来。每种情况的结果都应该分别保存到三个新的文件里,而且文件1中的内容不能出现在文件2、3等其他文件里。我该怎么做呢?

con1 ----------------------- (Contigs with both overlapping and non overlapping hits)
   p1---- p2 ------    p4---
               p3-----
con2 --------------------- (only overlapping)  con3 ----------------(only non overlp)
   p1 -----                                       p1 ----  p2 -----
      p2 -------
          p3 ----- 

如果我知道蛋白质的起始和结束位置,就可以判断它们是否重叠;如果S1 < E2 < S2且E1 < S2 < E2,或者S2 - E1 > 0,就说明有重叠。

我的输入文件是:

contig  protein start end
con1    P1  481 931
con1    P2  140 602
con1    P3  232 548
con2    P4  335 406
con2    P5  642 732
con2    P6  2282    2433
con2    P7  729 812
con3    P8  17  148
con3    P9  289 45

我的脚本(这个脚本只会打印出那些没有重叠的结果)

from itertools import groupby

def nonoverlapping(hits):
    """Returns a list of non-overlapping hits."""
    nonover = []
    overst =  False
    for i in range(1,len(hits)):
        (p, c) = hits[i-1], hits[i]
        if c[2] > p[3]:
            if not overst: nonover.append(p)
            nonover.append(c)
            overst = True  
    return nonover

fh = open('file.txt')
oh = open('result.txt', 'w')
for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    for line in grp:
        hsp = line.split()
        hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
        hits.append(hsp)
    if len(hits) > 1: 
        hits.sort(key=lambda x: x[2])
        for hit in nonoverlapping(hits):
            oh.write('\t'.join([str(f) for f in hit])+'\n')

1 个回答

2

我会这样做。首先定义一个“重叠”函数,用来判断两个碰撞是否重叠。接着,检查每个contig(连续的DNA片段)是否全部、部分或者没有重叠。最后,把所有的contig写入到你想要的文件中:

from itertools import groupby

def overlaps(a, b):
    result = True
    # Supposing a[2] is the start, a[3] the end. 
    # If end before start, they are not overlapping
    if a[3] < b[2] or b[3] < a[2]:
        result = False

    return result

def test_overlapping(hits):
    overlapping = 'None'
    overlapping_count = 0
    for i in range(len(hits)-1):
        if overlaps(hits[i], hits[i+1]):
            overlapping_count += 1


    if overlapping_count == 0:
        overlapping = 'None'
    elif overlapping_count == len(hits) -1:
        overlapping = 'All'
    else:
        overlapping = 'Some'

    return overlapping


fh = open('file.txt')

file_all = open('result_all.txt', 'w')
file_some = open('result_some.txt', 'w')
file_none = open('result_none.txt', 'w')

line = fh.readline() # quit header

for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    for line in grp:
        hsp = line.split()
        hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
        hits.append(hsp)
    if len(hits) > 1: 
        hits.sort(key=lambda x: x[2])
        overlapping = test_overlapping(hits)
        out_file = file_none
        if overlapping == 'All':
            out_file = file_all
        elif overlapping == 'Some':
            out_file = file_some

        for h in hits:
            out_file.write('\t'.join([str(v) for v in h]))
            out_file.write('\n')


file_all.close()
file_some.close()
file_none.close()

撰写回答