查找重叠和非重叠区域
我有一个BLAST的表格输出,里面有数百万个结果。Con是我的序列,P是一个蛋白质的结果。我想把下面这三种情况的结果区分开来。每种情况的结果都应该分别保存到三个新的文件里,而且文件1中的内容不能出现在文件2、3等其他文件里。我该怎么做呢?
con1 ----------------------- (Contigs with both overlapping and non overlapping hits)
p1---- p2 ------ p4---
p3-----
con2 --------------------- (only overlapping) con3 ----------------(only non overlp)
p1 ----- p1 ---- p2 -----
p2 -------
p3 -----
如果我知道蛋白质的起始和结束位置,就可以判断它们是否重叠;如果S1 < E2 < S2且E1 < S2 < E2,或者S2 - E1 > 0,就说明有重叠。
我的输入文件是:
contig protein start end
con1 P1 481 931
con1 P2 140 602
con1 P3 232 548
con2 P4 335 406
con2 P5 642 732
con2 P6 2282 2433
con2 P7 729 812
con3 P8 17 148
con3 P9 289 45
我的脚本(这个脚本只会打印出那些没有重叠的结果)
from itertools import groupby
def nonoverlapping(hits):
"""Returns a list of non-overlapping hits."""
nonover = []
overst = False
for i in range(1,len(hits)):
(p, c) = hits[i-1], hits[i]
if c[2] > p[3]:
if not overst: nonover.append(p)
nonover.append(c)
overst = True
return nonover
fh = open('file.txt')
oh = open('result.txt', 'w')
for qid, grp in groupby(fh, lambda l: l.split()[0]):
hits = []
for line in grp:
hsp = line.split()
hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
hits.append(hsp)
if len(hits) > 1:
hits.sort(key=lambda x: x[2])
for hit in nonoverlapping(hits):
oh.write('\t'.join([str(f) for f in hit])+'\n')
1 个回答
2
我会这样做。首先定义一个“重叠”函数,用来判断两个碰撞是否重叠。接着,检查每个contig(连续的DNA片段)是否全部、部分或者没有重叠。最后,把所有的contig写入到你想要的文件中:
from itertools import groupby
def overlaps(a, b):
result = True
# Supposing a[2] is the start, a[3] the end.
# If end before start, they are not overlapping
if a[3] < b[2] or b[3] < a[2]:
result = False
return result
def test_overlapping(hits):
overlapping = 'None'
overlapping_count = 0
for i in range(len(hits)-1):
if overlaps(hits[i], hits[i+1]):
overlapping_count += 1
if overlapping_count == 0:
overlapping = 'None'
elif overlapping_count == len(hits) -1:
overlapping = 'All'
else:
overlapping = 'Some'
return overlapping
fh = open('file.txt')
file_all = open('result_all.txt', 'w')
file_some = open('result_some.txt', 'w')
file_none = open('result_none.txt', 'w')
line = fh.readline() # quit header
for qid, grp in groupby(fh, lambda l: l.split()[0]):
hits = []
for line in grp:
hsp = line.split()
hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
hits.append(hsp)
if len(hits) > 1:
hits.sort(key=lambda x: x[2])
overlapping = test_overlapping(hits)
out_file = file_none
if overlapping == 'All':
out_file = file_all
elif overlapping == 'Some':
out_file = file_some
for h in hits:
out_file.write('\t'.join([str(v) for v in h]))
out_file.write('\n')
file_all.close()
file_some.close()
file_none.close()