I have a huge input file that looks like this:
contig protein start end
con1 P1 140 602
con1 P2 140 602
con1 P3 232 548
con2 P4 335 801
con2 P5 642 732
con2 P6 335 779
con2 P7 729 812
con3 P8 17 348
con3 P9 16 348
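I want to remove homologous or redundant proteins, which I assume are, respectively, those with identical start and end positions and those with a smaller start or end position. So my output file, file.txt, would look like this: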
con1 P1 140 602
con1 P3 232 548
con2 P4 335 801
con2 P7 729 812
This is the script I tried; for some reason it does not satisfy both conditions:
from itertools import groupby

def non_homolog(hits):
    # hits: rows of one contig, sorted by start position
    nonhomolog = []
    overst = False
    for i in range(1, len(hits)):
        # compare each row (c) with the row just before it (p)
        (p, c) = hits[i-1], hits[i]
        if p[2] <= c[2] and c[3] <= p[3]:
            if not overst: nonhomolog.append(c)
            nonhomolog.append(c)
            overst = True
    return nonhomolog

fh = open('example.txt')
oh = open('nonhomologs.txt', 'w')
# group consecutive lines by contig name (first column)
for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    for line in grp:
        hsp = line.split()
        hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
        hits.append(hsp)
    hits.sort(key=lambda x: x[2])
    if non_homolog(hits):
        for hit in hits:
            oh.write('\t'.join([str(f) for f in hit])+'\n')
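A quick trace may help show where the script goes wrong. The snippet below is only a sketch: it copies `non_homolog` as posted and feeds it the three con1 rows from the sample, already sorted by start. The list name `con1` and the variable `flagged` are made up for this illustration.

```python
def non_homolog(hits):
    # Copied from the script as posted, logic unchanged.
    nonhomolog = []
    overst = False
    for i in range(1, len(hits)):
        (p, c) = hits[i-1], hits[i]
        if p[2] <= c[2] and c[3] <= p[3]:
            if not overst: nonhomolog.append(c)
            nonhomolog.append(c)
            overst = True
    return nonhomolog

# The con1 rows from the sample input (hypothetical in-memory copy).
con1 = [
    ['con1', 'P1', 140, 602],
    ['con1', 'P2', 140, 602],
    ['con1', 'P3', 232, 548],
]

flagged = non_homolog(con1)
print([h[1] for h in flagged])  # ['P2', 'P2', 'P3']
```

So the function collects the rows that sit inside the previous row (P2 twice, because of the double append), but the main loop only uses that list as a truth test in `if non_homolog(hits):` and then writes every row in `hits`. For con1 that means P1, P2 and P3 all end up in the output file.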
Try this for size:
With the given data, this produces:
===homologs.txt===
===nonhomologs.txt===
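For anyone who wants something to experiment with, here is a minimal sketch of one possible reading of the rule in the question. The assumption (mine, not confirmed by the asker) is that, within a contig, a protein is dropped when its interval is identical to, or fully contained in, the interval of a protein that is kept. The names `SAMPLE`, `parse` and `split_hits` are hypothetical and only exist for this example.

```python
from itertools import groupby

# Sample rows from the question, kept in memory for the demonstration.
SAMPLE = """\
contig protein start end
con1 P1 140 602
con1 P2 140 602
con1 P3 232 548
con2 P4 335 801
con2 P5 642 732
con2 P6 335 779
con2 P7 729 812
con3 P8 17 348
con3 P9 16 348
"""

def parse(lines):
    """Yield (contig, protein, start, end); skip the header and blank lines."""
    for line in lines:
        fields = line.split()
        if len(fields) != 4 or not (fields[2].isdigit() and fields[3].isdigit()):
            continue  # header row or malformed line
        contig, protein, start, end = fields
        yield contig, protein, int(start), int(end)

def split_hits(lines):
    """Return (kept, dropped) under the containment assumption stated above."""
    kept, dropped = [], []
    # groupby only merges adjacent rows, so rows must already be grouped by contig.
    for _, group in groupby(parse(lines), key=lambda h: h[0]):
        # Start ascending, end descending: a containing interval sorts first.
        hits = sorted(group, key=lambda h: (h[2], -h[3]))
        max_end = None  # end of the most recently kept interval
        for hit in hits:
            if max_end is not None and hit[3] <= max_end:
                dropped.append(hit)  # identical to or inside a kept interval
            else:
                kept.append(hit)
                max_end = hit[3]
    return kept, dropped

kept, dropped = split_hits(SAMPLE.splitlines())
print([h[1] for h in kept])     # ['P1', 'P4', 'P7', 'P9']
print([h[1] for h in dropped])  # ['P2', 'P3', 'P6', 'P5', 'P8']
```

Note that this reading does not reproduce the expected output in the question exactly: that output keeps P3 and lists nothing for con3, whereas the sketch drops P3 and keeps P9. The rule would need adjusting once it is clear why those rows are treated differently. To work on a real file, `split_hits(open('example.txt'))` should behave the same way, since the function only iterates over lines.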