Processing a GFF file with Biopython

Question

I have a GFF file, which is a tab-delimited file with 9 columns. My GFF file looks like this:

chr1    GenBank region  1   2821361 .   +   1   ID=CP000253.1
chr1    S-MART  utr5    313 516     .   +   .   ID=CP000253.1|+313..516
chr1    GenBank gene    517 1878    .   +   1   ID=SAOUHSC_00001

......... followed by many more lines like these.

Problem description:

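I want to merge rows that satisfy a specific condition: the value in column 5 (end) of row i must equal the value in column 4 (start) of row i+1 minus 1. In the sample above, the utr5 row ends at 516 and the gene row starts at 517, so those two rows should be merged.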

So the final result should look like this:

chr1    GenBank region  1   2821361 .   +   1   ID=CP000253.1
chr1    predict TU      313 1878    .   +   1   ID=SAOUHSC_00001
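
For reference, the merge rule described above can be sketched in plain Python, without Biopython. This is only a minimal sketch under assumptions: the function name `merge_adjacent_rows` is hypothetical, the `predict` and `TU` labels are taken from the expected output, and copying columns 6 to 9 from the last merged row is a guess that happens to reproduce that output.

```python
def merge_adjacent_rows(rows, source="predict", feature_type="TU"):
    """Merge consecutive GFF rows where end(row i) == start(row i+1) - 1.

    rows is a list of 9-element lists of strings (one per GFF line).
    Hypothetical helper, not part of Biopython or bcbio-gff.
    """
    merged = []
    for cols in rows:
        cols = list(cols)
        if merged and int(merged[-1][4]) == int(cols[3]) - 1:
            prev = merged[-1]
            # Keep the earlier start, take the later end; columns 6-9
            # (score, strand, phase, attributes) come from the current row.
            merged[-1] = [prev[0], source, feature_type, prev[3], cols[4]] + cols[5:]
        else:
            merged.append(cols)
    return merged


rows = [
    ["chr1", "GenBank", "region", "1", "2821361", ".", "+", "1", "ID=CP000253.1"],
    ["chr1", "S-MART", "utr5", "313", "516", ".", "+", ".", "ID=CP000253.1|+313..516"],
    ["chr1", "GenBank", "gene", "517", "1878", ".", "+", "1", "ID=SAOUHSC_00001"],
]

result = merge_adjacent_rows(rows)
for cols in result:
    print("\t".join(cols))
```

The sketch checks only the stated column condition. On a real file it would probably also make sense to compare the sequence ID and strand before merging, and to assume the rows are sorted by start position.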

To solve this, I wrote the following program:

from BCBio import GFF
from Bio.SeqFeature import SeqFeature, FeatureLocation

in_file = "infile.gff"
out_file = "outfile.gff"

limit_info = dict(
        gff_type = ['CDS','exon','gene','mRNA','operon','rRNA','tRNA','utr3','utr5'])
new_qualifiers = {"source": "prediction","ID": "CP000253.1"}
new_sub_qualifiers = {"source": "prediction"}
new_top_feature = SeqFeature(FeatureLocation(0, 2821361), type="genomeRegion", strand=1,
                         qualifiers=new_qualifiers)
i=0

in_handle = open(in_file)
for rec in GFF.parse(in_handle, limit_info=limit_info):
    for i in range(10):
        if rec.features[i].location.end == rec.features[i+1].location.start :
            # print rec.features[i]
            new_top_feature.sub_features[i] = [
                SeqFeature(
                    FeatureLocation(rec.features[i].location.start,
                                    rec.features[i+1].location.end,
                                    strand=rec.features[i].strand),
                    type="Transcription_unit",
                    qualifiers=new_sub_qualifiers)]

in_handle.close()

rec.features = [new_top_feature]

with open(out_file, "w") as out_handle:
    GFF.write([rec], out_handle)

But I get the following error:

/usr/lib/python2.7/dist-packages/Bio/SeqFeature.py:171: BiopythonDeprecationWarning: Rather using f.sub_features, f.location should be a CompoundFeatureLocation
  BiopythonDeprecationWarning)
Traceback (most recent call last):
  File "/home/nkumar/workplacekepler/random/src/limit.py", line 26, in <module>
    new_top_feature.sub_features[i] = [SeqFeature(FeatureLocation(rec.features[i].location.start , rec.features[i+1].location.end ,strand=rec.features[i].strand), type="Transcription_unit",  qualifiers=new_sub_qualifiers)]
IndexError: list assignment index out of range

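I can see that it is an index-out-of-range error, but I still cannot figure out what is going wrong. The same loop without the assignment works: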
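
The message in the traceback can be reproduced in plain Python, which may help narrow things down. The snippet below is only an illustration, assuming (not verified for this Biopython version) that `new_top_feature.sub_features` starts out as an empty list; the variable `sub_features` here is a stand-in, not the real attribute.

```python
sub_features = []  # stand-in for an empty list of sub-features

try:
    # Assigning to an index that does not exist yet fails,
    # because item assignment never grows a list.
    sub_features[0] = "Transcription_unit"
except IndexError as exc:
    error_message = str(exc)

print(error_message)

# append() grows the list, so it works on an empty list.
sub_features.append("Transcription_unit")
print(sub_features)
```

If that assumption holds, the failure would come from the left-hand side of the assignment rather than from reading `rec.features[i]` or `rec.features[i+1]`, which would be consistent with the read-only loop running without errors.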

in_handle = open(in_file)
for rec in GFF.parse(in_handle, limit_info=limit_info):
    for i in range(10):        
        if rec.features[i].location.end == rec.features[i+1].location.start :
            print 1          
        else:
            print rec.features[i]            
in_handle.close()

This version runs fine and prints all the features.
