Processing a GFF file with Biopython
I have a GFF file, which is a tab-delimited file with 9 columns. My GFF file looks like this:
chr1 GenBank region 1 2821361 . + 1 ID=CP000253.1
chr1 S-MART utr5 313 516 . + . ID=CP000253.1|+313..516
chr1 GenBank gene 517 1878 . + 1 ID=SAOUHSC_00001
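......... and many more lines like these.
Problem description:
Now I want to merge rows that satisfy a specific condition: the value in column 5 of row i must equal the value in column 4 of row i+1 minus 1.
So the final result should look like this: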
chr1 GenBank region 1 2821361 . + 1 ID=CP000253.1
chr1 predict TU 313 1878 . + 1 ID=SAOUHSC_00001
To solve this, I wrote the following program:
from BCBio import GFF
from Bio.SeqFeature import SeqFeature, FeatureLocation

in_file = "infile.gff"
out_file = "outfile.gff"

limit_info = dict(
    gff_type = ['CDS','exon','gene','mRNA','operon','rRNA','tRNA','utr3','utr5'])

new_qualifiers = {"source": "prediction","ID": "CP000253.1"}
new_sub_qualifiers = {"source": "prediction"}
new_top_feature = SeqFeature(FeatureLocation(0, 2821361), type="genomeRegion", strand=1,
                             qualifiers=new_qualifiers)
i=0
in_handle = open(in_file)
for rec in GFF.parse(in_handle, limit_info=limit_info):
    for i in range(10):
        if rec.features[i].location.end == rec.features[i+1].location.start :
            # print rec.features[i]
            new_top_feature.sub_features[i] = [SeqFeature(
                FeatureLocation(rec.features[i].location.start ,
                                rec.features[i+1].location.end ,strand=rec.features[i].strand),
                type="Transcription_unit", qualifiers=new_sub_qualifiers)]
in_handle.close()
rec.features = [new_top_feature]
with open(out_file, "w") as out_handle:
    GFF.write([rec], out_handle)
But I got the following error:
/usr/lib/python2.7/dist-packages/Bio/SeqFeature.py:171: BiopythonDeprecationWarning: Rather using f.sub_features, f.location should be a CompoundFeatureLocation
BiopythonDeprecationWarning)
Traceback (most recent call last):
File "/home/nkumar/workplacekepler/random/src/limit.py", line 26, in <module>
new_top_feature.sub_features[i] = [SeqFeature(FeatureLocation(rec.features[i].location.start , rec.features[i+1].location.end ,strand=rec.features[i].strand), type="Transcription_unit", qualifiers=new_sub_qualifiers)]
IndexError: list assignment index out of range
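I can see that this is an index-out-of-range error, but I still cannot figure out what is going wrong. For comparison, the following loop uses the same condition without assigning anything: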
in_handle = open(in_file)
for rec in GFF.parse(in_handle, limit_info=limit_info):
    for i in range(10):
        if rec.features[i].location.end == rec.features[i+1].location.start :
            print 1
        else:
            print rec.features[i]
in_handle.close()
This program runs fine and prints all the features.
1 Answer
You defined a feature called new_top_feature:
type: genomeRegion
location: [0:2821361](+)
qualifiers:
Key: ID, Value: CP000253.1
Key: source, Value: prediction
But it has no sub-features:
>>> print new_top_feature.sub_features
[]
new_top_feature.sub_features is an empty list at this point. You cannot assign to an index of an empty list:
>>> a = []
>>> a[0] = 3
Traceback (most recent call last):
File "<input>", line 1, in <module>
IndexError: list assignment index out of range
And that is exactly what you are doing here:
new_top_feature.sub_features[i] = .....
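To add items to this list, use the append method instead of assigning by index. If you really need to fill the list at arbitrary positions, first create a list of the right size filled with zeros, and then put the values at the corresponding positions.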
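As a minimal sketch of both approaches: it uses plain (start, end) tuples in 1-based GFF coordinates instead of the real SeqFeature objects. The names features, merged and slots are illustrative only and are not part of BCBio or Biopython, and None is used as the placeholder instead of zeros.

```python
# Illustrative stand-ins for the three rows in the question:
# (column 4, column 5) of each GFF line. Not the BCBio API.
features = [(1, 2821361), (313, 516), (517, 1878)]

# Option 1: start with an empty list and grow it with append().
merged = []
for current, following in zip(features, features[1:]):
    # Merge when column 5 of this row equals column 4 of the next row minus 1.
    if current[1] == following[0] - 1:
        merged.append((current[0], following[1]))

# Option 2: pre-size the list first, then assigning by index is safe.
slots = [None] * (len(features) - 1)
for i in range(len(features) - 1):
    if features[i][1] == features[i + 1][0] - 1:
        slots[i] = (features[i][0], features[i + 1][1])

print(merged)  # [(313, 1878)]
```

With append the result contains only the merged intervals, whereas the pre-sized list keeps a placeholder at every position where nothing was merged. In your script the same idea would mean appending each new SeqFeature to new_top_feature.sub_features rather than assigning to sub_features[i].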