如何识别基因组中特定位置的特征

from Bio import SeqIO import random GenomeSeq = SeqIO.read(open("reference_sequence.gbk", "r"), "genbank") interesting_position = random.randint(0, len(GenomeSeq)) for feature in GenomeSeq.features: # loop each position through whole genome # In this particular case I'm interested in focusing on cds, but # in others, I may be interested in other feature types? if feature.type == "CDS": if (feature.location._start.position <= interesting_position and feature.location._end.position >= interesting_position): try: print feature.qualifiers['gene'] except KeyError: print feature

2条回答

网友

1楼 · 编辑于 2024-04-19 07:12:32

很老了，但我要试试。假设您想检查给定基因组的随机点列表（而不是多个基因组的固定点集）：

from Bio import SeqIO
import random

GenomeSeq = SeqIO.read(open("sequence.gb", "r"), "genbank")

# Here I generate a lot of random points in the genome in a set
interesting_positions = set(random.sample(xrange(1, len(GenomeSeq)), 20000))


for feature in GenomeSeq.features:

    if feature.type == "CDS":
        # Here I create a set for every position covered by a feature
        feature_span = set(xrange(feature.location._start.position,
                                  feature.location._end.position))

        # And this is the "trick": check if the two sets overlaps: if any point
        # is in the interesting_positions AND in the points covered by this
        # feature, we are interested in the feature.
        if feature_span & interesting_positions:
            try:
                print feature.qualifiers['gene']
            except KeyError:
                print feature

对于20000个点和一个4.7Mb的基因组，在我的蹩脚的2003计算机中，这个循环大约需要3秒，对于200000个随机点，循环需要5秒。在

在分析和重构之后，我提取了所有只需计算一次的行，就可以得到这样的结果：

^{pr2}$

每个样本大约需要0.2秒，读取/准备基因组需要4秒。在我的这台旧机器上做100万个样本大约需要50个小时。作为一个随机抽样，在一台多红色机器中启动一些进程，明天就可以完成了。在

网友

2楼 · 编辑于 2024-04-19 07:12:32

我从未使用过BioPython，但我在它的文档中发现了这个：http://biopython.org/wiki/SeqIO

from Bio import SeqIO
handle = open("example.fasta", "rU")
records = list(SeqIO.parse(handle, "fasta"))
handle.close()
print records[0].id  #first record
print records[-1].id #last record

这就是你要找的吗？在

相关问题更多 >

编程相关推荐

热门问题

热门文章