使用Biopython通过BLAST获取未知序列的详细信息

0 投票

1 回答

1491 浏览

提问于 2025-04-18 17:39

我第一次使用Biopython。我有一些来自未知生物的序列数据，想用BLAST来判断这些序列最有可能来自哪个生物。我写了下面这个函数来实现这个目的：

def find_organism(file):
    """
    Receives a fasta file with a single seq, and uses BLAST to find
    from which organism it was taken.
    """
    # get seq from fasta file
    seqRecord = SeqIO.read(file,"fasta")
    # run BLAST
    blastResult = NCBIWWW.qblast("blastn", "nt", seqRecord.seq)
    # get first hit
    blastRecord = NCBIXML.read(blastResult)
    firstHit = blastRecord.alignments[0]
    # get hit's gi number
    title = firstHit.title
    gi = title.split("|")[1]
    # search NCBI for the gi number
    ncbiResult = Entrez.efetch(db="nucleotide", id=gi, rettype="gb", retmode="text")
    ncbiResultSeqRec = SeqIO.read(ncbiResult,"gb")
    # get organism
    annotatDict = ncbiResultSeqRec.annotations
    return(annotatDict['organism'])

这个函数运行得还不错，但每个物种获取生物信息大约需要2分钟，我觉得这太慢了。我在想有没有更好的方法。我知道可以创建一个NCBI的本地副本来提高性能，我可能会考虑这样做。不过，我怀疑先查询BLAST，然后再用得到的ID去查询Entrez，这样的做法是否合适。你们有没有其他的改进建议？
谢谢！

性能优化生物信息学 ncbi biopython blast 序列比对 entez

1 个回答

你可以通过以下方式获取生物体的信息：

[...]
blastResult = NCBIWWW.qblast("blastn", "nt", seqRecord.seq)
blastRecord = NCBIXML.read(blastResult)

first_organism = blastRecord.descriptions[0]

这样做至少可以保存efetch查询的结果。不过，"blastn"这个过程可能会花费很长时间。如果你打算大量查询NCBI的数据，可能会被禁止访问。

回答于 2025-04-18 由 Python大师

分享举报

使用Biopython通过BLAST获取未知序列的详细信息

1 个回答

撰写回答