Getting a DNA sequence from a protein GI identifier

1 vote
2 answers
2223 views
Asked 2025-05-01 14:21

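I'm using Biopython to try to retrieve the DNA sequence that corresponds to a protein I have, whose GI number is 71743840. On the NCBI website this is straightforward: I just look up the RefSeq (reference sequence) record. When I try to do it in Python, though, I get stuck: using NCBI's fetch utility, I can't find any field that would lead me to the DNA sequence.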

handle = Entrez.efetch(db="nucleotide", id=blast_record.alignments[0].hit_id, rettype="gb", retmode="text")
seq_record=SeqIO.read(handle,"gb")

There is a lot of information in seq_record.features, but I feel there should be a simpler, more obvious way to do this. Any help is much appreciated. Thanks!


2 Answers

0

You can use Entrez.elink to request the UID of the protein sequence that corresponds to the UID of a nucleotide sequence:

from Bio import Entrez
from Bio import SeqIO

Entrez.email = 'seb@free.fr'  # NCBI asks for a contact address
term = 'NM_207618.2'  # for example, an accession.version

### first step, we search for the nucleotide sequence of interest
h_search = Entrez.esearch(db='nucleotide', term=term)
record = Entrez.read(h_search)
h_search.close()
nt_uid = record['IdList'][0]  # here is the UID

### second step, we fetch that nt sequence using its UID
handle_nt = Entrez.efetch(
        db='nucleotide', id=nt_uid, rettype='fasta', retmode='text')
nt_record = SeqIO.read(handle_nt, 'fasta')
handle_nt.close()

### third and most important, we 'link' the UID of the nucleotide
# sequence to the corresponding protein from the appropriate database
h_link = Entrez.elink(
        dbfrom='nucleotide', db='protein',
        linkname='nucleotide_protein', id=nt_uid)
results = Entrez.read(h_link)
h_link.close()

### last, we fetch the amino acid sequence
handle_aa = Entrez.efetch(
        db='protein',
        id=results[0]['LinkSetDb'][0]['Link'][0]['Id'],  # here is the key...
        rettype='fasta', retmode='text')
aa_record = SeqIO.read(handle_aa, 'fasta')
handle_aa.close()

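
Note that the snippet above goes from nucleotide to protein, while the question asks for the opposite direction. The same elink idea can probably be pointed the other way. Here is a minimal sketch, not a tested recipe: the helper names linked_ids and nucleotide_ids_for_protein are made up for illustration, the e-mail address is a placeholder, and the link name "protein_nuccore" mentioned in the comments is an assumption that is worth checking against what elink actually returns for your record.

```python
def linked_ids(elink_result, linkname=None):
    """Collect linked UIDs from a parsed elink result.

    elink_result is what Entrez.read() returns for an elink handle:
    a list of link sets, each with an optional "LinkSetDb" list.
    If linkname is given, only links with that LinkName are kept.
    Returns an empty list when nothing is linked.
    """
    ids = []
    for linkset in elink_result:
        for linkset_db in linkset.get("LinkSetDb", []):
            if linkname is not None and linkset_db.get("LinkName") != linkname:
                continue
            ids.extend(str(link["Id"]) for link in linkset_db["Link"])
    return ids


def nucleotide_ids_for_protein(protein_id, linkname=None):
    """Ask elink for nucleotide UIDs linked to a protein UID (needs network)."""
    # Imported here so that linked_ids() can be tried without Biopython.
    from Bio import Entrez

    Entrez.email = "you@example.org"  # placeholder, replace with your address
    handle = Entrez.elink(dbfrom="protein", db="nuccore", id=protein_id)
    try:
        return linked_ids(Entrez.read(handle), linkname=linkname)
    finally:
        handle.close()


# Possible usage (assumed link name, requires network access):
# nucleotide_ids_for_protein("71743840", linkname="protein_nuccore")
```

Collecting the IDs in a loop instead of indexing results[0]['LinkSetDb'][0]['Link'][0] avoids an IndexError when a record has no links at all. Any UID returned this way can then be passed to Entrez.efetch with db="nuccore".
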
0

You could try accessing the SeqRecord's annotations:

seq_record=SeqIO.read(handle,"gb")
nucleotide_accession = seq_record.annotations["db_source"]

In your example, nucleotide_accession is "REFSEQ: accession NM_000673.4".

Now see whether you can parse those annotations. Going by this test case alone:

nucl_id = nucleotide_accession.split()[-1]
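
Taking the last whitespace-separated token works for this test case, but it silently returns something wrong if the db_source string has a different shape. A slightly more defensive sketch, assuming the accession always follows the word "accession" (the function name accession_from_db_source is made up for illustration):

```python
import re

# Matches the token that follows the word "accession",
# e.g. "REFSEQ: accession NM_000673.4" -> "NM_000673.4"
_ACCESSION_AFTER_KEYWORD = re.compile(r"accession\s+([\w.]+)")


def accession_from_db_source(db_source):
    """Return the accession named in a db_source annotation, or None."""
    match = _ACCESSION_AFTER_KEYWORD.search(db_source)
    if match is None:
        return None
    # Drop a trailing full stop that belongs to the sentence, not the accession.
    return match.group(1).rstrip(".")
```

Returning None instead of a wrong token lets the calling code decide what to do when no accession is found.
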

handle = Entrez.efetch(db="nucleotide",
                       id=nucl_id,
                       rettype="gb",
                       retmode="text")
seq_record = SeqIO.read(handle, "gb")
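
Another place to look, if the record was fetched from the protein database in GenBank format, is the CDS feature: it usually carries a coded_by qualifier that names the nucleotide accession together with the coordinates of the coding region. The following is a rough sketch under that assumption. The helper names parse_coded_by and coding_regions are made up, only the simple "ACCESSION:start..stop" form is handled (join(...) and complement(...) are skipped), and the coordinates in the comment are invented for illustration.

```python
import re

# Simple form only, e.g. "NM_000673.4:100..1227" (coordinates invented here).
_SIMPLE_LOCATION = re.compile(r"^([\w.]+):<?(\d+)\.\.>?(\d+)$")


def parse_coded_by(coded_by):
    """Split a simple coded_by value into (accession, start, stop), else None."""
    match = _SIMPLE_LOCATION.match(coded_by)
    if match is None:
        return None
    accession, start, stop = match.groups()
    return accession, int(start), int(stop)


def coding_regions(seq_record):
    """Collect parsed coded_by values from the CDS features of a protein record."""
    regions = []
    for feature in seq_record.features:
        if feature.type != "CDS":
            continue
        for value in feature.qualifiers.get("coded_by", []):
            parsed = parse_coded_by(value)
            if parsed is not None:
                regions.append(parsed)
    return regions
```

Each tuple could then be handed to Entrez.efetch(db="nuccore", id=accession, rettype="fasta", retmode="text", seq_start=start, seq_stop=stop) to download only the coding region rather than the whole transcript.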
