How can I use Biopython to translate a series of DNA sequences from a FASTA file and extract the protein sequences into a separate field?

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

for record in SeqIO.parse("dnaseq.fasta", "fasta"):
    protein_id = record.id
    protein1 = record.seq.translate(to_stop=True)
    protein2 = record.seq[1:].translate(to_stop=True)
    protein3 = record.seq[2:].translate(to_stop=True)

if len(protein1) > len(protein2) and len(protein1) > len(protein3):
    protein = protein1
elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
    protein = protein2
else:
    protein = protein3

def prot_record(record):
    return SeqRecord(seq = protein, \
             id = ">" + protein_id, \
             description = "translated sequence")

records = map(prot_record, SeqIO.parse("dnaseq.fasta", "fasta"))
SeqIO.write(records, "AAseq.fasta", "fasta")

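2 Answers

User

#1 · Edited 2024-04-23 11:32:38

As others have mentioned, your code iterates over the entire input before it attempts to write any results. I would like to suggest how this can be done with a streaming approach instead: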

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

with open("AAseq.fasta", 'w') as aa_fa:
    for dna_record in SeqIO.parse("dnaseq.fasta", 'fasta'):
        # use both fwd and rev sequences
        dna_seqs = [dna_record.seq, dna_record.seq.reverse_complement()]

        # generate all translation frames
        aa_seqs = (s[i:].translate(to_stop=True) for i in range(3) for s in dna_seqs)

        # select the longest one
        max_aa = max(aa_seqs, key=len)

        # write new record
        aa_record = SeqRecord(max_aa, id=dna_record.id, description="translated sequence")
        SeqIO.write(aa_record, aa_fa, 'fasta')

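The main improvements here are:

Each iteration translates and writes a single record, which minimizes memory usage.
Support for the reverse complement is added.
The translated frames are created with a generator expression, and only the longest one is kept.
The if...elif...else structure is avoided by using max with a key.

User

#2 · Edited 2024-04-23 11:32:38

Your if is outside the for loop, so it runs only once, using the variables with the values they hold at the end of the last iteration of the loop. If you want the if to run on every iteration, indent it to the same level as the code before it: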

for record in SeqIO.parse("dnaseq.fasta", "fasta"):
    protein_id = record.id
    protein1 = record.seq.translate(to_stop=True)
    protein2 = record.seq[1:].translate(to_stop=True)
    protein3 = record.seq[2:].translate(to_stop=True)
    # Same indentation level, still in the loop
    if len(protein1) > len(protein2) and len(protein1) > len(protein3):
        protein = protein1
    elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
        protein = protein2
    else:
        protein = protein3

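Your function prot_record uses the current values of protein and protein_id, which are again the values from the end of the last iteration of the for loop.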
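To make that point concrete, here is a minimal sketch that needs no Biopython. The names make_label and labels are made up for illustration; the loop only stands in for the one in the question. A function that reads a global name looks it up when it is called, not when it is defined, so every call sees the value left over from the last iteration.

```python
# Toy stand-in for the loop in the question: protein_id is rebound each time.
for name in ["seq1", "seq2", "seq3"]:
    protein_id = name


def make_label(_record):
    # protein_id is looked up when the function is called, not when it is defined.
    return "id=" + protein_id


# Every call sees the value from the last loop iteration.
labels = list(map(make_label, ["a", "b", "c"]))
print(labels)  # ['id=seq3', 'id=seq3', 'id=seq3']
```

The same thing happens to protein in the question's code, which is why every output record would carry the last sequence's translation and ID.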

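If I have guessed correctly what you want, one possibility is to put the function definition inside the loop as well, so that the function's behavior depends on the current iteration, and to save the functions in a list for later use when iterating over the records again. I am not sure this works, though: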

[code block not preserved in the source]
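One way the function-in-a-list idea could look is sketched below. This is a reconstruction under assumptions, not the answerer's original code: the helper names longest_forward_frame and build_record_makers are hypothetical, the input records are made up in memory so the sketch runs without dnaseq.fasta, and max is used in place of the if/elif chain (on ties it keeps the earliest frame, whereas the chain falls through to the third). The values are bound through default arguments so each function keeps the protein and ID from its own iteration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


def longest_forward_frame(seq):
    """Translate forward frames 0-2 up to the first stop; return the longest."""
    frames = []
    for offset in range(3):
        sub = seq[offset:]
        sub = sub[: len(sub) - len(sub) % 3]  # drop a trailing partial codon
        frames.append(sub.translate(to_stop=True))
    return max(frames, key=len)  # on ties, max keeps the earliest frame


def build_record_makers(dna_records):
    """Return one zero-argument function per input record."""
    makers = []
    for record in dna_records:
        best = longest_forward_frame(record.seq)

        # Default arguments are evaluated when the function is defined,
        # so each function keeps the values from its own iteration.
        def prot_record(protein=best, protein_id=record.id):
            return SeqRecord(protein, id=protein_id, description="translated sequence")

        makers.append(prot_record)
    return makers


# Hypothetical in-memory input so the sketch runs without dnaseq.fasta.
dna_records = [
    SeqRecord(Seq("ATGACTAAACCCGGGTAA"), id="seq1"),
    SeqRecord(Seq("CATGACTAAACCCGGGTAA"), id="seq2"),
]
proteins = [make() for make in build_record_makers(dna_records)]
for rec in proteins:
    print(rec.id, rec.seq)
```

Building a list of functions keeps one closure per record in memory, so for large files this gives up the advantage of processing records one at a time.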

Another possible approach is to put the translation logic inside the function that builds the record:

[code block not preserved in the source]
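A sketch of what moving the translation into prot_record might look like follows. Again this is an assumed reconstruction rather than the original answer's code. It reads from and writes to in-memory StringIO handles with made-up sequences so it runs without any files; with real data the handles would be replaced by "dnaseq.fasta" and "AAseq.fasta". The record ID is passed without a leading ">" because the FASTA writer adds that character itself.

```python
from io import StringIO

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord


def prot_record(record):
    """Build a protein record from the longest of the three forward frames."""
    frames = []
    for offset in range(3):
        sub = record.seq[offset:]
        sub = sub[: len(sub) - len(sub) % 3]  # drop a trailing partial codon
        frames.append(sub.translate(to_stop=True))
    protein = max(frames, key=len)
    return SeqRecord(protein, id=record.id, description="translated sequence")


# Hypothetical in-memory FASTA input and output, standing in for real files.
fasta_in = StringIO(">seq1\nATGACTAAACCCGGGTAA\n>seq2\nCATGACTAAACCCGGGTAA\n")
fasta_out = StringIO()

# map is lazy, so records are translated one at a time as SeqIO.write consumes them.
written = SeqIO.write(map(prot_record, SeqIO.parse(fasta_in, "fasta")), fasta_out, "fasta")
print(fasta_out.getvalue())
```

Because everything the function needs comes from its record argument, it no longer depends on variables left behind by an earlier loop.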

This is probably cleaner. I have not tested it.