Is there a way to translate a list of DNA sequences into amino acids when a sequence's length is not divisible by three?

input_file = 'inserts.txt'
with open(input_file, 'r') as f:
    seq = f.readlines()

seq = [s.replace(" ", "").replace(",", "").replace("'", "").replace("\n", "") for s in seq]
print("\n".join(seq[:99]))
print("\nType lookup", type(seq))

# translation function and NNN codon table as a dict object
def translate(seq):
    nnn_table = {
        'TTT': 'F', 'TCT': 'S', 'TAT': 'Y', 'TGT': 'C',
        'TTC': 'F', 'TCC': 'S', 'TAC': 'Y', 'TGC': 'C',
        'TTA': 'L', 'TCA': 'S', 'TAA': '*', 'TGA': '*',
        'TTG': 'L', 'TCG': 'S', 'TAG': '*', 'TGG': 'W',
        'CTT': 'L', 'CCT': 'P', 'CAT': 'H', 'CGT': 'R',
        'CTC': 'L', 'CCC': 'P', 'CAC': 'H', 'CGC': 'R',
        'CTA': 'L', 'CCA': 'P', 'CAA': 'Q', 'CGA': 'R',
        'CTG': 'L', 'CCG': 'P', 'CAG': 'Q', 'CGG': 'R',
        'ATT': 'I', 'ACT': 'T', 'AAT': 'N', 'AGT': 'S',
        'ATC': 'I', 'ACC': 'T', 'AAC': 'N', 'AGC': 'S',
        'ATA': 'I', 'ACA': 'T', 'AAA': 'K', 'AGA': 'R',
        'ATG': 'M', 'ACG': 'T', 'AAG': 'K', 'AGG': 'R',
        'GTT': 'V', 'GCT': 'A', 'GAT': 'D', 'GGT': 'G',
        'GTC': 'V', 'GCC': 'A', 'GAC': 'D', 'GGC': 'G',
        'GTA': 'V', 'GCA': 'A', 'GAA': 'E', 'GGA': 'G',
        'GTG': 'V', 'GCG': 'A', 'GAG': 'E', 'GGG': 'G',
    }
    # two loops, outer one to loop over the list of string sequences
    # inner one loops over each sequence
    nnn_aa_seq = []
    # generate amino acid sequence
    # add option for sequence or codon not divisible by three
    print("\nStarting to translate:")
    for dna in seq:
        protein_seq = ""
        for i in range(0, len(dna), 3):
            if len(dna) % 3 == 0:
                nnn_codon = nnn_table[dna[i:i + 3]]
                protein_seq += nnn_codon
            nnn_aa_seq.append(protein_seq)
    return "".join(nnn_aa_seq)

translate_nnn = translate(seq)
print(translate_nnn)
# do other stuff

Starting to translate:
**T*TA*TA**TA*Y*TA*YR*TA*YR**TA*YR*L*TA*YR*LR*TA*YR*LRR*TA*YR*LRRQ*TA*YR*LRRQ**TA*YR*LRRQ*Q*TA*YR*LRRQ*QQ*TA*YR*LRRQ*QQP*TA*YR*LRRQ*QQPS*TA*YR*LRRQ*QQPSP*TA*YR*LRRQ*QQPSPT*TA*YR*LRRQ*QQPSPTH*TA*YR*LRRQ*QQPSPTHN*TA*YR*

2 Answers

User

#1 · Edited 2024-06-11 06:43:04

Look at this part of your code:

for dna in seq:
    protein_seq = ""
    for i in range(0, len(dna), 3):
        if len(dna) % 3 == 0:
            nnn_codon = nnn_table[dna[i:i + 3]]
            protein_seq += nnn_codon
        nnn_aa_seq.append(protein_seq)

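This means you check whether len(dna) is divisible by 3 on every pass through the inner loop, which is unnecessary. The length of dna is constant during each iteration of the outer for loop, so you can check it once before the inner for loop starts and print an explicit message about it, as follows: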

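for dna in seq:
    protein_seq = ""
    if len(dna) % 3 != 0:
        print('DNA length not divisible by 3')
        continue  # go to next element of seq
    for i in range(0, len(dna), 3):
        nnn_codon = nnn_table[dna[i:i + 3]]
        protein_seq += nnn_codon
    # append once per sequence, after the inner loop has finished;
    # appending inside the loop stores every partial prefix as well
    nnn_aa_seq.append(protein_seq)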
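
For reference, here is a minimal self-contained sketch of the same idea. The names `CODON_TABLE` and `translate_all` and the sample sequences are made up for illustration, and the codon table is built compactly from the standard genetic code rather than typed out as in the question, so treat it as one possible way to write this, not the definitive one.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_all(seqs):
    """Translate each DNA string; skip those whose length is not a multiple of 3."""
    proteins = []
    for dna in seqs:
        # The length check happens once per sequence, before any codon lookup.
        if len(dna) % 3 != 0:
            print(f"Skipping {dna!r}: length {len(dna)} is not divisible by 3")
            continue
        protein = "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))
        proteins.append(protein)  # one entry per translated sequence
    return proteins


# Hypothetical input: the middle sequence has length 5 and is skipped.
print(translate_all(["ATGGCCTAA", "ATGGC", "TTTTGG"]))  # ['MA*', 'FW']
```

Note that this sketch assumes the input contains only the uppercase bases A, C, G and T; anything else (lowercase, N, stray characters) would raise a KeyError on lookup.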

User

#2 · Edited 2024-06-11 06:43:04

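If you want to be able to tell the sequences apart in the result, you should not return "".join(nnn_aa_seq) but return "\n".join(nnn_aa_seq), or better still, just return the whole list with return nnn_aa_seq.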
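
A quick illustration of the difference, using two made-up translations in place of the real nnn_aa_seq:

```python
# Hypothetical per-sequence translations, standing in for nnn_aa_seq.
proteins = ["MA*", "FW"]

joined = "".join(proteins)         # boundaries between sequences are lost
per_line = "\n".join(proteins)     # one protein per line, still a single string

print(joined)    # MA*FW
print(per_line)
print(proteins)  # the list keeps every sequence separately addressable
```

Returning the list leaves the formatting decision to the caller, which tends to be easier to work with downstream.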

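As for the sequences whose length isn't divisible by 3, why do you think they are protein-coding sequences? And if they are, is it realistic to expect that your marker-matching pattern captured the start of the gene exactly? I don't see many start codons.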

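If you think these are gene fragments, then each sequence has three possible translations, depending on how you align the codon reading frame. So you could try something like this: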

for dna in seq:
    protein_seq_candidates = []
    for start in (0, 1, 2):
        protein_seq = []
        # stop at len(dna) - 2 so an incomplete trailing codon
        # is never looked up (it would raise a KeyError)
        for i in range(start, len(dna) - 2, 3):
            nnn_codon = nnn_table[dna[i:i + 3]]
            protein_seq.append(nnn_codon)
        protein_seq_candidates.append(protein_seq)
    # Compare the three aa sequences in terms of biological plausibility,
    # e.g. check for stop codons within the sequences, aa distribution, etc.
    # Pick the best one (or none) and append it to nnn_aa_seq.
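
To make the last step concrete, here is a rough sketch that translates all three forward frames and picks the one with the fewest stop codons. The stop-codon count is only a placeholder heuristic I am assuming for illustration; the helper names (`translate_frame`, `best_frame`) and the example sequence are invented, and a real plausibility check would likely need more than this.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_frame(dna, start):
    """Translate dna beginning at offset start, ignoring an incomplete last codon."""
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(start, len(dna) - 2, 3)
    )


def best_frame(dna):
    """Return (start, protein) for the frame with the fewest stop codons.

    Assumed heuristic: fewer '*' means more plausible. Ties go to the
    earliest frame because min() returns the first minimal element.
    """
    candidates = [(start, translate_frame(dna, start)) for start in (0, 1, 2)]
    return min(candidates, key=lambda cand: cand[1].count("*"))


# Hypothetical fragment whose coding frame starts at offset 1.
dna = "AATGGCTGAACTAGCC"
for start in (0, 1, 2):
    print(start, translate_frame(dna, start))  # 0 NG*TS / 1 MAELA / 2 WLN*
print(best_frame(dna))  # (1, 'MAELA')
```

This only covers the three frames of the given strand. If the fragments could come from either strand, the reverse complement would add three more candidates.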
