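<p>Since you tagged this with Biopython, I assume you know Biopython. Have you checked the documentation? <a href="http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc231" rel="noreferrer">http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc231</a> might help.</p>
<p>I adapted the code from the link above slightly so that it works on your sequence:</p>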
<pre><code>from Bio.Seq import Seq
seq = Seq("CCTCAGCGAGGACAGCAAGGGACTAGCCAGGAGGGAGAACAGAAACTCCAGAACATCTTGGAAATAGCTCCCAGAAAAGCAAGCAGCCAACCAGGCAGGTTCTGTCCCTTTCACTCACTGGCCCAAGGCGCCACATCTCCCTCCAGAAAAGACACCATGAGCACAGAAAGCATGATCCGCGACGTGGAACTGGCAGAAGAGGCACTCCCCCAAAAGATGGGGGGCTTCCAGAACTCCAGGCGGTGCCTATGTCTCAGCCTCTTCTCATTCCTGCTTGTGGCAGGGGCCACCACGCTCTTCTGTCTACTGAACTTCGGGGTGATCGGTCCCCAAAGGGATGAGAAGTTCCCAAATGGCCTCCCTCTCATCAGTTCTATGGCCCAGACCCTCACACTCAGATCATCTTCTCAAAATTCGAGTGACAAGCCTGTAGCCCACGTCGTAGCAAACCACCAAGTGGAGGAGCAGCTGGAGTGGCTGAGCCAGCGCGCCAACGCCCTCCTGGCCAACGGCATGGATCTCAAAGACAACCAACTAGTGGTGCCAGCCGATGGGTTGTACCTTGTCTACTCCCAGGTTCTCTTCAAGGGACAAGGCTGCCCCGACTACGTGCTCCTCACCCACACCGTCAGCCGATTTGCTATCTCATACCAGGAGAAAGTCAACCTCCTCTCTGCCGTCAAGAGCCCCTGCCCCAAGGACACCCCTGAGGGGGCTGAGCTCAAACCCTGGTATGAGCCCATATACCTGGGAGGAGTCTTCCAGCTGGAGAAGGGGGACCAACTCAGCGCTGAGGTCAATCTGCCCAAGTACTTAGACTTTGCGGAGTCCGGGCAGGTCTACTTTGGAGTCATTGCTCTGTGAAGGGAATGGGTGTTCATCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTATCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAAGATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGGAGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGATCTCAGGCCTTCCTACCTTCAGACCTTTCCAGATTCTTCCCTGAGGTGCAATGCACAGCCTTCCTCACAGAGCCAGCCCCCCTCTATTTATATTTGCACTTATTATTTATTATTTATTTATTATTTATTTATTTGCTTATGAATGTATTTATTTGGAAGGCCGGGGTGTCCTGGAGGACCCAGTGTGGGAAGCTGTCTTCAGACAGACATGTTTTCTGTGAAAACGGAGCTGAGCTGTCCCCACCTGGCCTCTCTACCTTGTTGCCTCCTCTTTTGCTTATGTTTAAAACAAAATATTTATCTAACCCAATTGTCTTAATAACGCTGATTTGGTGACCAGGCTGTCGCTACATCACTGAACCTCTGCTCCCCACGGGAGCCGTGACTGTAATCGCCCTACGGGTCATTGAGAGAAATAA")
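table = 1  # NCBI translation table 1 (the standard code)
min_pro_len = 100  # minimum protein length in amino acids
# Scan the forward strand and its reverse complement, three frames each.
for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
    for frame in range(3):
        # Translate the frame, then split the protein string at stop codons ("*").
        for pro in nuc[frame:].translate(table).split("*"):
            if len(pro) >= min_pro_len:
                print("%s...%s - length %i, strand %i, frame %i" % (pro[:30], pro[-3:], len(pro), strand, frame))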
</code></pre>
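<p>The ORFs are translated as well. You can pick a different translation table; see <a href="http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:translation" rel="noreferrer">http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:translation</a>.</p>
<p><strong>Edit:</strong> Explanation of the code:</p>
<p>At the top I create a sequence object from your string; note the pattern <code>seq = Seq("ACGT")</code>.
The two outer for loops produce the 6 reading frames. The innermost for loop translates each frame according to the chosen translation table, which returns an amino acid chain in which every stop codon is encoded as <code>*</code>. The <code>split</code> function splits this string at those placeholders and drops them, yielding a list of candidate protein sequences. Setting <code>min_pro_len</code> defines the minimum amino acid chain length a protein must have to be reported. <code>table = 1</code> selects the standard table. Have a look at <a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG1" rel="noreferrer">http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG1</a>: there you can see that the start codon is <code>AUG</code> (equivalent to <code>ATG</code>) and the stop codons are <code>TAA</code>, <code>TAG</code> and <code>TGA</code>, just as you wanted. You can also use a different translation table.</p>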
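<p>To make the six frames concrete, here is a minimal sketch in plain Python (no Biopython) that lists the codons each frame would hand to the translation step. The helper names <code>reverse_complement</code> and <code>six_frames</code> and the short toy sequence are made up for illustration, and the sketch assumes unambiguous upper-case DNA (only A, C, G, T).</p>

```python
# Complement map for unambiguous DNA; str.translate swaps each base.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case ACGT string."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Yield (strand, frame, codons) for all six reading frames."""
    for strand, nuc in [(+1, dna), (-1, reverse_complement(dna))]:
        for frame in range(3):
            sub = nuc[frame:]
            # Keep only complete codons; a trailing partial codon is dropped.
            codons = [sub[i:i + 3] for i in range(0, len(sub) - 2, 3)]
            yield strand, frame, codons


# Toy input, not the sequence from the question.
for strand, frame, codons in six_frames("ATGGCCTAAG"):
    print(strand, frame, codons)
```

<p>Each of the six codon lists corresponds to one <code>nuc[frame:]</code> slice in the loop above; Biopython then maps every codon to an amino acid or to <code>*</code>.</p>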
<p>If you add</p>
<pre><code>print(nuc[frame:].translate(table))
</code></pre>
<p>inside the second for loop, you get something like this:</p>
<pre><code>PQRGQQGTSQEGEQKLQNILEIAPRKASSQPGRFCPFHSLAQGATSPSRKDTMSTESMIRDVELAEEALPQKMGGFQNSRRCLCLSLFSFLLVAGATTLFCLLNFGVIGPQRDEKFPNGLPLISSMAQTLTLRSSSQNSSDKPVAHVVANHQVEEQLEWLSQRANALLANGMDLKDNQLVVPADGLYLVYSQVLFKGQGCPDYVLLTHTVSRFAISYQEKVNLLSAVKSPCPKDTPEGAELKPWYEPIYLGGVFQLEKGDQLSAEVNLPKYLDFAESGQVYFGVIAL*REWVFIHSLPSPHSDPFTLTPLLSTPQSPQSVSF*LRKGIMAQGPTLCSELSTTTQKHKMLGQ*PGLWASHAPPSRTQMGFPNSLEPRMSIPEFCKGRVVRLPLSQNEAG*DLRPSYLQTFPDSSLRCNAQPSSQSQPPSIYICTYYLLFIYYLFICL*MYLFGRPGCPGGPSVGSCLQTDMFSVKTELSCPHLASLPCCLLFCLCLKQNIYLTQLS**R*FGDQAVATSLNLCSPREP*L*SPYGSLREI
</code></pre>
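<p>(Note the asterisks at the stop codon positions.)</p>
<p><strong>Edit: answer to the second question:</strong></p>
<p>You have to return the string that you want to write to the file. Build the output string and return it at the end of the function:</p>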
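<p>Splitting at the asterisks only guarantees that each fragment ends at a stop codon (or at the end of the sequence); it does not guarantee that the fragment begins with a methionine. If you want each candidate to start at its first <code>M</code>, the following is a minimal sketch in plain Python that works on an already translated string. The function name <code>orfs_from_translation</code> and the toy input are made up for illustration and are not part of Biopython.</p>

```python
def orfs_from_translation(translated, min_pro_len=100):
    """Split a translated frame at '*' and trim each fragment to its first 'M'.

    Fragments without any 'M' are skipped. The last fragment is kept even if
    the sequence ends before a stop codon, as in the loop above.
    """
    candidates = []
    for fragment in translated.split("*"):
        start = fragment.find("M")
        if start == -1:
            continue  # no start codon in this fragment
        protein = fragment[start:]
        if len(protein) >= min_pro_len:
            candidates.append(protein)
    return candidates


# Toy translated frame, not real output.
example = "AAMKKL*MC*QQQ*MAAAAAA"
print(orfs_from_translation(example, min_pro_len=4))
```

<p>With Biopython you would pass <code>str(nuc[frame:].translate(table))</code> as the first argument. Whether trimming to the first <code>M</code> is appropriate depends on how you define an ORF.</p>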
<pre><code>output = "selected tupple is " + str(selected_tupple) + "\n"
output += final_seq + "\n"
output += "The longest orf length is " + str(max_val) + "\n"
return output
</code></pre>
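<p>For completeness, here is a hedged sketch of how the returned string could then be written to a file by the caller. The function name <code>format_report</code>, the file name <code>orf_report.txt</code> and the argument values are placeholders, since the surrounding function from the question is not shown here.</p>

```python
import os
import tempfile


def format_report(selected_tupple, final_seq, max_val):
    """Build the report text and return it instead of printing it."""
    output = "selected tupple is " + str(selected_tupple) + "\n"
    output += final_seq + "\n"
    output += "The longest orf length is " + str(max_val) + "\n"
    return output


# Placeholder values; in your code these come from the ORF search.
report = format_report((1, 0), "MKKL", 4)

# Hypothetical output location; replace with the path you actually want.
path = os.path.join(tempfile.gettempdir(), "orf_report.txt")
with open(path, "w") as handle:
    handle.write(report)
```

<p>Returning the string keeps the function free of file handling, so the same result can be printed, written to disk, or checked in a test.</p>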