定义一个计算氨基酸相对频率的函数
我正在尝试计算一段DNA序列中每个密码子的出现频率。
举个例子:
sequence = 'ATGAAGAAA'
codons = ['ATG', 'AAG', 'AAA']
对于每个密码子XX:
frequency = codons.count(XX)/(codons.count(XX)+codons.count(XX2)+codons.count(XX3))
需要注意的是,XX2和XX3不一定会出现在序列中。有些密码子可能有多个不同的形式。
比如:赖氨酸有两个密码子,分别是AAA和AAG。
所以我们要计算的频率是:
AAA = codons.count('AAA')/(codons.count('AAA') + codons.count('AAG'))
我该如何对列表中的每个密码子都进行这样的计算呢?我该如何处理多个密码子的情况?
5 个回答
一个包含所有64个密码子的密码子表,甚至包括那些不退化的密码子(它们组成一个元素组)
在遍历过程中,同时统计每个密码子组的出现次数,以及密码子的出现次数
一个包含编码氨基酸名称的密码子表 -> 显示效果很好
代码:
from collections import defaultdict
# the first 600 nucleotides from GenBank: AAHX01097212.1
adn = ("tcccccgcagcttcgggaacgtgcgggctcgggagggaggggcctggcgccgggcgcgcg"
"cctgcgccccaccccgccccaccctggcgggtctcgcgcgcccggcccgcctcctgtcaa"
"ccccagcgcggcggtcaggtggtccccagcccttggccccagcctccagcttcctggtcc"
"ctcgggctctgagtcctgtctccggcagatcgcctttctgattgttctcctgcgcagctg"
"gaggtgtatagcccctagccgagctatggtgcctcagcagatgtgaggaggtagtgggtc"
"aggataaacccgcgcactccataataacgtgccagggctcagtgacttgggtctgcatta")
arn = adn.upper().replace('T','U')
#RNA codon table from http://en.wikipedia.org/wiki/Genetic_code
codon_table = ((('GCU', 'GCC', 'GCA', 'GCG'), 'Alanine'),
(('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'), 'Leucine'),
(('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'), 'Arginine'),
(('AAA', 'AAG'), 'Lysine'),
(('AAU', 'AAC'), 'Asparagine'),
(('AUG',), 'Methionine'),
(('GAU', 'GAC'), 'Aspartic acid' ),
(('UUU', 'UUC'), 'Phenylalanine'),
(('UGU', 'UGC'), 'Cysteine'),
(('CCU', 'CCC', 'CCA', 'CCG'), 'Proline') ,
(('CAA', 'CAG'), 'Glutamine'),
(('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'), 'Serine'),
(('GAA', 'GAG'), 'Glutamic acid'),
(('ACU', 'ACC', 'ACA', 'ACG'), 'Threonine'),
(('GGU', 'GGC', 'GGA', 'GGG'), 'Glycine'),
(('UGG',), 'Tryptophane'),
(('CAU', 'CAC'), 'Histidine'),
(('UAU', 'UAC'), 'Tyrosine'),
(('AUU', 'AUC', 'AUA'), 'Isoleucine'),
(('GUU', 'GUC', 'GUA', 'GUG'), 'Valine'),
(('UAA', 'UGA', 'UAG'), 'STOP') )
siblings = dict( (cod, codgroup) for codgroup,aa in codon_table for cod in codgroup )
cod_count, grp_count, freq = defaultdict(int), defaultdict(int), {}
for cod in (arn[i:i+3] for i in xrange(0,len(arn),3)):
cod_count[cod] += 1
grp_count[siblings[cod]] += 1
for cod in siblings.iterkeys(): # the keys of siblings are the 64 codons
if siblings[cod] in grp_count:
freq[cod] = float(cod_count[cod])/grp_count[siblings[cod]]
else:
freq[cod] = '-* Missing *-'
display = '\n'.join(aa.rjust(13)+\
'\n'.join('%s %-16s' % (cod.rjust(18 if i else 5),freq[cod])
for i,cod in enumerate(codgrp))
for codgrp,aa in codon_table)
# editing addition:
def outputResults(filename,arn,codon_table,displ):
li = ['This file is named %s' % filename]
li.append('The sequence of ARN:\n%s' %\
'\n'.join(arn[i:i+42] for i in xrange(0,len(arn),42)))
li.append('Size of the sequence : '+str(len(arn)))
li.append('Codon_table:\n'+\
'\n'.join('%s : %s' % (u,v) for u,v in codon_table))
li.append('Frequency results :\n'+displ)
with open(filename,'w') as f:
f.writelines('\n\n'.join(li))
outputResults('ARN_mem.txt',arn,codon_table,display)
print display
.
编辑
我添加了一个函数outputResults(),用来展示如何将数据和结果记录到文件中
生成的文件内容是:
This file is named ARN_mem.txt
The sequence of ARN:
UCCCCCGCAGCUUCGGGAACGUGCGGGCUCGGGAGGGAGGGG
CCUGGCGCCGGGCGCGCGCCUGCGCCCCACCCCGCCCCACCC
UGGCGGGUCUCGCGCGCCCGGCCCGCCUCCUGUCAACCCCAG
CGCGGCGGUCAGGUGGUCCCCAGCCCUUGGCCCCAGCCUCCA
GCUUCCUGGUCCCUCGGGCUCUGAGUCCUGUCUCCGGCAGAU
CGCCUUUCUGAUUGUUCUCCUGCGCAGCUGGAGGUGUAUAGC
CCCUAGCCGAGCUAUGGUGCCUCAGCAGAUGUGAGGAGGUAG
UGGGUCAGGAUAAACCCGCGCACUCCAUAAUAACGUGCCAGG
GCUCAGUGACUUGGGUCUGCAUUA
Size of the sequence : 360
Codon_table:
('GCU', 'GCC', 'GCA', 'GCG') : Alanine
('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG') : Leucine
('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG') : Arginine
('AAA', 'AAG') : Lysine
('AAU', 'AAC') : Asparagine
('AUG',) : Methionine
('GAU', 'GAC') : Aspartic acid
('UUU', 'UUC') : Phenylalanine
('UGU', 'UGC') : Cysteine
('CCU', 'CCC', 'CCA', 'CCG') : Proline
('CAA', 'CAG') : Glutamine
('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC') : Serine
('GAA', 'GAG') : Glutamic acid
('ACU', 'ACC', 'ACA', 'ACG') : Threonine
('GGU', 'GGC', 'GGA', 'GGG') : Glycine
('UGG',) : Tryptophane
('CAU', 'CAC') : Histidine
('UAU', 'UAC') : Tyrosine
('AUU', 'AUC', 'AUA') : Isoleucine
('GUU', 'GUC', 'GUA', 'GUG') : Valine
('UAA', 'UGA', 'UAG') : STOP
Frequency results :
Alanine GCU 0.1875
GCC 0.375
GCA 0.25
GCG 0.1875
Leucine UUA 0.125
UUG 0.0
CUU 0.25
CUC 0.375
etc.............
如果你的序列是正确的阅读框:
>>> import collections
>>>
>>> code = { 'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
... 'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
... 'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
... 'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',
... 'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
... 'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
... 'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
... 'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',
... 'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
... 'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
... 'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
... 'atg': 'M', 'acg': 'T', 'aag': 'K', 'agg': 'R',
... 'gtt': 'V', 'gct': 'A', 'gat': 'D', 'ggt': 'G',
... 'gtc': 'V', 'gcc': 'A', 'gac': 'D', 'ggc': 'G',
... 'gta': 'V', 'gca': 'A', 'gaa': 'E', 'gga': 'G',
... 'gtg': 'V', 'gcg': 'A', 'gag': 'E', 'ggg': 'G'
... }
>>> def count_codons(cds):
... counts = collections.defaultdict(int)
... for i in range(0,len(cds),3):
... codon = cds[i:i+3]
... counts[codon] += 1
... return counts
...
>>> def translate(cds, code):
... return "".join((code[cds[i:i+3]] for i in range(0, len(cds), 3)))
...
>>> seq = 'ATGAAGAAA'
>>>
>>> codon_counts = count_codons(seq.lower())
>>> trans_seq = translate(seq.lower(), code)
>>>
>>> [(codon, code[codon], float(codon_counts[codon])/trans_seq.count(code[codon])) for codon in codon_counts.keys()]
[('atg', 'M', 1.0), ('aag', 'K', 0.5), ('aaa', 'K', 0.5)]
>>>
其他信息:
我觉得你是在问一种叫做密码子使用的东西。
网上有一些工具可以帮助你找到密码子的使用情况。这个工具还支持离线使用。
http://www.bioinformatics.org/sms2/codon_usage.html
结果中(在这个“Fraction”就是你想要的内容):
Results for 9 residue sequence "sample sequence one" starting "ATGAAGAAA"
AmAcid Codon Number /1000 Fraction ..
Ala GCG 0.00 0.00 0.00
Ala GCA 0.00 0.00 0.00
Ala GCT 0.00 0.00 0.00
Ala GCC 0.00 0.00 0.00
Cys TGT 0.00 0.00 0.00
Cys TGC 0.00 0.00 0.00
Asp GAT 0.00 0.00 0.00
Asp GAC 0.00 0.00 0.00
Glu GAG 0.00 0.00 0.00
Glu GAA 0.00 0.00 0.00
Phe TTT 0.00 0.00 0.00
Phe TTC 0.00 0.00 0.00
Gly GGG 0.00 0.00 0.00
Gly GGA 0.00 0.00 0.00
Gly GGT 0.00 0.00 0.00
Gly GGC 0.00 0.00 0.00
His CAT 0.00 0.00 0.00
His CAC 0.00 0.00 0.00
Ile ATA 0.00 0.00 0.00
Ile ATT 0.00 0.00 0.00
Ile ATC 0.00 0.00 0.00
Lys AAG 1.00 333.33 0.50
Lys AAA 1.00 333.33 0.50
Leu TTG 0.00 0.00 0.00
Leu TTA 0.00 0.00 0.00
Leu CTG 0.00 0.00 0.00
Leu CTA 0.00 0.00 0.00
Leu CTT 0.00 0.00 0.00
Leu CTC 0.00 0.00 0.00
Met ATG 1.00 333.33 1.00
Asn AAT 0.00 0.00 0.00
Asn AAC 0.00 0.00 0.00
Pro CCG 0.00 0.00 0.00
Pro CCA 0.00 0.00 0.00
Pro CCT 0.00 0.00 0.00
Pro CCC 0.00 0.00 0.00
Gln CAG 0.00 0.00 0.00
Gln CAA 0.00 0.00 0.00
Arg AGG 0.00 0.00 0.00
Arg AGA 0.00 0.00 0.00
Arg CGG 0.00 0.00 0.00
Arg CGA 0.00 0.00 0.00
Arg CGT 0.00 0.00 0.00
Arg CGC 0.00 0.00 0.00
Ser AGT 0.00 0.00 0.00
Ser AGC 0.00 0.00 0.00
Ser TCG 0.00 0.00 0.00
Ser TCA 0.00 0.00 0.00
Ser TCT 0.00 0.00 0.00
Ser TCC 0.00 0.00 0.00
Thr ACG 0.00 0.00 0.00
Thr ACA 0.00 0.00 0.00
Thr ACT 0.00 0.00 0.00
Thr ACC 0.00 0.00 0.00
Val GTG 0.00 0.00 0.00
Val GTA 0.00 0.00 0.00
Val GTT 0.00 0.00 0.00
Val GTC 0.00 0.00 0.00
Trp TGG 0.00 0.00 0.00
Tyr TAT 0.00 0.00 0.00
Tyr TAC 0.00 0.00 0.00
End TGA 0.00 0.00 0.00
End TAG 0.00 0.00 0.00
End TAA 0.00 0.00 0.00
cusp是来自EMBOSS的密码子使用工具,值得一看。
你可能还想了解一下BioPython,它可以用来处理生物序列。我相信他们有一个密码子使用的模块。
使用 defaultdict
from collections import defaultdict
mydict = defaultdict(int)
for aa in mysecuence:
mydict[aa] +=1
这适用于氨基酸(蛋白质)。
对于密码子,你应该每次跳过三个位置来遍历序列,以获取 defaultdict 的键。例如:
>>> mysec = "GAUCACTUGCCA"
>>> a = [mysec[i:i+3] for i in range(0,len(mysec), 3)]
>>> print a
['GAU', 'CAC', 'TUG', 'CCA']
编辑:如果你想计算退化现象,你需要准备一个字典,将每个密码子(键)与它的退化密码子(值,密码子的列表)关联起来。为了计算频率,你可以从 defaultdict 中获取每个密码子的计数,然后对每个密码子,计算从上面提到的密码子字典中读取的退化密码子的计数总和。这样你就可以计算出频率。
编辑 2:这里有一个真实的例子:
from collections import defaultdict
#the first 600 nucleotides from GenBank: AAHX01097212.1
rna = ("tcccccgcagcttcgggaacgtgcgggctcgggagggaggggcctggcgccgggcgcgcg"
"cctgcgccccaccccgccccaccctggcgggtctcgcgcgcccggcccgcctcctgtcaa"
"ccccagcgcggcggtcaggtggtccccagcccttggccccagcctccagcttcctggtcc"
"ctcgggctctgagtcctgtctccggcagatcgcctttctgattgttctcctgcgcagctg"
"gaggtgtatagcccctagccgagctatggtgcctcagcagatgtgaggaggtagtgggtc"
"aggataaacccgcgcactccataataacgtgccagggctcagtgacttgggtctgcatta")
seq = rna.upper().replace('T', 'U')
#RNA codon table from http://en.wikipedia.org/wiki/Genetic_code
degenerated = (('GCU', 'GCC', 'GCA', 'GCG'),
('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'),
('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
('AAA', 'AAG'), ('AAU', 'AAC'), ('GAU', 'GAC'),
('UUU', 'UUC'), ('UGU', 'UGC'), ('CCU', 'CCC', 'CCA', 'CCG'),
('CAA', 'CAG'), ('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'),
('GAA', 'GAG'), ('ACU', 'ACC', 'ACA', 'ACG'),
('GGU', 'GGC', 'GGA', 'GGG'), ('CAU', 'CAC'), ('UAU', 'UAC'),
('AUU', 'AUC', 'AUA'), ('GUU', 'GUC', 'GUA', 'GUG'),
('UAA', 'UGA', 'UAG'))
#prepare the dictio of degenerated codons
degen_dict = {}
for codons in degenerated:
for codon in codons:
degen_dict[codon] = codons
#query_codons
max_seq = len(seq)
query_codons = [seq[i:i+3] for i in range(0, max_seq, 3)]
#prepare dictio of counts:
counts = defaultdict(int)
for codon in query_codons:
counts[codon] +=1
#actual calculation of frecuencies
data = {}
for codon in query_codons:
if codon in degen_dict:
totals = sum(counts[deg] for deg in degen_dict[codon])
frecuency = float(counts[codon]) / totals
else:
frecuency = 1.00
data[codon] = frecuency
#print results
for codon, frecuency in data.iteritems():
print "%s -> %.2f" %(codon, frecuency)
#produces:
GUC -> 0.57
AUA -> 1.00
ACG -> 0.50
AAC -> 1.00
CCU -> 0.25
UAU -> 1.00
..........
GCU -> 0.19
GAU -> 1.00
UAG -> 0.33
CUC -> 0.38
UUA -> 0.13
UGA -> 0.33