定义一个计算氨基酸相对频率的函数

3 投票
5 回答
7279 浏览
提问于 2025-04-16 16:21

我正在尝试计算一段DNA序列中每个密码子的出现频率。

举个例子:

sequence = 'ATGAAGAAA'
codons = ['ATG', 'AAG', 'AAA']

对于每个密码子XX:

frequency  = codons.count(XX)/(codons.count(XX)+codons.count(XX2)+codons.count(XX3))

需要注意的是,XX2和XX3不一定会出现在序列中。有些密码子可能有多个不同的形式。

比如:赖氨酸有两个密码子,分别是AAA和AAG。

所以我们要计算的频率是:

AAA = codons.count('AAA')/(codons.count('AAA') + codons.count('AAG'))

我该如何对列表中的每个密码子都进行这样的计算呢?我该如何处理多个密码子的情况?

5 个回答

1
  • 一个包含所有64个密码子的密码子表,甚至包括那些不退化的密码子(它们组成一个元素组)

  • 在遍历过程中,同时统计每个密码子组的出现次数,以及密码子的出现次数

  • 一个包含编码氨基酸名称的密码子表 -> 显示效果很好

代码:

from collections import defaultdict

# the first 600 nucleotides from GenBank: AAHX01097212.1
adn = ("tcccccgcagcttcgggaacgtgcgggctcgggagggaggggcctggcgccgggcgcgcg"
       "cctgcgccccaccccgccccaccctggcgggtctcgcgcgcccggcccgcctcctgtcaa"
       "ccccagcgcggcggtcaggtggtccccagcccttggccccagcctccagcttcctggtcc"
       "ctcgggctctgagtcctgtctccggcagatcgcctttctgattgttctcctgcgcagctg"
       "gaggtgtatagcccctagccgagctatggtgcctcagcagatgtgaggaggtagtgggtc"
       "aggataaacccgcgcactccataataacgtgccagggctcagtgacttgggtctgcatta")

arn = adn.upper().replace('T','U')

#RNA codon table from http://en.wikipedia.org/wiki/Genetic_code
codon_table = ((('GCU', 'GCC', 'GCA', 'GCG'),  'Alanine'),
               (('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'),  'Leucine'),
               (('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),  'Arginine'),
               (('AAA', 'AAG'),  'Lysine'),
               (('AAU', 'AAC'),  'Asparagine'),
               (('AUG',),  'Methionine'),
               (('GAU', 'GAC'),  'Aspartic acid' ),              
               (('UUU', 'UUC'),  'Phenylalanine'),
               (('UGU', 'UGC'),  'Cysteine'),
               (('CCU', 'CCC', 'CCA', 'CCG'),  'Proline') ,
               (('CAA', 'CAG'),  'Glutamine'),
               (('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'),  'Serine'),
               (('GAA', 'GAG'),  'Glutamic acid'),
               (('ACU', 'ACC', 'ACA', 'ACG'),  'Threonine'),
               (('GGU', 'GGC', 'GGA', 'GGG'),  'Glycine'),
               (('UGG',),  'Tryptophane'),
               (('CAU', 'CAC'),  'Histidine'),
               (('UAU', 'UAC'),  'Tyrosine'),
               (('AUU', 'AUC', 'AUA'),  'Isoleucine'),
               (('GUU', 'GUC', 'GUA', 'GUG'),  'Valine'),
               (('UAA', 'UGA', 'UAG'),  'STOP')            )

siblings = dict( (cod, codgroup) for codgroup,aa in codon_table for cod in codgroup )

cod_count, grp_count, freq = defaultdict(int), defaultdict(int), {}

for cod in (arn[i:i+3] for i in xrange(0,len(arn),3)):
    cod_count[cod] += 1
    grp_count[siblings[cod]] += 1

for cod in siblings.iterkeys(): # the keys of siblings are the 64 codons
    if siblings[cod] in grp_count:
        freq[cod] = float(cod_count[cod])/grp_count[siblings[cod]]
    else:
        freq[cod] = '-* Missing *-'


display = '\n'.join(aa.rjust(13)+\
                '\n'.join('%s  %-16s' % (cod.rjust(18 if i else 5),freq[cod])
                          for i,cod in enumerate(codgrp))
                for codgrp,aa in codon_table)


# editing addition:

def outputResults(filename,arn,codon_table,displ):

    li = ['This file is named %s' % filename]

    li.append('The sequence of ARN:\n%s' %\
              '\n'.join(arn[i:i+42] for i in xrange(0,len(arn),42)))
    li.append('Size of the sequence : '+str(len(arn)))

    li.append('Codon_table:\n'+\
              '\n'.join('%s : %s' % (u,v) for u,v in codon_table))

    li.append('Frequency results :\n'+displ)

    with open(filename,'w') as f:
        f.writelines('\n\n'.join(li))


outputResults('ARN_mem.txt',arn,codon_table,display)
print display 

.

编辑

我添加了一个函数outputResults(),用来展示如何将数据和结果记录到文件中

生成的文件内容是:

This file is named ARN_mem.txt

The sequence of ARN:
UCCCCCGCAGCUUCGGGAACGUGCGGGCUCGGGAGGGAGGGG
CCUGGCGCCGGGCGCGCGCCUGCGCCCCACCCCGCCCCACCC
UGGCGGGUCUCGCGCGCCCGGCCCGCCUCCUGUCAACCCCAG
CGCGGCGGUCAGGUGGUCCCCAGCCCUUGGCCCCAGCCUCCA
GCUUCCUGGUCCCUCGGGCUCUGAGUCCUGUCUCCGGCAGAU
CGCCUUUCUGAUUGUUCUCCUGCGCAGCUGGAGGUGUAUAGC
CCCUAGCCGAGCUAUGGUGCCUCAGCAGAUGUGAGGAGGUAG
UGGGUCAGGAUAAACCCGCGCACUCCAUAAUAACGUGCCAGG
GCUCAGUGACUUGGGUCUGCAUUA

Size of the sequence : 360

Codon_table:
('GCU', 'GCC', 'GCA', 'GCG') : Alanine
('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG') : Leucine
('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG') : Arginine
('AAA', 'AAG') : Lysine
('AAU', 'AAC') : Asparagine
('AUG',) : Methionine
('GAU', 'GAC') : Aspartic acid
('UUU', 'UUC') : Phenylalanine
('UGU', 'UGC') : Cysteine
('CCU', 'CCC', 'CCA', 'CCG') : Proline
('CAA', 'CAG') : Glutamine
('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC') : Serine
('GAA', 'GAG') : Glutamic acid
('ACU', 'ACC', 'ACA', 'ACG') : Threonine
('GGU', 'GGC', 'GGA', 'GGG') : Glycine
('UGG',) : Tryptophane
('CAU', 'CAC') : Histidine
('UAU', 'UAC') : Tyrosine
('AUU', 'AUC', 'AUA') : Isoleucine
('GUU', 'GUC', 'GUA', 'GUG') : Valine
('UAA', 'UGA', 'UAG') : STOP

Frequency results :
      Alanine  GCU  0.1875          
               GCC  0.375           
               GCA  0.25            
               GCG  0.1875          
      Leucine  UUA  0.125           
               UUG  0.0             
               CUU  0.25            
               CUC  0.375   
etc.............
2

如果你的序列是正确的阅读框:

>>> import collections
>>> 
>>> code = {     'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
...              'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
...              'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
...              'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',
...              'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
...              'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
...              'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
...              'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',
...              'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
...              'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
...              'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
...              'atg': 'M', 'acg': 'T', 'aag': 'K', 'agg': 'R',
...              'gtt': 'V', 'gct': 'A', 'gat': 'D', 'ggt': 'G',
...              'gtc': 'V', 'gcc': 'A', 'gac': 'D', 'ggc': 'G',
...              'gta': 'V', 'gca': 'A', 'gaa': 'E', 'gga': 'G',
...              'gtg': 'V', 'gcg': 'A', 'gag': 'E', 'ggg': 'G'
...         }
>>> def count_codons(cds):
...     counts = collections.defaultdict(int)
...     for i in range(0,len(cds),3):
...        codon = cds[i:i+3]
...        counts[codon] += 1
...     return counts
... 
>>> def translate(cds, code):
...     return "".join((code[cds[i:i+3]] for i in range(0, len(cds), 3)))
... 
>>> seq = 'ATGAAGAAA'
>>> 
>>> codon_counts = count_codons(seq.lower())
>>> trans_seq = translate(seq.lower(), code)
>>> 
>>> [(codon, code[codon], float(codon_counts[codon])/trans_seq.count(code[codon])) for codon in codon_counts.keys()]
[('atg', 'M', 1.0), ('aag', 'K', 0.5), ('aaa', 'K', 0.5)]
>>> 

其他信息:

我觉得你是在问一种叫做密码子使用的东西。

网上有一些工具可以帮助你找到密码子的使用情况。这个工具还支持离线使用。

http://www.bioinformatics.org/sms2/codon_usage.html

结果中(在这个“Fraction”就是你想要的内容):

Results for 9 residue sequence "sample sequence one" starting "ATGAAGAAA"
AmAcid   Codon     Number        /1000     Fraction   .. 

Ala      GCG         0.00         0.00         0.00 
Ala      GCA         0.00         0.00         0.00 
Ala      GCT         0.00         0.00         0.00 
Ala      GCC         0.00         0.00         0.00 

Cys      TGT         0.00         0.00         0.00 
Cys      TGC         0.00         0.00         0.00 

Asp      GAT         0.00         0.00         0.00 
Asp      GAC         0.00         0.00         0.00 

Glu      GAG         0.00         0.00         0.00 
Glu      GAA         0.00         0.00         0.00 

Phe      TTT         0.00         0.00         0.00 
Phe      TTC         0.00         0.00         0.00 

Gly      GGG         0.00         0.00         0.00 
Gly      GGA         0.00         0.00         0.00 
Gly      GGT         0.00         0.00         0.00 
Gly      GGC         0.00         0.00         0.00 

His      CAT         0.00         0.00         0.00 
His      CAC         0.00         0.00         0.00 

Ile      ATA         0.00         0.00         0.00 
Ile      ATT         0.00         0.00         0.00 
Ile      ATC         0.00         0.00         0.00 

Lys      AAG         1.00       333.33         0.50 
Lys      AAA         1.00       333.33         0.50 

Leu      TTG         0.00         0.00         0.00 
Leu      TTA         0.00         0.00         0.00 
Leu      CTG         0.00         0.00         0.00 
Leu      CTA         0.00         0.00         0.00 
Leu      CTT         0.00         0.00         0.00 
Leu      CTC         0.00         0.00         0.00 

Met      ATG         1.00       333.33         1.00 

Asn      AAT         0.00         0.00         0.00 
Asn      AAC         0.00         0.00         0.00 

Pro      CCG         0.00         0.00         0.00 
Pro      CCA         0.00         0.00         0.00 
Pro      CCT         0.00         0.00         0.00 
Pro      CCC         0.00         0.00         0.00 

Gln      CAG         0.00         0.00         0.00 
Gln      CAA         0.00         0.00         0.00 

Arg      AGG         0.00         0.00         0.00 
Arg      AGA         0.00         0.00         0.00 
Arg      CGG         0.00         0.00         0.00 
Arg      CGA         0.00         0.00         0.00 
Arg      CGT         0.00         0.00         0.00 
Arg      CGC         0.00         0.00         0.00 

Ser      AGT         0.00         0.00         0.00 
Ser      AGC         0.00         0.00         0.00 
Ser      TCG         0.00         0.00         0.00 
Ser      TCA         0.00         0.00         0.00 
Ser      TCT         0.00         0.00         0.00 
Ser      TCC         0.00         0.00         0.00 

Thr      ACG         0.00         0.00         0.00 
Thr      ACA         0.00         0.00         0.00 
Thr      ACT         0.00         0.00         0.00 
Thr      ACC         0.00         0.00         0.00 

Val      GTG         0.00         0.00         0.00 
Val      GTA         0.00         0.00         0.00 
Val      GTT         0.00         0.00         0.00 
Val      GTC         0.00         0.00         0.00 

Trp      TGG         0.00         0.00         0.00 

Tyr      TAT         0.00         0.00         0.00 
Tyr      TAC         0.00         0.00         0.00 

End      TGA         0.00         0.00         0.00 
End      TAG         0.00         0.00         0.00 
End      TAA         0.00         0.00         0.00 

cusp是来自EMBOSS的密码子使用工具,值得一看。

你可能还想了解一下BioPython,它可以用来处理生物序列。我相信他们有一个密码子使用的模块。

6

使用 defaultdict

from collections import defaultdict

mydict = defaultdict(int)

for aa in mysecuence:
    mydict[aa] +=1

这适用于氨基酸(蛋白质)。
对于密码子,你应该每次跳过三个位置来遍历序列,以获取 defaultdict 的键。例如:

>>> mysec = "GAUCACTUGCCA"
>>> a = [mysec[i:i+3] for i in range(0,len(mysec), 3)]
>>> print a


['GAU', 'CAC', 'TUG', 'CCA']

编辑:如果你想计算退化现象,你需要准备一个字典,将每个密码子(键)与它的退化密码子(值,密码子的列表)关联起来。为了计算频率,你可以从 defaultdict 中获取每个密码子的计数,然后对每个密码子,计算从上面提到的密码子字典中读取的退化密码子的计数总和。这样你就可以计算出频率。

编辑 2:这里有一个真实的例子:

from collections import defaultdict

#the first 600 nucleotides from GenBank: AAHX01097212.1
rna = ("tcccccgcagcttcgggaacgtgcgggctcgggagggaggggcctggcgccgggcgcgcg"
       "cctgcgccccaccccgccccaccctggcgggtctcgcgcgcccggcccgcctcctgtcaa"
       "ccccagcgcggcggtcaggtggtccccagcccttggccccagcctccagcttcctggtcc"
       "ctcgggctctgagtcctgtctccggcagatcgcctttctgattgttctcctgcgcagctg"
       "gaggtgtatagcccctagccgagctatggtgcctcagcagatgtgaggaggtagtgggtc"
       "aggataaacccgcgcactccataataacgtgccagggctcagtgacttgggtctgcatta")

seq = rna.upper().replace('T', 'U')

#RNA codon table from http://en.wikipedia.org/wiki/Genetic_code
degenerated = (('GCU', 'GCC', 'GCA', 'GCG'),
               ('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'),
               ('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
               ('AAA', 'AAG'), ('AAU', 'AAC'), ('GAU', 'GAC'),
               ('UUU', 'UUC'), ('UGU', 'UGC'), ('CCU', 'CCC', 'CCA', 'CCG'),
               ('CAA', 'CAG'), ('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'),
               ('GAA', 'GAG'), ('ACU', 'ACC', 'ACA', 'ACG'),
               ('GGU', 'GGC', 'GGA', 'GGG'), ('CAU', 'CAC'), ('UAU', 'UAC'),
               ('AUU', 'AUC', 'AUA'), ('GUU', 'GUC', 'GUA', 'GUG'),
               ('UAA', 'UGA', 'UAG'))

#prepare the dictio of degenerated codons
degen_dict = {}
for codons in degenerated:
    for codon in codons:
        degen_dict[codon] = codons

#query_codons
max_seq = len(seq)
query_codons = [seq[i:i+3] for i in range(0, max_seq, 3)]

#prepare dictio of counts:
counts = defaultdict(int)
for codon in query_codons:
    counts[codon] +=1

#actual calculation of frecuencies
data = {}
for codon in query_codons:
    if codon in  degen_dict:
        totals = sum(counts[deg] for deg in degen_dict[codon])
        frecuency = float(counts[codon]) / totals
    else:
        frecuency = 1.00

    data[codon] = frecuency

#print results
for codon, frecuency in data.iteritems():
    print "%s  -> %.2f" %(codon, frecuency)


#produces:
GUC  -> 0.57
AUA  -> 1.00
ACG  -> 0.50
AAC  -> 1.00
CCU  -> 0.25
UAU  -> 1.00
..........
GCU  -> 0.19
GAU  -> 1.00
UAG  -> 0.33
CUC  -> 0.38
UUA  -> 0.13
UGA  -> 0.33

撰写回答