Python - 计数核苷酸

1 投票
4 回答
6749 浏览
提问于 2025-04-18 01:38

我需要写一个程序来计算DNA序列的数量,还要统计每对碱基的数量。到目前为止,我写了这些:

class dnaString (str): 
    def __new__(self,s): 
        return str.__new__(self,s.upper()) 
    def length (self): 
        return (len(self)) 
    def getATCG (self,num_A,num_T,num_C,num_G): 
        num_A = self.count("A") 
        num_T = self.count("T") 
        num_C = self.count ("C") 
        num_G = self.count ("G") 
        return ( (self.length(), num_A, num_T, num_G, num_C) ) 

    def printnum_A (self): 
        print ("Adenine base content: {0}".format(self.count("A"))) 

dna = input("Enter a dna sequence: ") 
x=dnaString(dna) 

这个程序其实没什么用,因为我刚开始学Python,不太知道该怎么修改才能让它正常工作。我还知道它没有完成,接下来我应该加些什么呢?

4 个回答

-1

你可以通过使用字典来整理和获取你的计数。举个例子:

    DNASeq = raw_input("Enter a DNA sequence: ")

    SeqLength = len(DNASeq)

    print 'Sequence Length:', SeqLength

    BaseKey = list(set(DNASeq)) #creates a list from the unique characters in the DNASeq

    Dict = {}

    for char in BaseKey:
        Dict[char] = DNASeq.count(char)
    print Dict
0

我觉得这个类可以稍微简化一下:

class DnaString(str): 
    def __new__(self, s): 
        return str.__new__(self, s.strip().upper())

    def __init__(self, _):
        self.num_A = self.count("A") 
        self.num_C = self.count("C") 
        self.num_G = self.count("G") 
        self.num_T = self.count("T") 

    def stats(self):
        return len(self), self.num_A, self.num_C, self.num_G, self.num_T

然后

dna = raw_input("Enter a dna sequence: ") 
d = DnaString(dna)

print(d)
print(d.stats())

结果是

Enter a dna sequence: ACGTACGTA
ACGTACGTA
(9, 3, 2, 2, 2)
0

这有帮助吗?我安装的Python 2.7.3和3.2.3都能正常运行。

import itertools
import sys

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = itertools.tee(iterable)
    next(b, None)
    if sys.version_info[0] > 2:
        return zip(a,b)
    return itertools.izip(a, b)

class DnaSequence():
    Names = {
        'A' : 'adenine',
        'C' : 'cytosine',
        'G' : 'guanine',
        'T' : 'thymine'
    }
    Bases = Names.keys()


    def __init__(self, seq):
        self._string = seq
        self.bases = { x:0 for x in DnaSequence.Bases }
        self.pairs = { x+y:0 for x in DnaSequence.Bases 
                             for y in DnaSequence.Bases }

        for base in seq:
            if base in self.bases:
                self.bases[base] += 1

        for x,y in pairwise(seq):
            pair = x+y
            if pair in self.pairs:
                self.pairs[pair] += 1


    def printCount(self, base):
        if base in DnaSequence.Names:
            print(DnaSequence.Names[base].capitalize() + 
                  " base content: " + str(self.bases[base]))
        else:
            sys.stderr.write('No such base ("%s")\n' % base)


    def __repr__(self):
        return self._string


d = DnaSequence("CCTAGTGTTAGCTAGTCTAGGGAT")
for base in DnaSequence.Bases:
    d.printCount(base)

# Further:
print(d)
print(d.bases)
print(d.pairs)

这是一个完整的例子,它会计算碱基(A、C、G、T)以及所有相邻对的出现次数(比如在ACCGTA这个字符串中,AC、CC、CG、GT、TA这几个对的出现次数都是1,而其他11种可能的组合在这里都是0)。

这里使用的计数方法是在构造函数中只扫描字符串一次,而不是每次调用getATGC()时都扫描四次。

1

我不太明白你问的是什么,但因为你没有调用方法 `printnum_A`,所以什么都没有打印出来。如果你像这样调用它,就能正常工作:

dna = input("Enter a dna sequence: ") 
x=dnaString(dna) 
x.printnum_A()

根据评论更新

仅仅声明一个类的方法是不够的,你还需要在需要的时候调用它们。就像这里的 printnum_T

class dnaString (str): 
    def __new__(self,s): 
        return str.__new__(self,s.upper()) 
    def length (self): 
        return (len(self)) 
    def getATCG (self,num_A,num_T,num_C,num_G): 
        num_A = self.count("A") 
        num_T = self.count("T") 
        num_C = self.count ("C") 
        num_G = self.count ("G") 
        return ( (self.length(), num_A, num_T, num_G, num_C) ) 

    def printnum_A (self): 
        print ("Adenine base content: {0}".format(self.count("A"))) 

    # here the method is declared
    def printnum_T (self): 
        print ("Adenine base content: {0}".format(self.count("T")))

dna = input("Enter a dna sequence: ") 
x=dnaString(dna) 
x.printnum_A()
# Here I call my method on `x`
x.printnum_T()

撰写回答