Python - 计数核苷酸
我需要写一个程序来计算DNA序列的数量,还要统计每对碱基的数量。到目前为止,我写了这些:
class dnaString (str):
def __new__(self,s):
return str.__new__(self,s.upper())
def length (self):
return (len(self))
def getATCG (self,num_A,num_T,num_C,num_G):
num_A = self.count("A")
num_T = self.count("T")
num_C = self.count ("C")
num_G = self.count ("G")
return ( (self.length(), num_A, num_T, num_G, num_C) )
def printnum_A (self):
print ("Adenine base content: {0}".format(self.count("A")))
dna = input("Enter a dna sequence: ")
x=dnaString(dna)
这个程序其实没什么用,因为我刚开始学Python,不太知道该怎么修改才能让它正常工作。我还知道它没有完成,接下来我应该加些什么呢?
4 个回答
-1
你可以通过使用字典来整理和获取你的计数。举个例子:
DNASeq = raw_input("Enter a DNA sequence: ")
SeqLength = len(DNASeq)
print 'Sequence Length:', SeqLength
BaseKey = list(set(DNASeq)) #creates a list from the unique characters in the DNASeq
Dict = {}
for char in BaseKey:
Dict[char] = DNASeq.count(char)
print Dict
0
我觉得这个类可以稍微简化一下:
class DnaString(str):
def __new__(self, s):
return str.__new__(self, s.strip().upper())
def __init__(self, _):
self.num_A = self.count("A")
self.num_C = self.count("C")
self.num_G = self.count("G")
self.num_T = self.count("T")
def stats(self):
return len(self), self.num_A, self.num_C, self.num_G, self.num_T
然后
dna = raw_input("Enter a dna sequence: ")
d = DnaString(dna)
print(d)
print(d.stats())
结果是
Enter a dna sequence: ACGTACGTA
ACGTACGTA
(9, 3, 2, 2, 2)
0
这有帮助吗?我安装的Python 2.7.3和3.2.3都能正常运行。
import itertools
import sys
def pairwise(iterable):
"s -> (s0,s1), (s1,s2), (s2, s3), ..."
a, b = itertools.tee(iterable)
next(b, None)
if sys.version_info[0] > 2:
return zip(a,b)
return itertools.izip(a, b)
class DnaSequence():
Names = {
'A' : 'adenine',
'C' : 'cytosine',
'G' : 'guanine',
'T' : 'thymine'
}
Bases = Names.keys()
def __init__(self, seq):
self._string = seq
self.bases = { x:0 for x in DnaSequence.Bases }
self.pairs = { x+y:0 for x in DnaSequence.Bases
for y in DnaSequence.Bases }
for base in seq:
if base in self.bases:
self.bases[base] += 1
for x,y in pairwise(seq):
pair = x+y
if pair in self.pairs:
self.pairs[pair] += 1
def printCount(self, base):
if base in DnaSequence.Names:
print(DnaSequence.Names[base].capitalize() +
" base content: " + str(self.bases[base]))
else:
sys.stderr.write('No such base ("%s")\n' % base)
def __repr__(self):
return self._string
d = DnaSequence("CCTAGTGTTAGCTAGTCTAGGGAT")
for base in DnaSequence.Bases:
d.printCount(base)
# Further:
print(d)
print(d.bases)
print(d.pairs)
这是一个完整的例子,它会计算碱基(A、C、G、T)以及所有相邻对的出现次数(比如在ACCGTA这个字符串中,AC、CC、CG、GT、TA这几个对的出现次数都是1,而其他11种可能的组合在这里都是0)。
这里使用的计数方法是在构造函数中只扫描字符串一次,而不是每次调用getATGC()
时都扫描四次。
1
我不太明白你问的是什么,但因为你没有调用方法 `printnum_A`,所以什么都没有打印出来。如果你像这样调用它,就能正常工作:
dna = input("Enter a dna sequence: ")
x=dnaString(dna)
x.printnum_A()
根据评论更新
仅仅声明一个类的方法是不够的,你还需要在需要的时候调用它们。就像这里的 printnum_T
:
class dnaString (str):
def __new__(self,s):
return str.__new__(self,s.upper())
def length (self):
return (len(self))
def getATCG (self,num_A,num_T,num_C,num_G):
num_A = self.count("A")
num_T = self.count("T")
num_C = self.count ("C")
num_G = self.count ("G")
return ( (self.length(), num_A, num_T, num_G, num_C) )
def printnum_A (self):
print ("Adenine base content: {0}".format(self.count("A")))
# here the method is declared
def printnum_T (self):
print ("Adenine base content: {0}".format(self.count("T")))
dna = input("Enter a dna sequence: ")
x=dnaString(dna)
x.printnum_A()
# Here I call my method on `x`
x.printnum_T()