频率不相加是不相等的

for fasta in seq_file: freq = {} dna = str(fasta.seq) for base1 in ['A', 'T', 'G', 'C']: onefreq = float(dna.count(base1)) / len(dna) freq[base1] = onefreq for base2 in ['A', 'T', 'G', 'C']: dinucleotide = base1 + base2 twofreq = float(dna.count(dinucleotide)) / (len(dna) - 1) freq[dinucleotide] = twofreq frequency.append(freq)

0.0894168466523 0.0760259179266 0.0946004319654 0.0561555075594 0.0431965442765 0.0423326133909 0.0747300215983 0.0488120950324 0.0976241900648 0.0483801295896 0.0539956803456 0.0423326133909 0.0863930885529 0.0419006479482 0.0190064794816 0.031101511879 (0.9460043196544274, 1.0) 2316

3条回答

网友

1楼 · 编辑于 2024-05-16 06:02:35

您的问题，请尝试以下fasta：

>test
AAAAAA

^{pr2}$

你会得到：

应该是的

原因

从documentation:count返回字符串s中子字符串sub的出现次数（非重叠）[开始：结束]

解决方案使用Counter和chunk函数

from Bio import SeqIO

def chunks(l, n):
  for i in xrange(0, len(l)-(n-1)):
    yield l[i:i+n]

from collections import Counter

frequency = []
input_file = "test.fasta"
for fasta in SeqIO.parse(open(input_file), "fasta"):
  dna = str(fasta.seq)
  freq = Counter(dna)   #get counter of single bases
  freq.update(Counter(chunks(dna,2))) #update with counter of dinucleotide
  frequency.append(freq)

对于“AAAAAA”，您可以得到：

Counter({'A': 6, 'AA': 5})

网友

2楼 · 编辑于 2024-05-16 06:02:35

事实上，你扫描字符串的次数远远超过了你需要的20次。对于小的测试序列来说，这可能并不重要，但是当它们变得更大时，这将是显而易见的。我建议采用一种不同的方法来解决重叠带来的副作用问题：

nucleotides = [ 'A', 'T', 'G', 'C' ]
dinucleotides = [ x+y for x in nucleotides for y in nucleotides ]
counts = { x : 0 for x in nucleotides + dinucleotides }

# count the first nucleotide, which has no previous one
n_nucl = 1
prevn = dna[0]
counts[prevn] += 1

# count the rest, along with the pairs made with each previous one
for nucl in dna[1:]:
    counts[nucl] += 1
    counts[prevn + nucl] += 1
    n_nucl += 1
    prevn = nucl

total = 0.0
for nucl in nucleotides:
    pct = counts[nucl] / float(n_nucl)
    total += pct
    print "{} : {} {}%".format(nucl, counts[nucl], pct)
print "Total : {}%".format(total) 

total = 0.0
for dnucl in dinucleotides:
    pct = counts[dnucl] / float(n_nucl - 1)
    total += pct
    print "{} : {} {}%".format(dnucl, counts[dnucl], pct)
print "Total : {}%".format(total)

这种方法只扫描一次字符串，尽管它是公认的更多的代码。。。在

网友

3楼 · 编辑于 2024-05-16 06:02:35

在str.计数不要数重叠的图案。在

示例：

如果你的序列中有'AAAA'，并且你在寻找二核苷酸'AA'，你期望'AAAA'。count（'AA'）会返回3，但它会返回2。所以：

print float('AAAA'.count('AA')) / (len('AAAA') - 1)
0.666666

而不是1

您可以通过以下方式更改计算频率的行：

^{pr2}$

相关问题更多 >

编程相关推荐

热门问题

热门文章