用Python/Biopython计数DNA序列

3 投票

1 回答

1464 浏览

提问于 2025-04-18 00:52

下面这个脚本是用来计算标准FASTA文件中出现的'CCCCAAAA'和'GGGGTTTT'这两个序列的次数：

>contig00001  
CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT

这个脚本在这里统计了CCCAAAA序列出现了3次

CCCCAAAACCCCAAAACCCCAAAA（这里的CCCC没有被计算在内）

有没有人能告诉我，如何把最后的CCCC序列算作半次，这样我就能得到3.5的结果。

到目前为止，我的尝试都没有成功。

我的脚本如下...

from Bio import SeqIO

input_file = open('telomer.test.fasta', 'r')
output_file = open('telomer.test1.out.tsv','w')
output_file.write('Contig\tCCCCAAAA\tGGGGTTTT\n')

for cur_record in SeqIO.parse(input_file, "fasta") :


    contig = cur_record.name
    CCCCAAAA_count = cur_record.seq.count('CCCCAAAA')
    CCCC_count = cur_record.seq.count('CCCC')

    GGGGTTTT_count = cur_record.seq.count('GGGGTTTT')
    GGGG_count = cur_record.seq.count('GGGG')
    #length = len(cur_record.seq)

    splittedContig1=contig.split(CCCCAAAA_count)

    splittedContig2=contig.split(GGGGTTTT_count)

    cnt1=len(splittedContig1)-1
    cnt2=len(splittedContig2)

  cnt1+sum([0.5 for e in splittedContig1 if e.startswith(CCCC_count)])) = CCCCAAAA_count
  cnt2+sum([0.5 for e in splittedContig2 if e.startswith(GGGG_count)])) = GGGGTTTT_count

    output_line = '%s\t%i\t%i\n' % \
    (CONTIG, CCCCAAAA_count, GGGGTTTT_count)


    output_file.write(output_line)

output_file.close()

input_file.close()

模式匹配数据分析 DNA序列生物信息学 biopython fasta文件序列计数

1 个回答

你可以使用分割和列表推导式来实现，方法如下：

contig="CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT"
splitbase="CCCCAAAA"
halfBase="CCCC"
splittedContig=contig.split(splitbase)
cnt=len(splittedContig)-1
print cnt+sum([0.5 for e in splittedContig if e.startswith(halfBase)])

输出结果：

3.5

根据 CCCCAAAA 来分割字符串。这样会得到一个列表，列表中的元素会去掉 CCCCAAAA。
分割后的长度减去 1 就能得到 CCCCAAAA 出现的次数。
在分割后的元素中，查找以 CCCC 开头的元素。如果找到，就为每次出现加 0.5。

回答于 2025-04-18 由 Python大师

分享举报

用Python/Biopython计数DNA序列

1 个回答

撰写回答