序列中的模式计数
我有一个文本文件,里面有一些序列,格式如下:
>
P1
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFP MEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQ FFETRPEDLNPPKEEHIGKKKSGNDPTSVDPMVLEQYVVVADYQKQESSEISLSVGQVVD IIEKNESGWWFVSTAEEQGWVPATCLEGQDGVQDEFSLQPEEEEKYTVIYPYTARDQDEM NLERGAVVEVVQKNLEGWWKIRYQGKEGWAPASYLKKNSGEPLPPKLGPSSPAHSGALDL DGVSRHQNAMGREKELLNNQRDGRFEGRLVPDGDVKQRSPKMRQRPPPRRDMTIPRGLNL
>
P2
MAEVRKFTKRLSKPGTAAELRQSVSEAVRGSVVLEKAKLVEPLDYENVITQRKTQIYSDP LRDLLMFPMEDISISVIGRQRRTVQSTVPEDAEKRAQSLFVKECIKTYSTDWHVVNYKYE DFSGDFRMLPCKSLRPEKIPNHVFEIDEDCEKDEDSSSLCSQKGGVIKQGWLHKANVNST ITVTMKVFKRRYFYLTQLPDGSYILNSYKDEKNSKESKGCIYLDACIDVVQCPKMRRHAF ELKMLDKYSHYLAAETEQEMEEWLIMLKKIIQINTDSLVQEKKDTVEAIQEEETSSQGKA ENIMASLERSMHPELMKYGRETEQLNKLSRGDGRQNLFSFDSEVQRLDFSGIEPDVKPFE EKCNKRFMVNCHDLTFNILGHIGDNAKGPPTNVEPFFINLALFDVKNNCKISADFHVDLN PPSVREMLWGTSTQLSNDGNAKGFSPESLIHGIAESQLCYIKQGIFSVTNPHPEIFLVVR
>
P3
GDDSEWLKLPVDQKCEHKLWKARLSGYEEALKIFQKIKDEKSPEWSKYLGLIKKFVTDS NAVVQLKGLEAALVYVENAHVAGKTTGEVVSGVVSKVFNQPKAKAKELGIEICLMYVEIE KGESVQEELLKGLDNKNPKIIVACIETLRKALSEFGSKIISLKPIIKVLPKLFESRDKAV RDEAKLFAIEIYRWNRDAVKHTLQNINSVQLKELEEEWVKLPTGAPKPSRFLRSQQELEA KLEQQQSAGGDAEGGGDDGDEVPQVDAYELLDAVEILSKLPKDFYDKIEAKKWQERKEAL EAVEVLVKNPKLEAGDYADLVKALKKVVGKDTNVMLVALAAKCLTGLAVGLRKKFGQYAG HVVPTILEKFKEKKPQVVQALQEAIDAIFLTTTLQNISEDVLAVMDNKNPTIKQQTSLFI ARSFRHCTSSTLPKSLLKPFCAALLKHINDSAPEVRDAAFEALGTALKVVGEKSVNPFLA
......
总共有100个序列。在这些序列中,我用Python脚本搜索了一个我感兴趣的模式,代码如下:
import re
infile=open("seq.fasta",'r')
out=open("results.csv",'w')
pattern=re.compile(r"(P[A-Z]{2}P)")
for line in infile:
line = line.strip("\n")
if line.startswith('>'):
name=line
else:
s = re.findall(pattern,line)
print '%s:%s' %(name,s)
out.write('%s:\t%s\n' %(name,s))
这个脚本运行得很好,找到了我想要的模式……现在我想统计每个序列中这个模式出现的次数,脚本的输出结果如下:
>
p1 : PGCP
>
p1 : PHCP, PKCP ......
但是我想要的输出格式是这样的:
p1 : 1
>
p1 : 2 ......
有没有人能告诉我怎么用Python实现这个?
1 个回答
findall
方法会返回一个匹配字符串的列表。所以你可以在代码中直接使用 len(s)
,而不是 s
。
out.write('%s:\t%s\n' %(name,len(s)))