我正在尝试使用Smith–Waterman algorithm在Python中实现本地序列对齐。
这是我目前所拥有的。它甚至可以构建similarity matrix:
import sys, string
from numpy import *
f1=open(sys.argv[1], 'r')
seq1=f1.readline()
f1.close()
seq1=string.strip(seq1)
f2=open(sys.argv[2], 'r')
seq2=f2.readline()
f2.close()
seq2=string.strip(seq2)
a,b =len(seq1),len(seq2)
penalty=-1;
point=2;
#generation of matrix for local alignment
p=zeros((a+1,b+1))
# table calculation and matrix generation
for i in range(1,a+1):
for j in range(1,b+1):
vertical_score =p[i-1][j]+penalty;
horizontal_score= p[i][j-1]+penalty;
if seq1[i-1]==seq2[j-1]:
diagonal_score =p[i-1][j-1]+point;
else:
diagonal_score = p[i-1][j-1]+penalty;
p[i][j]=max(0,vertical_score,horizontal_score,diagonal_score);
print p
例如,对于两个序列:
agcacact
acacacta
我的代码输出相似矩阵:
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 2. 1. 2. 1. 2. 1. 0. 2.]
[ 0. 1. 1. 1. 1. 1. 1. 0. 1.]
[ 0. 0. 3. 2. 3. 2. 3. 2. 1.]
[ 0. 2. 2. 5. 4. 5. 4. 3. 4.]
[ 0. 1. 4. 4. 7. 6. 7. 6. 5.]
[ 0. 2. 3. 6. 6. 9. 8. 7. 8.]
[ 0. 1. 4. 5. 8. 8. 11. 10. 9.]
[ 0. 0. 3. 4. 7. 7. 10. 13. 12.]]
现在我陷入了算法的下一步,回溯构造最佳对齐。
To obtain the optimum local alignment, we start with the highest value in the matrix (i, j). Then, we go backwards to one of positions (i − 1, j), (i, j − 1), and (i − 1, j − 1) depending on the direction of movement used to construct the matrix. We keep the process until we reach a matrix cell with zero value, or the value in position (0, 0).
我很难决定要回到哪个位置。维基百科所说的“取决于用来构建矩阵的移动方向”是什么意思?我如何在Python中实现它?
我终于做到了
import sys, string
from numpy import*
import re
# read first sequence
fasta_sequence1=open(sys.argv[1], 'r')
seq1=""
for i in fasta_sequence1:
if i.startswith(">"):
pass
else:
seq1 = seq1 + i.strip()
fasta_sequence1.close()
fasta_sequence2=open(sys.argv[2], 'r')
seq2 = ""
for i in fasta_sequence2:
if i.startswith('>'):
pass
else:
seq2 = seq2+ i.strip()
fasta_sequence2.close()
a,b =len(seq1),len(seq2)
penalty=-1;
point=2;
#generation of matrix for local alignment
p=zeros((a+1,b+1))
#intialization of max score
max_score=0;
#pointer to store the traceback path
pointer=zeros((a+1,b+1))
# table calculation and matrix generation
for i in range(1,a+1):
for j in range(1,b+1):
vertical_score =p[i-1][j]+penalty;
horizontal_score= p[i][j-1]+penalty;
if seq1[i-1]==seq2[j-1]:
diagonal_score =p[i-1][j-1]+point;
else:
diagonal_score = p[i-1][j-1]+penalty;
for i in range(1,a+1):
for j in range(1,b+1):
p[i][j]=max(0,vertical_score,horizontal_score,diagonal_score);
for i in range(1,a+1):
for j in range(1,b+1):
if p[i][j]==0:
pointer[i][j]=0; #0 means end of the path
if p[i][j]==vertical_score:
pointer[i][j]=1; #1 means trace up
if p[i][j]==horizontal_score:
pointer[i][j]=2; #2 means trace left
if p[i][j]==diagonal_score:
pointer[i][j]=3; #3 means trace diagonal
if p[i][j]>=max_score:
maximum_i=i;
maximum_j=j;
max_score=p[i][j];
#for i in range(1,a+1):
# for j in range(1,b+1):
#if p[i][j]>max_score:
#max_score=p[i][j]
print max_score
# traceback of all the pointers
align1,align2='',''; #initial sequences
i,j=max_i,max_j; #indices of path starting point
while pointer[i][j]!=0:
if pointer[i][j]==3:
align1=align1+seq1[i-1];
align2=align2+seq2[j-1];
i=i-1;
j=j-1;
elif pointer[i][j]==2:
align1=align1+'-';
align2=align2+seq2[j-1]
j=j-1;
elif pointer[i][j]==1:
align1=align1+seq1[i-1];
align2=align2+'-';
i=i-1;
align1=align1[::-1]; #reverse sequence 1
align2=align2[::-1]; #reverse sequence 2
#output_file = open(sys.argv[3],"w")
#output_file.write(align1)
#output_file.write(align2)
print align1
print align2
但我认为有更好更有效的方法
output_file = open(sys.argv[3],"w")
output_file.write(align1)
output_file.write(align2)
结果显示
agcacacta-cacact
与此相反的是:
print align1
print align2
显示正确的输出:
agcacact
a-cacact
如何在文件编写器中获得上述输出。谢谢
在构建相似度矩阵时,不仅需要存储相似度得分,还需要存储该得分来自的位置。您当前有一行代码:
所以这里你需要记住的不仅仅是最大得分,而是其中哪个是最大得分。当你来做回溯时,你就会知道该往哪个方向走。
例如,您可以尝试以下方法:
相关问题 更多 >
编程相关推荐