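Formatting FASTA sequences using regular expressions
Dear fellow bioinformaticians!
I'm trying to write a Python script that reads a file containing sequences in a non-FASTA format, converts them to FASTA format, and writes all of them to a single output file.
For example, here are two unformatted sequences that need to be converted to FASTA format:
Unformatted sequence 1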
1 tcacatctct acgtactgaa tttaaaggct ttttgtcttt ttctcgtttc tttgcttttc
61 aatgatgttc aagcgtaacc tcggaaaatg tgtacaaact tgagtacaaa tcgccatatt
and
Unformatted sequence 2
1 tcaggagaat gcagatgaca gcagtagcgc accaagtaac cccttttcta acgtcttacg
61 aagttatggc tcgttaccac attagctata cgacgctctg gcgaagaata aaagatggca
I would like to convert them to this:
>seq1
TCACATCTCTACGTACTGAATTTAAAGGCTTTTTGTCTTTTTCTCGTTTCTTTGCTTTTC
AATGATGTTCAAGCGTAACCTCGGAAAATGTGTACAAACTTGAGTACAAATCGCCATATT
TACCGTTTTTAGCCAAATTCCATGACACAAACCTAGCTGTAGGCCTTGTTCCTACTGGGT
TTTAGCCAAAACTTGCCTATATTTTTTATGCCAAAAATCGAGAAATGATGGTAAGACGTT
CGCGATTATCTCTAATTGTTTGCCGGTTGAGTTGGTTACCGGTTGCTTTCTTGCTGTCC
>seq2
TCAGGAGAATGCAGATGACAGCAGTAGCGCACCAAGTAACCCCTTTTCTAACGTCTTACG
AAGTTATGGCTCGTTACCACATTAGCTATACGACGCTCTGGCGAAGAATAAAAGATGGCA
GCTTGCCGCAACCTCGTATCAACCGAAATACACGAAACAAGCTGTGGCACATTGAAGACT
TGGAGGAGTATGAGAAGAATTAGGAATAGATAGCGTAGCTTAGTTTTTCTGTTGGAGCTT
GGACTAACGCTTTGAAACGCCGGCTTGTGCCAACAATATAGTTAATATGTACACCAACTT
AGGCTAAGATAGCAGCATGGATTTTTTATTGATTGGATGGATAGGTAAGTGACGACTCCT
CAAGAACGGACAACAGGTATTACAAATGCGTCGATAAAAA
So far, I have this:
import re


def cleanandFormat(filename, seqName, seq):
    """
    Write a sequence from an irregular format to a file, cleaning it
    and formatting it as standard FASTA.

    inputs:
        filename - string of a filename
        seqName - string of sequence description
        seq - string of the sequence
    output: clean and standard-formatted data written to a file.
    """
    # maximum number of characters per line
    blockLength = 60
    with open(filename, 'w') as fh:
        # write out the header and sequence name
        fh.write('>' + seqName + '\n')
        for i in range(0, len(seq), blockLength):
            fh.write(seq[i:i+blockLength].upper() + '\n')


# pattern matching any digit or any whitespace character
pattern = r'\d|\s'
# every match is replaced with an empty string
replace = ''
seq = ''
filename = 'seqCleanup2.txt'
with open(filename) as fh:
    for line in fh:
        seq += re.sub(pattern, replace, line)
cleanandFormat('testfasta.txt', 'seqX', seq)
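As a quick check of what the pattern does, here is a minimal sketch that applies it to the first sample line from above. The variable names `line` and `cleaned` are only for illustration.

```python
import re

# Same pattern as in the script: any digit or any whitespace character.
pattern = r'\d|\s'

# First line of "Unformatted sequence 1", including its trailing newline.
line = "1 tcacatctct acgtactgaa tttaaaggct ttttgtcttt ttctcgtttc tttgcttttc\n"

# Strip the line number, the spaces, and the newline, then upper-case.
cleaned = re.sub(pattern, '', line).upper()
print(cleaned)
print(len(cleaned))
```

Because every line is appended to the same string, this approach on its own keeps no record of where one sequence ends and the next begins.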
1 Answer
I don't know Python well, but here is some Ruby code (untested). Just install Ruby, save the code to a file, and run it with your input file as its argument:
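count = 0
while line = gets
  # a line numbered 1 starts a new sequence, so print a new header
  if line =~ /^1\s[a-z\s]+$/
    count += 1
    puts
    puts ">seq#{count}"
  end
  # for every numbered line, drop the whitespace and upper-case the bases
  if line =~ /^\d+\s([a-z\s]+)$/
    puts $1.gsub(/\s/, "").upcase
  end
end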
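For anyone who would rather stay in Python, the same idea can be sketched roughly as follows. This is a loose port of the Ruby logic above, not a tested tool. The names `NUMBERED_LINE`, `parse_records`, and `format_fasta` are made up for this illustration. Unlike the Ruby version, the pattern tolerates leading whitespace before the line number, no blank line is printed between records, and each sequence is re-wrapped at 60 characters.

```python
import re

# Assumption: every sequence line looks like "<number> <lowercase bases ...>".
NUMBERED_LINE = re.compile(r'^\s*(\d+)\s+([a-z\s]+)$')


def parse_records(lines):
    """Return a list of upper-case sequences, one per record.

    Assumption: a line whose number is 1 starts a new record.
    """
    records = []
    for line in lines:
        match = NUMBERED_LINE.match(line)
        if match is None:
            continue  # skip titles, separators, and blank lines
        if match.group(1) == '1' or not records:
            records.append('')
        records[-1] += re.sub(r'\s', '', match.group(2)).upper()
    return records


def format_fasta(records, block_length=60):
    """Format sequences as FASTA text with headers seq1, seq2, ..."""
    out = []
    for number, seq in enumerate(records, start=1):
        out.append(f'>seq{number}')
        out.extend(seq[i:i + block_length]
                   for i in range(0, len(seq), block_length))
    return '\n'.join(out) + '\n'


# Small demonstration using the sample lines from the question.
sample_lines = [
    "Unformatted sequence 1\n",
    "1 tcacatctct acgtactgaa tttaaaggct ttttgtcttt ttctcgtttc tttgcttttc\n",
    "61 aatgatgttc aagcgtaacc tcggaaaatg tgtacaaact tgagtacaaa tcgccatatt\n",
    "Unformatted sequence 2\n",
    "1 tcaggagaat gcagatgaca gcagtagcgc accaagtaac cccttttcta acgtcttacg\n",
    "61 aagttatggc tcgttaccac attagctata cgacgctctg gcgaagaata aaagatggca\n",
]
print(format_fasta(parse_records(sample_lines)), end='')
```

To process a real file, pass the open file handle to `parse_records` and write the string returned by `format_fasta` to the output file in one go, so that all records end up in a single file.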