Formatting FASTA sequences with regular expressions

3 votes
1 answer
1399 views
Asked 2025-04-16 16:51

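Dear fellow bioinformaticians!

I am trying to write a Python script that reads a file containing sequences in a non-FASTA format, converts each sequence to FASTA format, and then writes all of the sequences to a single output file.

For example, here are two unformatted sequences that need to be converted to FASTA: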

Unformatted sequence 1

1 tcacatctct acgtactgaa tttaaaggct ttttgtcttt ttctcgtttc tttgcttttc 
61 aatgatgttc aagcgtaacc tcggaaaatg tgtacaaact tgagtacaaa tcgccatatt 

and

Unformatted sequence 2

1 tcaggagaat gcagatgaca gcagtagcgc accaagtaac cccttttcta acgtcttacg 
61 aagttatggc tcgttaccac attagctata cgacgctctg gcgaagaata aaagatggca

I would like to convert them into this:

>seq1
TCACATCTCTACGTACTGAATTTAAAGGCTTTTTGTCTTTTTCTCGTTTCTTTGCTTTTC
AATGATGTTCAAGCGTAACCTCGGAAAATGTGTACAAACTTGAGTACAAATCGCCATATT
TACCGTTTTTAGCCAAATTCCATGACACAAACCTAGCTGTAGGCCTTGTTCCTACTGGGT
TTTAGCCAAAACTTGCCTATATTTTTTATGCCAAAAATCGAGAAATGATGGTAAGACGTT
CGCGATTATCTCTAATTGTTTGCCGGTTGAGTTGGTTACCGGTTGCTTTCTTGCTGTCC

>seq2
TCAGGAGAATGCAGATGACAGCAGTAGCGCACCAAGTAACCCCTTTTCTAACGTCTTACG
AAGTTATGGCTCGTTACCACATTAGCTATACGACGCTCTGGCGAAGAATAAAAGATGGCA
GCTTGCCGCAACCTCGTATCAACCGAAATACACGAAACAAGCTGTGGCACATTGAAGACT
TGGAGGAGTATGAGAAGAATTAGGAATAGATAGCGTAGCTTAGTTTTTCTGTTGGAGCTT
GGACTAACGCTTTGAAACGCCGGCTTGTGCCAACAATATAGTTAATATGTACACCAACTT
AGGCTAAGATAGCAGCATGGATTTTTTATTGATTGGATGGATAGGTAAGTGACGACTCCT
CAAGAACGGACAACAGGTATTACAAATGCGTCGATAAAAA

This is what I have so far:

import re


def cleanandFormat(filename, seqName, seq):
    """
    Writes an irregularly formatted sequence to a file, cleaning it and
    formatting it into the standard FASTA form.
    inputs:
        filename - string of a filename
        seqName - string of sequence description
        seq - string of the sequence
    output: clean and standard-formatted data written to a file.
    """
    # maximum number of characters per sequence line
    blockLength = 60
    with open(filename, 'w') as fh:
        # write out the header line with the sequence name
        fh.write('>' + seqName + '\n')
        for i in range(0, len(seq), blockLength):
            fh.write(seq[i:i+blockLength].upper() + '\n')


# the pattern matches any digit or any whitespace character
pattern = r'\d|\s'
# each match is replaced with an empty string
replace = ''

seq = ''
filename = 'seqCleanup2.txt'
with open(filename) as fh:
    for line in fh:
        seq += re.sub(pattern, replace, line)
# write once, after the whole file has been read; note that every line
# of the input ends up in one single record named seqX
cleanandFormat('testfasta.txt', 'seqX', seq)
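To make the effect of the `\d|\s` pattern concrete, here is a minimal sketch that applies it to the first line of unformatted sequence 1 from above. The variable names `line` and `cleaned` exist only for this illustration.

```python
import re

# First line of "unformatted sequence 1" from the question,
# including the trailing space and newline.
line = "1 tcacatctct acgtactgaa tttaaaggct ttttgtcttt ttctcgtttc tttgcttttc \n"

# Remove every digit and every whitespace character (newline included).
cleaned = re.sub(r"\d|\s", "", line)

print(cleaned.upper())
print(len(cleaned))
```

The position number, the spaces between the ten-base blocks, and the newline are all removed, leaving the 60 bases of that line. Because the pattern removes every digit, it cannot tell where one sequence ends and the next begins, so the position numbers have to be inspected before they are stripped.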

1 Answer

0

I don't know Python very well, but here is some Ruby code (untested). You only need to download Ruby, save this code to a file, and run it:

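count = 0
while line = gets
  # a line numbered 1 marks the start of a new sequence
  if(line =~ /^1\s[a-z\s]+$/)
    count += 1
    puts
    puts ">seq#{count}"
  end
  # for every numbered line, drop the number and the whitespace
  if(line =~ /^\d+\s([a-z\s]+)$/)
    puts $1.gsub(/\s/, "").upcase
  end
end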
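Since the question asks for Python, the same idea can be sketched roughly as follows. This is an illustration under assumptions, not a tested port: the function name `to_fasta` and its `width` parameter are made up here, and the rule that a line numbered `1` starts a new record is inferred from the example input only.

```python
import re


def to_fasta(lines, width=60):
    """Convert numbered sequence lines into FASTA text.

    Assumption (taken from the example input): a line whose leading
    position number is 1 starts a new record. Lines that do not look
    like "<number> <bases>" are skipped.
    """
    records = []  # one cleaned sequence string per record
    for line in lines:
        match = re.match(r"^\s*(\d+)\s+([a-zA-Z\s]+)$", line)
        if not match:
            continue  # blank lines, labels, anything unexpected
        if match.group(1) == "1" or not records:
            records.append("")
        # Drop the whitespace between blocks and uppercase the bases.
        records[-1] += re.sub(r"\s", "", match.group(2)).upper()

    out = []
    for number, seq in enumerate(records, start=1):
        out.append(f">seq{number}")
        # Wrap the sequence at `width` characters per line.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(text + "\n" for text in out)


# The four numbered lines from the question, plus a blank line and a label.
sample = [
    "1 tcacatctct acgtactgaa tttaaaggct ttttgtcttt ttctcgtttc tttgcttttc \n",
    "61 aatgatgttc aagcgtaacc tcggaaaatg tgtacaaact tgagtacaaa tcgccatatt \n",
    "\n",
    "Unformatted sequence 2\n",
    "1 tcaggagaat gcagatgaca gcagtagcgc accaagtaac cccttttcta acgtcttacg \n",
    "61 aagttatggc tcgttaccac attagctata cgacgctctg gcgaagaata aaagatggca\n",
]
print(to_fasta(sample), end="")
```

For the sample lines this prints two records, `>seq1` and `>seq2`, each followed by two 60-character lines. An open file handle can be passed in place of the list, since the function only iterates over lines. Unlike the Ruby version, this sketch does not print a blank line before each header, and it collects each record in memory before wrapping it.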
