如何使用re.findall遍历SeqIO创建的列表，FASTA输入

0 投票

1 回答

45 浏览

数据工程师

提问于 2025-04-13 20:53

我在这个问题上花了太多时间了，已经超过10个小时了。

输入是一个fasta格式的文件。输出应该是一个文本文件，里面包含基因ID和匹配的模式（有三种不同的模式）。

我本来想自己写一个函数，这样就不用重复写三遍相同的代码，但现在我放弃了，只好写了三遍（而且运行得很好）。

有没有办法用这个代码：

records = list(SeqIO.parse('mytextfile.fasta', 'fasta'))

来替代我现在重复使用的三遍代码（在下面）或者用其他的函数？这是一个学校的作业，所以也不应该太复杂，但我必须使用Bio和re模块来解决这个问题。

from Bio import SeqIO
import re

outfile = 'sekvenser.txt'

for seq_record in SeqIO.parse('prot_sequences.fasta', 'fasta'):
    match = re.findall(r'W.P', str(seq_record.seq), re.I)
    if match:       
        with open(outfile, 'a') as f:
            record_string = str(seq_record.id)
            newmatch = str(match)
            result = record_string+'\t'+newmatch
            print(result)
            f.write(result + '\n')

我试过这个

records = list(SeqIO.parse('prot_sequences.fasta', 'fasta'))
new_list = []
i = r'W.P'

for i in records:
    match = re.findall(i)
    if match:       
        new_list.append(match)

print(new_list)

但它只告诉我findall()缺少一个必需的位置参数：'string'。

在我看来，i是一个字符串（因为我定义了这个变量）。显然我做错了什么。如果我尝试插入我在其他代码中使用的seq_record，它告诉我seq_record没有定义。我不明白在代码中i后面应该放什么。

text processing regex fasta seqio bioinformatics gene identification pattern matching coding efficiency

1 个回答

输入文件 prot_sequences.fasta :

>one
WWWWWWCCCCCPPPPPP
>two
SSSSSSSSRRRRRRRRTTTTWWWWWWDDDDDDMMMMM
>three
QQQQQQQQQWAPTCCCCCCCWYPGGGGGGGGGGGGGG

代码 :

import Bio

from Bio import SeqIO

import re


print('biopython version : ', Bio.__version__)

records = list(SeqIO.parse('prot_sequences.fasta', 'fasta'))
new_list = []
i = r'W.P'

#print(i)

for rec in records:
    
    #print('rec : ', rec, rec.seq , type(rec.seq)
    match = re.findall(i, str(rec.seq) )
    if match:       
        new_list.append(match)

print(new_list)

输出 :

biopython version :  your Biopython Version
[['WAP', 'WYP']]

如果你取消注释这行：#print('rec : ', rec, rec.seq , type(rec.seq)

你会发现 rec.seq 是一个 <class 'Bio.Seq.Seq'> 的对象，所以不适合用作传给 re.findall(pattern, string, flags=0) 的参数。

records = [str(rec.seq) for rec in SeqIO.parse('prot_sequences.fasta', 'fasta')]

print( records, type(records))

虽然这样可以得到一个字符串列表，但你会失去 rec.id，这是你作业需要的。

可以查看 Bio.SeqIO.parse(handle, format, alphabet=None) 来了解这个对象的属性，它可以：

将一个序列文件转换为一个返回 SeqRecords 的迭代器。

你可以这样做：

records = [[rec.id , re.findall(i, str(rec.seq) , re.I)] for rec in SeqIO.parse('prot_sequences.fasta', 'fasta') if len(re.findall(i, str(rec.seq) , re.I)) >= 1]

print(records)

得到：

[['three', ['WAP', 'WYP']]]

但我不确定这是否是最好的方法：实际上很难阅读，我也不确定它是否快，因为（不太确定，但它使用了 re.findall 两次，见 Python 会自动优化/缓存函数调用吗？）；所以为了更快但更难看，可以使用：

records = [[rec.id , a] for rec in SeqIO.parse('prot_sequences.fasta', 'fasta') if len(a:= re.findall(i, str(rec.seq) , re.I) ) >= 1]

print(records)

输出：

[['three', ['WAP', 'WYP']]]

关于使用 list()，下面的几种写法会得到相同的结果：

records = list([[rec.id , re.findall(i, str(rec.seq) , re.I)] for rec in SeqIO.parse('prot_sequences-2.fasta', 'fasta') if len(re.findall(i, str(rec.seq) , re.I)) >= 1])

print(records)


records = list([rec.id , a] for rec in SeqIO.parse('prot_sequences-2.fasta', 'fasta') if len(a:= re.findall(i, str(rec.seq) , re.I) ) >= 1)

print(records)

结果：

[['three', ['WAP', 'WYP']]]
[['three', ['WAP', 'WYP']]]

回答于 2025-04-13 由 Python大师

分享举报

如何使用re.findall遍历SeqIO创建的列表，FASTA输入

1 个回答

撰写回答