提取氨基酸周围的fasta序列
我正在尝试写一个Python脚本,从一个给定的氨基酸周围提取12个氨基酸的序列(每边各6个)来处理一个fasta序列。
输入
我有两个输入:一个fasta文件和一个pandas数据框。
这个fasta文件的格式如下:
> sp|P00001| some text here 1
MKLLILTCLVAVALARPKHPIKKVSPTFDTNMVGKHQGLPQEVLNENLLRFFVAPFPEVFGKEKVSLDAGPGMCSRNE
>sp|P00002| some text here 2
MSSGNAKIGHPAPNFKATAVMPDGQFKDISLSDYKGKYVVFFFYPLDFTFVCPTGLGRSSYRATSCLPALCLP
>sp|P00003| some text here 3
MSVLDSGNFSWKMTEACMKVKIPLVKKKSLRQNLIENGKLKEFMRTHKYNLGSKYIREAATLVSEQPLQN
这是我的第二个输入,一个pandas数据框(有两列 'ProteinID' 和 'Phosphopeptide')
ProteinID -- Phosphopeptide
P00001 -- KVSPT*FDTNMVGK
P00001 -- SLDAGPGMCS*R
P00003 -- LDS*GNFSWKMTEACMK
目标
我需要做的事情是这样的。对于每个 'Phosphopeptide',我需要在fasta文件的头部找到对应的蛋白质(ProteinID),这个头部是以'>'开头的。然后,我需要提取在带有星号标记的氨基酸前后各6个氨基酸。
输出
我的输出是一个新列,写入到数据框中,格式如下:
ProteinID -- Phosphopeptide -- NewColumn
P00001 -- KVSPT*FDTNMVGK -- IKKVSPTFDTNMV
P00001 -- SLDAGPGMCS*R -- AGPGMCSRNE
P00003 -- LDS*GNFSWKMTEACMK -- MSVLDSGNFSWK
请注意,最后两行包含的肽段位于各自蛋白质的开头或结尾,因此在这些情况下我们无法提取到12个氨基酸。
我在写这个程序时遇到了一些困难(编程经验不多),非常希望能得到任何帮助(建议、技巧、函数等)。
2 个回答
0
这里有一个函数,可以提取出相关的子字符串:
def flank(seq, pp):
# 1: find the position of the AA preceding the '*' marker in the
# phosphopeptide
marked_pos = pp.find('*') - 1
if (marked_pos < 0):
raise ValueError("invalid phosphopeptide string")
# 2: find the phosphopeptide (without '*') in the sequence
pp_pos = s.find(pp.replace('*', ''))
if pp_pos == -1:
raise ValueError("phosphopeptide not found in the sequence")
# avoid a negative starting index
start = max(0, pp_pos + marked_pos - 6)
# 3: use slicing to produce the result
return seq[start : pp_pos + marked_pos + 7]
举个例子:
seq = "MKLLILTCLVAVALARPKHPIKKVSPTFDTNMVGKHQGLPQEVLNENLLRFFVAPFPEVFGKEKVSLDAGPGMCSRNE"
pp = "KVSPT*FDTNMVGK"
print(flank(seq, pp))
输出结果是:
IKKVSPTFDTNMV
0
你好,请看看这个:我的fasta文件叫做'txt':
代码片段:
#!/usr/bin/python
import re
protein_dict = [
('P00001', 'KVSPT*FDTNMVGK'),
('P00001', 'SLDAGPGMCS*R'),
('P00003', 'LDS*GNFSWKMTEACMK')
]
protein_id = None
def prepare_structure_from_fasta(file):
fasta_structure = dict()
with open(file, 'r') as fh:
for line in fh:
if '>' in line:
protein_id = line.split('|')[1]
else:
if not protein_id:
raise Exception("Wrong fasta file structure")
fasta_structure[protein_id] = line.strip()
return fasta_structure
def match(pattern, string):
matc = re.search(pattern, string)
if matc:
return matc.groups()[0]
return None
fasta_struct = prepare_structure_from_fasta('txt')
final_struct = []
for pro_d in protein_dict:
pro_id = pro_d[0]
pep_id = pro_d[1]
first, second = pep_id.split('*')
if len(first) <= 6:
f_count = 7 - len(first)
else:
first = first[len(first) - 7:]
f_count = 0
if len(second) <= 6:
s_count = 7 - len(second)
else:
second = second[0:6]
s_count = 0
_regex = '([A-Z]{0,%d}%s%s[A-Z]{0,%d})' % (f_count,first,second,s_count)
final_struct.append((pro_id, pep_id, match(_regex, fasta_struct[pro_id])))
for pro in final_struct:
print pro
输出结果:
('P00001', 'KVSPT*FDTNMVGK', 'IKKVSPTFDTNMV')
('P00001', 'SLDAGPGMCS*R', 'AGPGMCSRNE')
('P00003', 'LDS*GNFSWKMTEACMK', 'MSVLDSGNFSWK')