How can I use Python to extract a subset of protein sequences from a large FASTA file?

Posted 2024-04-28 11:41:28

I have a list of protein IDs in a text file (Interested proteins.txt). I want to extract the sequences for those IDs from a .fasta file (swissprot_canonical-isoforms.fasta) and write that subset to a new file.

Part of swissprot_canonical-isoforms.fasta is shown below. In each line that starts with ">", the protein ID sits between the two "|" characters. For example, "P04637" is a protein ID.

>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens GN=TP53 PE=1 SV=4
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP
DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE
RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS
SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP
PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPG
GSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
>sp|P04637-2|P53_HUMAN Isoform 2 of Cellular tumor antigen p53 OS=Homo sapiens GN=TP53
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP
DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE
RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS
SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP
PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQDQTSFQKENC
>sp|P04637-3|P53_HUMAN Isoform 3 of Cellular tumor antigen p53 OS=Homo sapiens GN=TP53
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP
DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE
RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS
SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP
PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQMLLDLRWCYFLINSS

Here are some of the protein IDs listed in Interested proteins.txt:

Q6ZWH5
Q8NG66
P51955
P51957
P04629

The final output should look like this (only the sequence for Q6ZWH5 is shown as an example):

>sp|Q6ZWH5|NEK10_HUMAN Serine/threonine-protein kinase Nek10 OS=Homo sapiens GN=NEK10 PE=2 SV=3
MPDQDKKVKTTEKSTDKQQEITIRDYSDLKRLRCLLNVQSSKQQLPAINFDSAQNSMTKS
EPAIRAGGHRARGQWHESTEAVELENFSINYKNERNFSKHPQRKLFQEIFTALVKNRLIS
REWVNRAPSIHFLRVLICLRLLMRDPCYQEILHSLGGIENLAQYMEIVANEYLGYGEEQH
TVDKLVNMTYIFQKLAAVKDQREWVTTSGAHKTLVNLLGARDTNVLLGSLLALASLAESQ
ECREKISELNIVENLLMILHEYDLLSKRLTAELLRLLCAEPQVKEQVKLYEGIPVLLSLL
HSDHLKLLWSIVWILVQVCEDPETSVEIRIWGGIKQLLHILQGDRNFVSDHSSIGSLSSA
NAAGRIQQLHLSEDLSPREIQENTFSLQAACCAALTELVLNDTNAHQVVQENGVYTIAKL
ILPNKQKNAAKSNLLQCYAFRALRFLFSMERNRPLFKRLFPTDLFEIFIDIGHYVRDISA
YEELVSKLNLLVEDELKQIAENIESINQNKAPLKYIGNYAILDHLGSGAFGCVYKVRKHS
GQNLLAMKEVNLHNPAFGKDKKDRDSSVRNIVSELTIIKEQLYHPNIVRYYKTFLENDRL
YIVMELIEGAPLGEHFSSLKEKHHHFTEERLWKIFIQLCLALRYLHKEKRIVHRDLTPNN
IMLGDKDKVTVTDFGLAKQKQENSKLTSVVGTILYSCPEVLKSEPYGEKADVWAVGCILY
QMATLSPPFYSTNMLSLATKIVEAVYEPVPEGIYSEKVTDTISRCLTPDAEARPDIVEVS
SMISDVMMKYLDNLSTSQLSLEKKLERERRRTQRYFMEANRNTVTCHHELAVLSHETFEK
ASLSSSSSGAASLKSELSESADLPPEGFQASYGKDEDRACDEILSDDNFNLENAEKDTYS
EVDDELDISDNSSSSSSSPLKESTFNILKRSFSASGGERQSQTRDFTGGTGSRPRPALLP
LDLLLKVPPHMLRAHIKEIEAELVTGWQSHSLPAVILRNLKDHGPQMGTFLWQASAGIAV
SQRKVRQISDPIQQILIQLHKIIYITQLPPALHHNLKRRVIERFKKSLFSQQSNPCNLKS
EIKKLSQGSPEPIEPNFFTADYHLLHRSSGGNSLSPNDPTGLPTSIELEEGITYEQMQTV
IEEVLEESGYYNFTSNRYHSYPWGTKNHPTKR

Is there a way to do this in Python? Any help would be greatly appreciated.


1 answer
User
#1 · Posted 2024-04-28 11:41:28

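You can do this with pyfasta, a Python interface to FASTA files. The example below hard-codes a single target ID; an isoform such as P04637-2 has its own ID and is matched only if it is listed in targets as well.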

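from pyfasta import Fasta

f = Fasta('fasta.fa')  # open the FASTA file

targets = {"P04637"}  # the protein IDs you want to extract

selection = {}

for key in f:
    # Each key is a header such as "sp|P04637|P53_HUMAN ...";
    # the protein ID is the second "|"-separated field.
    candidate_key = key.split("|")[1]
    if candidate_key in targets:
        selection[key] = f[key][:]  # slice the record to get the whole sequence
        print(key)
        print(selection[key])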

Output:

sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens GN=TP53 PE=1 SV=4
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAP
TPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAM
AIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS
SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKK
KPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
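
The question also asks for the IDs to come from a text file and for the matches to be written to a new file. That can be sketched with the standard library alone, without pyfasta. This is a minimal sketch under some assumptions: the ID file holds one ID per line, headers follow the "sp|ID|NAME ..." layout shown in the question, and the file is streamed line by line so a large FASTA never has to fit in memory. The names load_ids, header_id and extract_records, the include_isoforms flag, and the output name selected.fasta are made up for illustration.

```python
def load_ids(path):
    """Read one protein ID per line, skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def header_id(header):
    """Return the field between the first two '|' of a header line, or None."""
    parts = header.split("|")
    return parts[1] if len(parts) >= 3 else None


def extract_records(fasta_path, ids, out_path, include_isoforms=False):
    """Copy records whose ID is in `ids` to out_path and return how many were copied.

    With include_isoforms=True, an isoform such as P04637-2 is also kept
    when its base ID (P04637) is in `ids`.
    """
    written = 0
    keep = False
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # A new record starts here: decide whether to keep it.
                accession = header_id(line)
                base = accession.split("-")[0] if accession else None
                keep = accession in ids or (include_isoforms and base in ids)
                if keep:
                    written += 1
            if keep:
                # Header and sequence lines are copied unchanged.
                dst.write(line)
    return written


# Example usage with the file names from the question
# ("selected.fasta" is a hypothetical output name):
# ids = load_ids("Interested proteins.txt")
# count = extract_records("swissprot_canonical-isoforms.fasta", ids, "selected.fasta")
# print(count, "records written")
```

Because the lines are copied as they are, the output keeps the ">" prefix and the original line wrapping, which matches the output format requested in the question.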
