Creating a table from SeqIO.parse in Python

Posted 2024-06-01 00:44:10


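I asked a similar question before, but the earlier solutions did not solve my problem. I have been puzzling over this and testing various things, but none of them worked.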

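I have a fasta file with more than 500 sequences, and I need to build a table from it, so I am trying to write a script to do it instead of doing it by hand with copy and paste. I am reading the file with Biopython: seq = SeqIO.parse(handle, "fasta")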

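For each sequence I want to know the species the protein belongs to, the protein name, and the Uniprot ID. When I parse the fasta file with SeqIO, I notice that not much information can be extracted from it.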

Here is a subset of my fasta file:

>gi|194757291|ref|XP_001960898.1| GF11270 [Drosophila ananassae] >gi|190622196|gb|EDV37720.1| GF11270 [Drosophila ananassae]
MSAARTSQDCDCTAKCRLRQHGNTITAALTKRSISSQNLAAFVYKTCGNFANILDDLGRSAVHMSASTGRYEILEWLLNH
GAYINGQDYESGSSPLHRALYYGSIDCAVLLLRYGASMELLDEDTCCPLQAICRKCDVDDFATDSQNDVLVWGSNKNYNL
GVGSEQNTNAPQSVDFFRKSNIWIEQVALGAYHSLFLDKKGHLYAVGHGKGGRLGTGGENTLPAPKRVKVSSKLGSEDSI
RCISVSRQHSLVLTHRSLVFACGLNSDCQLGVRDAPEHLAQFKEVVALRDKGASDLVRVIACDQHSIAYGSRCVYVWGAN
QGQFGISANIASIVVPTLIKLPARTTIRFVEANNAATVIYSEEKMIYLYYAEKTRAIKTPNYEDLKSISVMGGHIKNSAK
GSAAALKLLMLTETNVVYLWYENTQQFYRCNFLPIRLPQIKKILYKCNQVMVLSEDGCVYRGKCNQIALPASELQEKSRP
NLDIWQNNDQNRTEISREHVIRIELQRVPNIDRAVDISCDEGFSSFAVLQESQGKYFRKPTLPRKEHSFKKLLHDTSDCD
AVHDVVFHVDGEKYPAHKYIIYSRAPGLRELVRMYLDKDIYLNFENLTGKMFELVLKHIYTNYWPTEDDIDCIQQSLGPA
NPQNRSRTCQMFLPHLEKFQLTELAKYVKSYVQDHQFPLPSARQRLPRLHRSDYPELYDVKIKCEDGQVLQAHKCMLVAR
LEYFEMMFMHSWAERSSVTMEGVPAEYMEPVLDYLYSLEAEAFCKQAYLETFLYNMITICDQYFIESLQNLCELLILDKI
SIRKCGEMLEFATMYNCKLLLKGCMDFICQNLARVLCYRSIEQCDGETLKCLNDHYRNMFSRVFDYRQITPFSEAIEDEL
LLSFIDGLEVDLEYRMDAESKAKQAAKTKQKDLRKLNARHQYEQRAISSMMRSISISESNPAPEVATSPQESARSETNNW
SRVIDKKEQKRKQAETALKVNKTLKQETSPEPEMVPIERTPVNEQTPPPLSPETEPSTPLNKSYNLDFSSLTPQSQKLSQ
KQRKRLSSESKSWRGNSSALLESPTTPVPVPNAWGVTTTPSSSFNDSYTSPTTGSSSDPTSFANMMRSQAASSSATSKDQ
SQNFSKILADERRQRESYERMRNKSLVHTQIEETAIAELREFYNVDNIDDEKITIARKSRPSDINFSTWIRQ
>gi|198456847|ref|XP_001360463.2| GA20796 [Drosophila pseudoobscura pseudoobscura] >gi|198135774|gb|EAL25038.2| GA20796 [Drosophila pseudoobscura pseudoobscura]
MSTAKAQEYDCTAKCTCRQHGNSITAALTKRSIDNQNLGAFIFKTCGNFANIIDDLGRSAVHMSASVARYEILEWLLNHG
AYINGLDYESGSSPLHRALYYGSIDCAVLLLRYGASLELLDEDTRCPLQAICRKCDEDFTTESQNDVLVWGSNKNYNLGI
GNEQNTNAPQAVDFFRKSNIWIEQVALGAYHSLFCDKKGHLYAVGHGKGGRLGIGVENSLPAPKRVKVSSKLNDDSIMCI
SVSRQHSLLLTRRSLVFACGINTDHQLGVRDAPENLTQFREVVALRDKGASDLLRVIACDQHSIAYSTKCVYVWGANQGQ
FGISRTTDTIMAPTLIKLPARTSIRFVEANNAATVIYTEEKMITLFYGDKTRYIKTPNYEDLKSIAVIGGHLKSSTKGSA
AALKLLMLTETNVVFLWYENTQQFYRCNFSPIRLPEIKKILYKCNQVLILSLDGCVYRGKCNQIALPAGILEEKSKPNMD
IWHNNDQNRTEISREHVIRIELQRVPNIDRATDIFCDESFSSFAVLQESHMKYFRKPPLPRREHNFKKLYHDTCESDAVH
DVVFHVDGERFAAHKFILYSRAPGLRELTRIYLDKDVYLNFENLTGKMFELILKYIYTSYWPTEDDIDCIQESLGPANPR
ERSRACEMFIPHLEMFQLVDLARYLQSYVRDNQFPIPSTRQRFNRLHRSDYPELYDVRIVCEDSKVLEAHKCMLVSRLEY
FEMMFTHSWAERTTVNMEGVPAEYMEPVLDYLYSLDTEAFCKQNYTETFLYNMVTFCDQYFIESLQNVCESLILDKISIR
KCGEMLDFAAMYNCKLLHKGCMDFICHNLARVLCYRSIEQCDEATLKCLNDHYRKMFSNVFDYRQITPFSEAIEDELLLS
FVVDCDIDLDYRMDPETKLKAAAKHKQKDLRRQDARHYYEQQAISSMMRSLSVSESASGPEATTGPQESTRSEGKNWSRV
VDKKEQKRKLADTALKVNNTLKLEEPPRPELEVIERALMKEQTPPPTSPAEETSTPLSKSYNLDLSSLTPQSQKLSQKQR
KRLSSESKSWRSPLVEQEPTTPVAVPNAWGLPPATPSSSSFTDSPATGSISDPTSFANMMRGQAAAATTPTEKGQSFSRI
LADERRQRESFERMRNKSLAHTQIEETAIAELREFYNVDNTDDETITIERKSRPTDINFSTWLKH
>gi|355695434|gb|AES00009.1| inhibitor of Bruton agammaglobulinemia tyrosine kinase [Mustela putorius furo]
KPGNKLKLNQKKCSFLCDVTMKSVDGKEFTCHKCVLCARLEYFHSMLSSSWIEASTCTALEMPIHSDILKVILDYLYTDE
AVVIKESQNVDFVCSVLVVADQLLITRLKGMCEVALTEKLTLKNAAMILEFAAMYNAEQLKLSCLQFIGLNM

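Is there a way to get the protein name, the Uniprot ID and the organism from these sequences? For example, I thought of taking the ID from the sequence description and then using it to search GenBank, but I don't think that is possible, because not all of the sequences have a GenBank ID. Any suggestions? Any help would be greatly appreciated.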

Example of the expected output:

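As a rough sketch of one way to approach this, the header lines in the sample follow the NCBI pattern `gi|number|database|accession| protein name [organism]`, so the accession, protein name and organism can be pulled out with a regular expression. The function names `parse_header` and `build_rows`, the column names, and the pattern itself are assumptions made for illustration, and the pattern is only checked against the three headers shown in the question. Note that these headers carry NCBI gi numbers and RefSeq/GenBank accessions rather than Uniprot IDs, so the Uniprot ID would need a separate lookup (for instance through an ID mapping service), which is not shown here.

```python
import csv
import re
import sys

# Assumed pattern for one NCBI-style entry, e.g.
# "gi|194757291|ref|XP_001960898.1| GF11270 [Drosophila ananassae]".
# The organism is taken from the last [...] group in the entry.
ENTRY_RE = re.compile(
    r"gi\|(?P<gi>\d+)\|(?P<db>\w+)\|(?P<accession>[^|\s]+)\|\s*"
    r"(?P<protein>.*)\s+\[(?P<organism>[^\]]+)\]\s*$"
)


def parse_header(description):
    """Return a dict of fields for the first entry in a header, or None."""
    # Merged headers repeat " >gi|..." once per source database;
    # only the first entry is kept here.
    first = description.lstrip(">").split(" >")[0].strip()
    match = ENTRY_RE.match(first)
    return match.groupdict() if match else None


def build_rows(descriptions):
    """Turn header strings into (accession, protein, organism) tuples."""
    rows = []
    for description in descriptions:
        fields = parse_header(description)
        if fields is not None:  # headers that do not fit the pattern are skipped
            rows.append(
                (fields["accession"], fields["protein"], fields["organism"])
            )
    return rows


# Headers copied from the fasta subset in the question (sequences omitted).
EXAMPLE_HEADERS = [
    "gi|194757291|ref|XP_001960898.1| GF11270 [Drosophila ananassae] "
    ">gi|190622196|gb|EDV37720.1| GF11270 [Drosophila ananassae]",
    "gi|198456847|ref|XP_001360463.2| GA20796 "
    "[Drosophila pseudoobscura pseudoobscura] "
    ">gi|198135774|gb|EAL25038.2| GA20796 "
    "[Drosophila pseudoobscura pseudoobscura]",
    "gi|355695434|gb|AES00009.1| inhibitor of Bruton agammaglobulinemia "
    "tyrosine kinase [Mustela putorius furo]",
]

# Write a tab-separated table to standard output.
writer = csv.writer(sys.stdout, delimiter="\t", lineterminator="\n")
writer.writerow(["accession", "protein", "organism"])
writer.writerows(build_rows(EXAMPLE_HEADERS))
```

With Biopython, the strings passed to `build_rows` would be the `record.description` of each record yielded by `SeqIO.parse(handle, "fasta")`. Headers that do not follow the `gi|...` pattern are silently skipped in this sketch, so for a real file it is probably worth counting or logging them.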
