How can I quickly remove duplicate sequences from a large FASTA file in Python?

Posted 2024-05-13 10:45:47


I downloaded some FASTA files from the NCBI database; each contains more than 10,000 sequences.

The files look like this:

>lcl|AY289593.1_prot_AAQ74417.1_1 [protein=FabH-like protein] [protein_id=AAQ74417.1] [location=complement(<1..775)] [gbkey=CDS]
MRPINDIQVDGVPNDHTIVQSDYISFTEADEPATVMATRAATEALTTSELVSADVGVLIYAAIIGDAHHF
APVCHVQRVLRAPDALAFELSAASNGGTQGIAVAANLMTADAPVKAALVCTAYRHPIDIISRWSSGMVFG
DGAAAAVLSRDGGMVRLISGYHGSLPELEVLARNRSNERLGFVLPDVGLGKYLTAIARMYQAVIAQVLEE
AQTSIAEIDYFGLIGIGIPSLTATILEPNGIPVNKTSWGLLRQMGHVG
>lcl|AY289593.1_prot_AAQ74418.1_2 [protein=type I polyketide synthase loading module] [protein_id=AAQ74418.1] [location=4126..>6747] [gbkey=CDS]
MLGDAVAVVGMSCRVPGASDPDALWALLRDGISVVDEIPSARWNLDGLVAHRLTDEQRSALRHGAFLDDV
EGFDAAFFGINPSEAGSMDPQQRLMLELTWAALEDARIVPEHLSGSSSGVFTGAMSDDYTTAVTYRAAMT
AHTFAGTHRSLIANRVSYTLGLRGPSLVIDTGQSSSLVAVHVAMESLRREETSLAIAGGIHLNLSLAAAL
SAAHFGALSPDGRCYTFDARANGYVRGEGGGVVVLKRLNDALADGNHIYCVIRGSSVNNDGATQDLTAPG
VDGQRQALLQAYERAEIDPSEVQYVELHGTGTRLGDPTEAHSLHSVFGTSTVPRSPLLVGSIKTNIGHLE
GAAGILGLIKTALAVHHRQLPPSLNYTVPNPKIPLEQLGLRVQTTLSEWPDLDKPLTAGVSSFSMGGTNA
HLILQQPPTPDTTQTPNPTTGSDPAVGSDSAVGSDPAVGVLVWPLSARSAPGLSAQAARLYQHLSAHPDL
DPIDVAHSLATTRSHHPHRATITTSIEHHSENNHDTTDALAALHALANNGTHPLLSRGLLTPQGPGKTVF
VFPGQGSQYPGMGADLYRQFPVFAHALDEVAAALNPHLDVALLEVMFSQQDTAMAQLLDQTFYAQPALFA
LGTALHRLFTHAGIHPDYLLGHSIGELTAAYAAGVLSLQDAATLVTSRGRLMQSCTPGGTMLALQASEAE
VQPLLEGLDHAVSIAAINGATSIVLSGDHDSLEQIGEHFITQDRRTTRLQVSHAFHSPHMDPILEQFRQI
AAQLTFSAPTLPILSNLTGQIARHDQLASPDYWTQQLRNTVRFHDTVAALLGAGEQVFLELSPHPVLTQA
ITDTVEQAGGGGAAVPALRKDRPDAVAFAAALGQ

>lcl|AY289596.1_prot_AAQ74421.1_1 [protein=type I polyketide synthase extender module] [protein_id=AAQ74421.1] [location=<1..>4439] [gbkey=CDS]
DTACSSSLVAIHLACQSLRNNESQLALAGGVTVMSTPAVFTEFSRQRGLAPDGRCKAFAATADGTGFGEG
AAVLVLERLSEARRNNHPVLAIVAGSAINQDGASNGLTAPHGPSQQRVINQALANAGLTHDQVDAVEAHG
TGTTLGDPIEAGALHATYGHHHTPDQPLWLGSIKSNIGHTQAAAGAVGVVKMIQAITHATLPATLHVDQP
GPHIDWSSGTVRLLTEPIQWPNTNHPRTAAVSSFGISGTNAHLILQQPPTPNPTQTPEDCSPAQSPCATI
TDAGTGLSFVPWVISAKSAEALSAQASRLLTRLDDDPVVDAIDLGWSLIATRSMFEHRAVVVGADRHQLQ
RGLAELASGNLGADVVVGRARAAGETVMVFPGQGSQRLGMGAQLYEQFPVFAAAFDDVVDALDQYLRLPL
RQVMWGDDEGLLNSTEFAQPSLFAVEVALFALLRFWGVVPDYVIGHSVGELAAAQVAGVLSLQDAAKLVS
ARGRLMQALPAGGAMVAVAASQHEVEPLLVEGVDIAALNAPGSVVISGDQAAVRLIANRLADRGYRAHEL

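I have not listed the complete file here because it is very large. It contains duplicate records (note the string that follows "prot" in each header), so I wrote a script like this to remove them: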

^{pr2}$

It works, but it is very slow.

I think there must be a smarter way to do this. Can anyone with more experience help? Thank you!
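For reference, one common way to make this kind of de-duplication fast is to stream the file once and remember the keys already seen in a `set`, so each record costs a constant-time lookup instead of a scan over everything read so far. The sketch below is only an illustration under stated assumptions, not the original script: it assumes a "duplicate" means two records whose headers carry the same accession after `_prot_` (for example `AAQ74417.1`), and the names `PROT_ID`, `dedupe_fasta`, and `dedupe_file` are hypothetical. If duplicates should be defined by the sequence itself, the key would need to change.

```python
import re
from typing import Iterable, Iterator

# Assumption: the key is the accession between "_prot_" and the trailing
# "_<number>" in the header, e.g. "AAQ74417.1" in
# ">lcl|AY289593.1_prot_AAQ74417.1_1 [protein=...]".
PROT_ID = re.compile(r"_prot_(\S+?)_\d+(?:\s|$)")


def dedupe_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield the lines of the first record seen for each key, in input order."""
    seen = set()   # keys already written; set membership is O(1) on average
    keep = False   # whether the record currently being read should be kept
    for line in lines:
        if line.startswith(">"):
            match = PROT_ID.search(line)
            # Fall back to the whole header when the pattern does not match.
            key = match.group(1) if match else line.strip()
            keep = key not in seen
            seen.add(key)
        # Blank lines and anything before the first header are dropped.
        if keep and line.strip():
            yield line


def dedupe_file(src: str, dst: str) -> None:
    """Stream src to dst without loading the whole file into memory."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(dedupe_fasta(fin))
```

Calling `dedupe_file("input.fasta", "unique.fasta")` (both file names are placeholders) would write one record per key. Only the set of keys is held in memory, so the approach should scale to files much larger than 10,000 sequences.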


Tags: file, id, database, type, ncbi, location, fasta, module