How to efficiently read a large XML file and create custom objects (Biopython SeqIO)

import gzip
from Bio import SeqIO

file = "/Users/john/workspace/project-2/resources/uniprot_sprot_small.xml.gz"

def load_uniprot_records():
    records = []
    handle = gzip.open(file)
    for record in SeqIO.parse(handle, "uniprot-xml"):
        records.append(record)
        print(record)

if __name__ == "__main__":
    load_uniprot_records()

ID: Q6GZX4
Name: 001R_FRG3G
Description: Putative transcription factor 001R
Database cross-references: DOI:10.1016/j.virol.2004.02.019, EMBL:AY548484, GO:GO:0046782, GeneID:2947773, InterPro:IPR007031, KEGG:vg:2947773, NCBI Taxonomy:654924, Pfam:PF04947, Proteomes:UP000008770, PubMed:15165820, RefSeq:YP_031579.1, Swiss-Prot:001R_FRG3G, Swiss-Prot:Q6GZX4, SwissPalm:Q6GZX4
Number of features: 2
/dataset=Swiss-Prot
/created=2011-06-28
/modified=2019-06-05
/version=37
/accessions=['Q6GZX4']
/recommendedName_fullName=['Putative transcription factor 001R']
/gene_name_ORF=['FV3-001R']
/taxonomy=['Viruses', 'Iridoviridae', 'Alphairidovirinae', 'Ranavirus']
/organism=Frog virus 3 (isolate Goorha) (FV-3)
/organism_host=['Ambystoma', 'mole salamanders', 'Dryophytes versicolor', 'chameleon treefrog', 'Lithobates pipiens', 'Northern leopard frog', 'Rana pipiens', 'Notophthalmus viridescens', 'Eastern newt', 'Triturus viridescens', 'Rana sylvatica', 'Wood frog']
/references=[Reference(title='Comparative genomic analyses of frog virus 3, type species of the genus Ranavirus (family Iridoviridae).', ...)]
/comment_function=['Transcription activation.']
/proteinExistence=['predicted']
/keywords=['Activator', 'Complete proteome', 'Reference proteome', 'Transcription', 'Transcription regulation']
/type=['ECO:0000305']
/key=['1']
/sequence_length=256
/sequence_mass=29735
/sequence_checksum=B4840739BF7D4121
/sequence_modified=2004-07-19
/sequence_version=1
Seq('MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVE...TPL', ProteinAlphabet())

from Bio.SeqRecord import SeqRecord  # needed for the SeqRecord constructor below

def load_uniprot_records():
    file = "/Users/john/workspace/practical-2/resources/uniprot_sprot_small.xml.gz"
    seq_records = []
    handle = gzip.open(file)
    for record in SeqIO.parse(handle, "uniprot-xml"):
        # Copy only the fields of interest into a fresh SeqRecord.
        seq_record = SeqRecord(seq=record.seq, id=record.id, name=record.name,
                               description=record.description,
                               annotations=record.annotations)
        seq_records.append(seq_record)
    return seq_records

1 Answer

User

#1 · Posted 2024-04-29 11:43:04

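Yes, SeqRecords are very complete representations of a sequence, but they are not efficient. The most efficient data structure is probably a plain dictionary keyed by ID, whose values are tuples with fixed positions (name, taxonomy, seq_length). You could also use a namedtuple, but I think those perform slightly worse. You then don't even need to store the SeqRecord: extract the relevant information and discard it: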

seq_records = {}  # keys remain ordered since Python 3.7 :)
for record in SeqIO.parse(handle, "uniprot-xml"):
    name = record.name
    description = record.description
    seq_length = len(record)
    seq_records[record.id] = (name, description, seq_length)
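
For comparison, the namedtuple variant mentioned above could look roughly like the sketch below. It is not part of the original answer: `ProteinSummary` and `summarize` are hypothetical names, and it assumes each record exposes `id`, `name`, `len()` and a `"taxonomy"` entry in `annotations`, as the printed record in the question does.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class ProteinSummary(NamedTuple):
    """Hypothetical container; field names are illustrative only."""
    name: str
    taxonomy: Tuple[str, ...]
    seq_length: int


def summarize(records: Iterable) -> Dict[str, ProteinSummary]:
    """Keep three fields per record so the full record can be discarded."""
    summaries = {}
    for record in records:
        summaries[record.id] = ProteinSummary(
            name=record.name,
            # Assumes a "taxonomy" annotation is present; falls back to an
            # empty tuple when the parser did not provide one.
            taxonomy=tuple(record.annotations.get("taxonomy", ())),
            seq_length=len(record),
        )
    return summaries
```

With Biopython this would be called as `summarize(SeqIO.parse(handle, "uniprot-xml"))`, and a field is then read by name, for example `summaries["Q6GZX4"].seq_length`. Whether the named access is worth the possible overhead compared with plain tuples would have to be measured on the real file.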

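(I don't remember the exact attribute names; you can look them up with dir(record) or in the linked documentation, but you know that already.) This will make later data processing very efficient. Do you also need the parsing of the file itself to be efficient? There may be faster approaches than SeqIO, such as ElementTree (see "How do I parse XML in Python?"), but unless this is a time-critical step I wouldn't bother.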
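
If parsing does turn out to be the slow step, a streaming pass with ElementTree is one option. The following is a minimal sketch, not a tested replacement for SeqIO: it assumes every `<entry>` element has `<accession>`, `<name>` and `<sequence>` as direct children (check this against your own file), it ignores XML namespaces by comparing local tag names only, and `iter_entries` and `load_summaries` are hypothetical helper names.

```python
import gzip
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix: '{uri}entry' -> 'entry'."""
    return tag.rsplit("}", 1)[-1]


def iter_entries(source):
    """Yield (accession, name, seq_length) for each <entry> in the stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "entry":
            continue
        accession = name = seq_length = None
        for child in elem:  # direct children of <entry> only
            tag = local(child.tag)
            if tag == "accession" and accession is None:
                accession = child.text  # keep the first (primary) accession
            elif tag == "name" and name is None:
                name = child.text
            elif tag == "sequence":
                length_attr = child.get("length")
                if length_attr is not None:
                    seq_length = int(length_attr)
                else:
                    # No length attribute: count residues, ignoring whitespace.
                    seq_length = len("".join((child.text or "").split()))
        elem.clear()  # drop the entry's children to keep memory use low
        yield accession, name, seq_length


def load_summaries(path):
    """Build {accession: (name, seq_length)} from a gzip-compressed XML file."""
    with gzip.open(path, "rb") as handle:
        return {acc: (name, length) for acc, name, length in iter_entries(handle)}
```

Because `iterparse` hands over each entry as soon as its closing tag is read, only one entry's subtree is populated at a time. Whether this is actually faster than `SeqIO.parse` for a given file is something to time rather than assume.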
