How do I parse an XML file with identically named child tags in Python?

Posted 2024-05-26 20:46:18


<?xml version="1.0"?>
<BioSampleSet>
  <BioSample accession="SAMN01347139" id="1347139" submission_date="2012-09-21T22:44:26.843" last_update="2012-09-21T22:44:26.843" publication_date="2012-09-21T22:44:26.843" access="controlled-access">
    <Ids>
      <Id is_primary="1" db="BioSample">SAMN01347139</Id>
      <Id db="dbGaP" is_hidden="1" db_label="Sample name">44-21834</Id>
    </Ids>
    <Description>
      <Title>DNA sample from a human male participant in the dbGaP study "Framingham SHARe Thyroid and Hormone Data"</Title>
      <Organism taxonomy_name="Homo sapiens" taxonomy_id="9606"/>
    </Description>
    <Owner>
      <Name abbreviation="NCBI"/>
    </Owner>
    <Models>
      <Model>Generic</Model>
    </Models>
    <Package display_name="Generic">Generic.1.0</Package>
    <Attributes>
      <Attribute display_name="gap accession" harmonized_name="gap_accession" attribute_name="gap_accession">phs000044</Attribute>
      <Attribute display_name="submitter handle" harmonized_name="submitter_handle" attribute_name="submitter handle">Framingham_SHARe</Attribute>
      <Attribute display_name="biospecimen repository" harmonized_name="biospecimen_repository" attribute_name="biospecimen repository">Framingham_SHARe</Attribute>
      <Attribute display_name="study name" harmonized_name="study_name" attribute_name="study name">Framingham SHARe Thyroid and Hormone Data</Attribute>
      <Attribute display_name="biospecimen repository sample id" harmonized_name="biospecimen_repository_sample_id" attribute_name="biospecimen repository sample id">21834</Attribute>
      <Attribute display_name="submitted sample id" harmonized_name="submitted_sample_id" attribute_name="submitted sample id">21834</Attribute>
      <Attribute display_name="submitted subject id" harmonized_name="submitted_subject_id" attribute_name="submitted subject id">21834</Attribute>
      <Attribute display_name="gap sample id" harmonized_name="gap_sample_id" attribute_name="gap_sample_id">105542</Attribute>
      <Attribute display_name="gap subject id" harmonized_name="gap_subject_id" attribute_name="gap_subject_id">28577</Attribute>
      <Attribute display_name="sex" harmonized_name="sex" attribute_name="sex">male</Attribute>
      <Attribute display_name="analyte type" harmonized_name="analyte_type" attribute_name="analyte type">DNA</Attribute>
      <Attribute display_name="subject is affected" harmonized_name="subject_is_affected" attribute_name="subject is affected"/>
      <Attribute display_name="gap consent code" harmonized_name="gap_consent_code" attribute_name="gap_consent_code">1</Attribute>
      <Attribute display_name="gap consent short name" harmonized_name="gap_consent_short_name" attribute_name="gap_consent_short_name">GRU</Attribute>
    </Attributes>
    <Status when="2012-09-21T22:44:26.843" status="suppressed"/>
  </BioSample>
</BioSampleSet>

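I want to parse the XML file above programmatically. I tried lxml, but I ran into trouble extracting the keys and values inside the <Attributes> tag, because every child tag is named Attribute. Does anyone have a suggestion?

I also tried splitting the text with a regular expression on "Attributes", but because the whole file is a single line, the result was a list of individual letters from the matched section.

I am using Python, and the number of <Attribute> tags may change over time. I am currently using the following code: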

^{pr2}$
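One possible approach, sketched here under some assumptions: because the child tags share a name, they can be told apart by their XML attributes rather than by tag name. The sketch below uses the standard-library `xml.etree.ElementTree` instead of lxml so that it runs without extra packages (`lxml.etree` offers a largely compatible `findall`/`get` interface). The function name `parse_biosample_attributes`, the trimmed `SAMPLE_XML`, and the choice of `harmonized_name` as the dictionary key are illustrative assumptions, not part of the original question.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the sample above (assumption: real files share this layout).
SAMPLE_XML = """<?xml version="1.0"?>
<BioSampleSet>
  <BioSample accession="SAMN01347139" id="1347139">
    <Attributes>
      <Attribute display_name="gap accession" harmonized_name="gap_accession" attribute_name="gap_accession">phs000044</Attribute>
      <Attribute display_name="sex" harmonized_name="sex" attribute_name="sex">male</Attribute>
      <Attribute display_name="subject is affected" harmonized_name="subject_is_affected" attribute_name="subject is affected"/>
    </Attributes>
  </BioSample>
</BioSampleSet>"""


def parse_biosample_attributes(xml_text):
    """Return {accession: {harmonized_name: text}} for every <BioSample>.

    Hypothetical helper: the key choice (harmonized_name) is an assumption;
    display_name or attribute_name would work the same way.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for sample in root.iter("BioSample"):
        attrs = {}
        # findall returns every matching <Attribute>, however many there are.
        for attribute in sample.findall("./Attributes/Attribute"):
            key = attribute.get("harmonized_name")
            # Empty elements such as <Attribute .../> have text None.
            attrs[key] = attribute.text
        result[sample.get("accession")] = attrs
    return result


samples = parse_biosample_attributes(SAMPLE_XML)
for accession, attrs in samples.items():
    for key, value in attrs.items():
        print(accession, key, value)
```

Since the loop iterates over whatever `findall` returns, it does not depend on a fixed number of <Attribute> tags, and it does not matter whether the file is stored on a single line.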

Tags: sample, name, id, is, repository, display, attribute, consent
