How do I create an unlimited conditional if loop to parse XML into a DataFrame?

Posted 2024-04-28 11:40:46


I am trying to convert an XML file into DataFrames that I will add to a dictionary. The structure is as follows:

<id =123>
      <table_group>
        <table_1>
          <col_heading1>value1</col_heading1>
          <col_heading2>value2</col_heading2>
          <col_heading3>value3</col_heading3>
        </table_1>
        <table_2>
          <col_heading1> value1 </col_heading1>
          <col_heading2> value2 </col_heading2>
          <col_heading3>
              <sub_col_heading1> value3 </sub_col_heading1>
          </col_heading3>
        </table_2>
      </table_group>
</id>
<id =124>
      <table_group>
        <table_1>
          <col_heading1>value1</col_heading1>
          <col_heading2>value2</col_heading2>
          <col_heading3>value3</col_heading3>
        </table_1>
        <table_2>
          <col_heading1> value1 </col_heading1>
          <col_heading2> value2 </col_heading2>
          <col_heading3>
              <sub_col_heading1> value3 </sub_col_heading1>
          </col_heading3>
        </table_2>
      </table_group>
</id>

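The file is large: it holds many IDs, each with several table groups, and each table group contains several tables with multiple columns.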

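I want a flexible loop that creates one DataFrame per table and adds each id, with its respective values, to that table. However, some tables have values that sit several levels down, in children or deeper descendants. See table_2, col_heading3: this node has a child node, sub_col_heading1. The code I wrote works when a col_heading has only one extra level, but I have noticed that some tables have more than one.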

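This code works for one table. Ideally it would be flexible enough to loop over multiple tables, but each table has a different structure, and their child elements can be nested several levels deep.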

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('xml')
root = tree.getroot()
rows = []  # one dict per id; turned into the DataFrame for one table at the end

for id_elem in root[1]:  # loops over all the ids in the xml file, about 12000 total
    cols = ['InstrumentId']  # column headings; the ID is added because the table does not include it
    values = [str(id_elem.attrib)[8:-2]]  # isolates the id number, as it is an attribute
    # ideally this would loop over every table, e.g. id_elem[0][table]; id_elem[0][0] is one table for now
    for i in id_elem[0][0]:
        if len(i) > 0:  # i has child elements, so its value sits one level further down
            for x in i:  # next level under i; if x has children too, the same logic has to repeat
                cols.append(x.tag)  # child field name
                values.append(x.text)  # child value
                cols.append(i.tag)  # parent field name
                values.append(x.tag)  # parent value = child field name; the same rule should apply at deeper levels
        else:
            cols.append(i.tag)  # only one level: field name
            values.append(i.text)  # value
    rows.append(dict(zip(cols, values)))

dffx = pd.DataFrame(rows)  # DataFrame filled with the parsed XML data for one table

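The nested if statement is what I want to turn into an unlimited conditional loop. If the first level has no text value but has children, take the child's value and set the parent's value to the child's name; if the child has no value but has grandchildren, take the grandchild's value and set the child's value to the grandchild's name, and so on.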
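
The rule described above can be written as a recursive function instead of ever-deeper nested if statements. What follows is only a minimal sketch under stated assumptions, not a tested solution for the real file. The wrapper element `root`, the element name `instrument`, the attribute name `id`, the extra `sub_sub_col_heading1` level, and the helper names `flatten` and `build_frames` are all hypothetical. Joining several child names with a comma is also an assumption, since the question only describes a parent with a single child.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Hypothetical, well-formed stand-in for the real file (element and attribute names are assumed).
SAMPLE = """
<root>
  <instrument id="123">
    <table_group>
      <table_1>
        <col_heading1>value1</col_heading1>
        <col_heading2>value2</col_heading2>
      </table_1>
      <table_2>
        <col_heading1> value1 </col_heading1>
        <col_heading3>
          <sub_col_heading1>
            <sub_sub_col_heading1> value3 </sub_sub_col_heading1>
          </sub_col_heading1>
        </col_heading3>
      </table_2>
    </table_group>
  </instrument>
  <instrument id="124">
    <table_group>
      <table_1>
        <col_heading1>value1</col_heading1>
        <col_heading2>value2</col_heading2>
      </table_1>
    </table_group>
  </instrument>
</root>
"""


def flatten(elem, row):
    """Record elem in row; recurse for as many levels as the element has."""
    children = list(elem)
    if not children:
        # Leaf node: its own text is the value.
        row[elem.tag] = (elem.text or "").strip()
        return
    # No value of its own: the parent's value becomes the child name(s).
    # Joining multiple names with a comma is an assumption of this sketch.
    row[elem.tag] = ",".join(child.tag for child in children)
    for child in children:
        flatten(child, row)


def build_frames(root):
    """Return {table name: DataFrame}, with one row per id in each table."""
    rows_by_table = {}
    for inst in root:              # one element per id
        for group in inst:         # table_group
            for table in group:    # table_1, table_2, ...
                row = {"InstrumentId": inst.get("id")}
                for col in table:
                    flatten(col, row)
                rows_by_table.setdefault(table.tag, []).append(row)
    return {name: pd.DataFrame(rows) for name, rows in rows_by_table.items()}


frames = build_frames(ET.fromstring(SAMPLE))
print(frames["table_2"])
```

Because each row is a dict keyed by tag name, a tag that appears at more than one level of the same table would overwrite the earlier value. If that can happen in the real file, prefixing the key with the parent's tag would be one way around it.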

