I am trying to convert an XML file into DataFrames that I can add to a dictionary. The structure is as follows:
<id =123>
<table_group>
<table_1>
<col_heading1>value1</col_heading1>
<col_heading2>value2</col_heading2>
<col_heading3>value3</col_heading3>
</table_1>
<table_2>
<col_heading1> value1 </col_heading1>
<col_heading2> value2 </col_heading2>
<col_heading3>
<sub_col_heading1> value3 </sub_col_heading1>
</col_heading3>
</table_2>
</table_group>
</id>
<id =124>
<table_group>
<table_1>
<col_heading1>value1</col_heading1>
<col_heading2>value2</col_heading2>
<col_heading3>value3</col_heading3>
</table_1>
<table_2>
<col_heading1> value1 </col_heading1>
<col_heading2> value2 </col_heading2>
<col_heading3>
<sub_col_heading1> value3 </sub_col_heading1>
</col_heading3>
</table_2>
</table_group>
</id>
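This is a large file: it contains many ids, each with several table groups, each group with several tables, and each table with several columns.
I want a flexible loop that creates one DataFrame per table and adds a row for every id with its respective values. However, some tables have values that sit several levels deep. See table_2, col_heading3: this node has a child node, sub_col_heading1. I wrote code that works when a col_heading has only one extra level, but I have noticed that some tables have more than one.
The code below works for a single table. Ideally it would be flexible enough to loop over multiple tables, but every table has a different structure, and their child nodes can be nested to several levels.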
import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('xml')
root = tree.getroot()

rows = []  # one dict per id, all for a single table
for id_node in root[1]:  # loops over all the ids in the XML file, about 12000 total
    # Iterating an Element directly replaces getchildren(), which was removed in Python 3.9
    cols = ['InstrumentId']  # the table does not include the id, so add it as the first column
    values = [str(id_node.attrib)[8:-2]]  # isolates the id number, which is stored as an attribute
    # Ideally this would loop over every table as id_node[0][table];
    # right now id_node[0][0] handles only the first table of the first table group
    for i in id_node[0][0]:
        if len(i) > 0:  # i has child elements, so its own .text is only whitespace
            for x in i:  # next level under i; a deeper level would need the same check again
                cols.append(x.tag)  # child field name
                values.append(x.text)  # child value
                cols.append(i.tag)  # parent field name
                values.append(x.tag)  # parent value = child field name
        else:
            cols.append(i.tag)  # only one level: field name
            values.append(i.text)  # value
    rows.append(dict(zip(cols, values)))

# DataFrame.append was removed in pandas 2.0, so build the frame once from the list of rows
dffx = pd.DataFrame(rows)
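The nested if statement is the part I want to turn into an open-ended conditional loop. If the first level has no text value but has children, take the child's value and set the parent's value to the child's name; if the child has no value but has grandchildren, take the grandchild's value and set the child's value to the grandchild's name, and so on.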
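One way to express that "keep going down while there are children" rule is a recursive helper. The following is a minimal sketch, not a drop-in solution: the `<root>` wrapper, the `<instrument id="...">` element name, the sample values, and the helper names `flatten` and `tables_to_frames` are all assumptions made so that the example is well-formed and runnable. When a node has several children, the sketch joins their names with commas as the parent's value, which is one possible choice among several.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Hypothetical, well-formed stand-in for the real file. The <root> wrapper,
# the <instrument id="..."> element name and all values are assumptions.
SAMPLE = """
<root>
  <instrument id="123">
    <table_group>
      <table_1>
        <col_heading1>a</col_heading1>
        <col_heading2>b</col_heading2>
      </table_1>
      <table_2>
        <col_heading1> c </col_heading1>
        <col_heading3>
          <sub_col_heading1>
            <sub_sub_col_heading1> d </sub_sub_col_heading1>
          </sub_col_heading1>
        </col_heading3>
      </table_2>
    </table_group>
  </instrument>
  <instrument id="124">
    <table_group>
      <table_1>
        <col_heading1>e</col_heading1>
        <col_heading2>f</col_heading2>
      </table_1>
      <table_2>
        <col_heading1> g </col_heading1>
        <col_heading3> h </col_heading3>
      </table_2>
    </table_group>
  </instrument>
</root>
"""


def flatten(node, row):
    """Recursively copy the columns under `node` into the dict `row`, at any depth."""
    for child in node:
        if len(child):  # child has children of its own, so it carries no value
            # Parent value = name(s) of its children (comma-joined if several).
            row[child.tag] = ",".join(grandchild.tag for grandchild in child)
            flatten(child, row)  # same rule one level further down
        else:
            row[child.tag] = (child.text or "").strip()
    return row


def tables_to_frames(root):
    """Return {table tag: DataFrame} with one row per id that contains that table."""
    rows_by_table = {}
    for id_node in root:          # every id element
        for group in id_node:     # every table group under the id
            for table in group:   # every table in the group
                row = {"InstrumentId": id_node.get("id")}
                flatten(table, row)
                rows_by_table.setdefault(table.tag, []).append(row)
    # Build each DataFrame once from its list of row dicts.
    return {name: pd.DataFrame(rows) for name, rows in rows_by_table.items()}


frames = tables_to_frames(ET.fromstring(SAMPLE))
print(frames["table_2"])
```

Because tables are grouped by tag name, each table gets its own DataFrame regardless of how its columns are nested, and ids whose table lacks a given nested column simply get NaN in that column. For the real file, `root` would be whichever element directly contains the ids (`root[1]` in the code above), and the id would be read from whatever attribute name the file actually uses.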