How do I convert XML data to a DataFrame?

Posted 2024-05-19 01:05:31


I am trying to analyze an XML file with Python. I need to get the XML data as a DataFrame.

import pandas as pd
import xml.etree.ElementTree as et
def parse_XML(xml_file, df_cols):
    xtree = et.parse(xml_file)
    xroot = xtree.getroot()
    rows = []

    for node in xroot:
        res = []
        res.append(node.attrib.get(df_cols[0]))
        for el in df_cols[1:]:
            if node is not None and node.find(el) is not None:
                res.append(node.find(el).text)
            else:
                res.append(None)
        rows.append({df_cols[i]: res[i]
                     for i, _ in enumerate(df_cols)})

    out_df = pd.DataFrame(rows, columns=df_cols)

    return out_df

parse_XML('/Users/newuser/Desktop/TESTRATP/arrets.xml', ["Name","gml"])

But I am getting the DataFrame below:

   Name   gml
0  None  None
1  None  None
2  None  None

My XML file is:

<?xml version="1.0" encoding="UTF-8"?>
<PublicationDelivery version="1.09:FR-NETEX_ARRET-2.1-1.0" xmlns="http://www.netex.org.uk/netex" xmlns:core="http://www.govtalk.gov.uk/core" xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:ifopt="http://www.ifopt.org.uk/ifopt" xmlns:siri="http://www.siri.org.uk/siri" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.netex.org.uk/netex">
    <PublicationTimestamp>2020-08-05T06:00:01+00:00</PublicationTimestamp>
    <ParticipantRef>transport.data.gouv.fr</ParticipantRef>
    <dataObjects>
        <GeneralFrame id="FR:GeneralFrame:NETEX_ARRET:" version="any">
            <members>
                <Quay id="FR:Quay:zenbus_StopPoint_SP_351400003_LOC:" version="any">
                    <Name>ST FELICIEN - Centre</Name>
                    <Centroid>
                        <Location>
                            <gml:pos srsName="EPSG:2154">828054.2068251468 6444393.512041969</gml:pos>
                        </Location>
                    </Centroid>
                    <TransportMode>bus</TransportMode>
                </Quay>
                <Quay id="FR:Quay:zenbus_StopPoint_SP_361350004_LOC:" version="any">
                    <Name>ST FELICIEN - Chemin de Juny</Name>
                    <Centroid>
                        <Location>
                            <gml:pos srsName="EPSG:2154">828747.3172982805 6445226.100290826</gml:pos>
                        </Location>
                    </Centroid>
                    <TransportMode>bus</TransportMode>
                </Quay>
                <Quay id="FR:Quay:zenbus_StopPoint_SP_343500005_LOC:" version="any">
                    <Name>ST FELICIEN - Darone</Name>
                    <Centroid>
                        <Location>
                            <gml:pos srsName="EPSG:2154">829036.2709757038 6444724.878001894</gml:pos>
                        </Location>
                    </Centroid>
                    <TransportMode>bus</TransportMode>
                </Quay>
                <Quay id="FR:Quay:zenbus_StopPoint_SP_359440004_LOC:" version="any">
                    <Name>ST FELICIEN - Col de Fontayes</Name>
                    <Centroid>
                        <Location>
                            <gml:pos srsName="EPSG:2154">829504.7993360173 6445490.57188837</gml:pos>
                        </Location>
                    </Centroid>
                    <TransportMode>bus</TransportMode>
                </Quay>
            </members>
        </GeneralFrame>
    </dataObjects>
</PublicationDelivery>
       

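I have only included a small part of the XML file here. I cannot provide the complete file because it exceeds the character limit on Stack Overflow. I would like to know why I get the output above; I cannot see where my mistake is. I am new to this. Can someone help me? Thank you.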


2 answers

My approach is to avoid XML parsing and go straight to pandas, using xmlplain to generate JSON-like objects from the XML:

import pandas as pd
import xmlplain

# convert the XML into plain Python objects (dicts and lists)
with open("so_sample.xml") as f:
    js = xmlplain.xml_to_obj(f, strip_space=True, fold_dict=True)
df1 = pd.json_normalize(js).explode("PublicationDelivery.dataObjects.GeneralFrame.members")

# clean up column names...
df1 = df1.rename(columns={c: c.replace("PublicationDelivery.", "").replace("dataObjects.GeneralFrame.", "").strip()
                          for c in df1.columns})
# drop spurious columns
df1 = df1.drop(columns=[c for c in df1.columns if c[0] == "@"])
# expand second level of dictionaries
df1 = pd.json_normalize(df1.to_dict(orient="records"))
# clean up columns from second set of dictionaries
df1 = df1.rename(columns={c: c.replace("members.Quay.", "") for c in df1.columns})
# expand next list and dicts
df1 = pd.json_normalize(df1.explode("Centroid.Location.gml:pos").to_dict(orient="records"))
# there are some NaNs - deal with them
df1["Centroid.Location.gml:pos.@srsName"] = df1["Centroid.Location.gml:pos.@srsName"].ffill()
df1["Centroid.Location.gml:pos"] = df1["Centroid.Location.gml:pos"].bfill()
# de-dup
df1 = df1.groupby("@id", as_index=False).first()

# more columns than requested... for SO output
print(df1.loc[:, ["Name", "Centroid.Location.gml:pos.@srsName", "Centroid.Location.gml:pos"]].to_string(index=False))


Output:

                          Name Centroid.Location.gml:pos.@srsName            Centroid.Location.gml:pos
          ST FELICIEN - Darone                          EPSG:2154  829036.2709757038 6444724.878001894
          ST FELICIEN - Centre                          EPSG:2154  828054.2068251468 6444393.512041969
 ST FELICIEN - Col de Fontayes                          EPSG:2154   829504.7993360173 6445490.57188837
  ST FELICIEN - Chemin de Juny                          EPSG:2154  828747.3172982805 6445226.100290826
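
If you would rather stay with xml.etree.ElementTree as in the question, the None values most likely have two causes. The loop walks the direct children of the root (PublicationTimestamp, ParticipantRef, dataObjects), which is why there are exactly three rows, and none of them is a Quay. Also, the file declares a default namespace, so a plain find("Name") does not match the namespaced Name element. Below is a minimal sketch of a namespace-aware version. It is not the approach of the answer above; the function name quays_to_df, the "netex" prefix, the output column names and the trimmed SAMPLE string are my own assumptions for illustration, while the namespace URIs are copied from the XML in the question.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Prefixes are arbitrary labels for use in search paths (assumption);
# the URIs are the ones declared in the question's XML.
NS = {
    "netex": "http://www.netex.org.uk/netex",
    "gml": "http://www.opengis.net/gml/3.2",
}


def quays_to_df(root):
    """Build one row per <Quay> element found anywhere below root."""
    rows = []
    for quay in root.iterfind(".//netex:Quay", NS):
        rows.append({
            "id": quay.get("id"),  # XML attribute of <Quay>
            "Name": quay.findtext("netex:Name", default=None, namespaces=NS),
            "pos": quay.findtext(".//gml:pos", default=None, namespaces=NS),
        })
    return pd.DataFrame(rows, columns=["id", "Name", "pos"])


# Trimmed sample with the same nesting as the question's file (hand-copied).
SAMPLE = """
<PublicationDelivery xmlns="http://www.netex.org.uk/netex"
                     xmlns:gml="http://www.opengis.net/gml/3.2">
  <dataObjects><GeneralFrame><members>
    <Quay id="FR:Quay:zenbus_StopPoint_SP_351400003_LOC:" version="any">
      <Name>ST FELICIEN - Centre</Name>
      <Centroid><Location>
        <gml:pos srsName="EPSG:2154">828054.2068251468 6444393.512041969</gml:pos>
      </Location></Centroid>
    </Quay>
  </members></GeneralFrame></dataObjects>
</PublicationDelivery>
""".strip()

df = quays_to_df(ET.fromstring(SAMPLE))
print(df.to_string(index=False))
```

For a file on disk, passing ET.parse(path).getroot() instead of ET.fromstring(SAMPLE) should work the same way.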

An alternative solution using pandas-read-xml:

# install first: pip install pandas-read-xml
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten

df = pdx.read_xml(xml, ['PublicationDelivery', 'dataObjects', 'GeneralFrame', 'members']).pipe(fully_flatten)

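The list is just the sequence of tags you want to navigate through to reach the "root". You will need to clean up the column names afterwards.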
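
One possible way to do that cleanup is to keep only the last segment of each flattened column name. This is a rough sketch: the helper shorten_columns is hypothetical, and the "|" separator and the demo column names are assumptions on my part, so check df.columns to see what the flattening actually produced before using it.

```python
from collections import Counter

import pandas as pd


def shorten_columns(df, sep="|"):
    """Rename each column to the part after the last `sep`.

    A column keeps its full name when the short name would collide
    with another column's short name.
    """
    short = [str(col).rsplit(sep, 1)[-1] for col in df.columns]
    seen = Counter(short)
    new_names = [s if seen[s] == 1 else str(col)
                 for s, col in zip(short, df.columns)]
    return df.rename(columns=dict(zip(df.columns, new_names)))


# Hypothetical flattened column names, for illustration only.
demo = pd.DataFrame({
    "Quay|@id": ["FR:Quay:zenbus_StopPoint_SP_351400003_LOC:"],
    "Quay|Name": ["ST FELICIEN - Centre"],
    "Quay|TransportMode": ["bus"],
})
clean = shorten_columns(demo)
print(list(clean.columns))  # ['@id', 'Name', 'TransportMode']
```

Keeping the full name on a collision avoids silently ending up with two columns that share a label.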
