Processing complex XML data with Python

Posted 2024-05-14 22:14:27

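I have a complex XML document in which the columns are defined in tbd:ColDef. In tbd:RepBodyRow the values are filled in without tags that name them; each row is just a sequence of tbd:CellData elements. We receive this message from an upstream system and need to load the data with Python.

I expect the following result: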

Place         |Parent    |Type 
USA           |USA       |Country 
New York      |USA       |State
NYC           |New York  |City
Manhattan     |NYC       |Town....

Input XML:

<?xml version="1.0" encoding="UTF-8"?>
<tbd:Wrapper xmlns:tbd="abc.com/tbd">
<tbd:Root xmlns:tbd="abc.com/tbd">
<tbd:JobCnxt>
    <tbd:Job Date="2021-07-02" Country="USA" />
</tbd:JobCnxt>
<tbd:JobRep RepName="Country Data">
    <tbd:RepDef>
    <tbd:ColDef name="default">
        <tbd:ColDefData Name="place" Type="string" DisplayName="Place"/>
        <tbd:ColDefData Name="placeType" Type="string" DisplayName="Type"/>
    </tbd:ColDef>
    </tbd:RepDef>
    <tbd:RepBody>
    <tbd:RepBodyGroup GroupTypeName="default" GroupDefName="default" GroupDisplayName="">
        <tbd:RepBodyGroup GroupTypeName="distinct" GroupDefName="place" GroupDisplayName="USA">
        <tbd:RepBodyRow ColDefName="default" IsTotalRow="true">
            <tbd:CellData Value="USA"/>
            <tbd:CellData Value="Country"/>
        </tbd:RepBodyRow>
        <tbd:RepBodyGroup GroupTypeName="distinct" GroupDefName="place" GroupDisplayName="New York">
            <tbd:RepBodyRow ColDefName="default" IsTotalRow="true">
            <tbd:CellData Value="New York"/>
            <tbd:CellData Value="State"/>
            </tbd:RepBodyRow>
            <tbd:RepBodyGroup/>
            <tbd:RepBodyGroup GroupTypeName="distinct" GroupDefName="place" GroupDisplayName="NYC">
            <tbd:RepBodyRow ColDefName="default" IsTotalRow="true">
            <tbd:CellData Value="NYC"/>
            <tbd:CellData Value="City"/>
            </tbd:RepBodyRow>
            <tbd:RepBodyRow ColDefName="default">
            <tbd:CellData Value="Manhattan"/>
            <tbd:CellData Value="Town"/>
            </tbd:RepBodyRow>
            <tbd:RepBodyRow ColDefName="default">
            <tbd:CellData Value="Bronx"/>
            <tbd:CellData Value="Town"/>
            </tbd:RepBodyRow>
            <tbd:RepBodyRow ColDefName="default">
            <tbd:CellData Value="Brooklyn"/>
            <tbd:CellData Value="Town"/>
            </tbd:RepBodyRow>
            <tbd:RepBodyRow ColDefName="default">
            <tbd:CellData Value="Queens"/>
            <tbd:CellData Value="Town"/>
            </tbd:RepBodyRow>
        </tbd:RepBodyGroup>
        </tbd:RepBodyGroup>
    </tbd:RepBodyGroup>
    </tbd:RepBodyGroup>
    </tbd:RepBody>
</tbd:JobRep>
</tbd:Root>
</tbd:Wrapper>
    

2 answers

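Given the geography involved, this XML does not have the most sensible structure, and the fact that it uses namespaces does not help either. You can still get close enough.

On the other hand, given that structure, there is no USA row in the result, because USA is the top-level entity and has no parent in this sense. Also, FYI and FWIW: Manhattan, for example, is strictly speaking a borough rather than a town.

That said: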

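import pandas as pd
from lxml import etree

data = """your xml above"""
# The prefix "x" is arbitrary; only the namespace URI has to match the document.
ns = {"x": "abc.com/tbd"}
doc = etree.XML(data.encode())
columns = ["Place", "Type", "Parent"]
rows = []
# [1:] skips the first row (USA), which has no parent.
for pair in doc.xpath('//x:RepBodyGroup/x:RepBodyRow', namespaces=ns)[1:]:
    # Place and Type come straight from the row's CellData values.
    item = pair.xpath('.//x:CellData/@Value', namespaces=ns)
    # Detail rows: the parent is the nearest preceding total row in the same group.
    group = pair.xpath('./preceding-sibling::x:RepBodyRow[@IsTotalRow="true"][1]/x:CellData[1]/@Value', namespaces=ns)
    # Total rows: the parent is the nearest total row that precedes this row's own group.
    group2 = pair.xpath('../preceding-sibling::x:RepBodyRow[@IsTotalRow="true"][1]/x:CellData[1]/@Value', namespaces=ns)
    item.append(group[0] if len(group) > 0 else group2[0])
    rows.append(item)
print(pd.DataFrame(rows, columns=columns))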

Output:

       Place   Type    Parent
0   New York  State       USA
1        NYC   City  New York
2  Manhattan   Town       NYC
3      Bronx   Town       NYC
4   Brooklyn   Town       NYC
5     Queens   Town       NYC
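
The column list in the code above is hardcoded. Since the question says the columns are defined in tbd:ColDef, the headers could probably be read from the document instead. The following is only a minimal sketch of that idea. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages, SAMPLE is a trimmed stand-in for the real message, rows_as_dicts is a hypothetical helper name, and it assumes that a row's ColDefName attribute refers to the name attribute of a tbd:ColDef.

```python
import xml.etree.ElementTree as ET

NS = {"tbd": "abc.com/tbd"}
COL_DEF = "{abc.com/tbd}ColDef"      # Clark notation: {namespace-uri}local-name
BODY_ROW = "{abc.com/tbd}RepBodyRow"

# Trimmed, illustrative sample in the same shape as the question's XML.
SAMPLE = """<tbd:Root xmlns:tbd="abc.com/tbd">
  <tbd:RepDef>
    <tbd:ColDef name="default">
      <tbd:ColDefData Name="place" Type="string" DisplayName="Place"/>
      <tbd:ColDefData Name="placeType" Type="string" DisplayName="Type"/>
    </tbd:ColDef>
  </tbd:RepDef>
  <tbd:RepBody>
    <tbd:RepBodyRow ColDefName="default">
      <tbd:CellData Value="Manhattan"/>
      <tbd:CellData Value="Town"/>
    </tbd:RepBodyRow>
  </tbd:RepBody>
</tbd:Root>"""


def rows_as_dicts(root):
    """Pair each row's CellData values with the headers of the ColDef it names."""
    # Map each ColDef name to its ordered list of display names.
    headers = {
        col_def.get("name"): [
            col.get("DisplayName") for col in col_def.findall("tbd:ColDefData", NS)
        ]
        for col_def in root.iter(COL_DEF)
    }
    result = []
    for row in root.iter(BODY_ROW):
        values = [cell.get("Value") for cell in row.findall("tbd:CellData", NS)]
        # Assumption: ColDefName on the row matches a ColDef's name attribute.
        result.append(dict(zip(headers[row.get("ColDefName")], values)))
    return result


records = rows_as_dicts(ET.fromstring(SAMPLE))
print(records)
```

The resulting list of dictionaries can be passed to pd.DataFrame directly; the Parent column is not defined in tbd:ColDef and still has to be derived from the nesting.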

This is similar to @jackfleeting's answer, but it uses get() to read the attribute values instead of selecting the attributes directly with XPath.

from lxml import etree
import pandas as pd

tree = etree.parse("input.xml")

ns = {"tbd": "abc.com/tbd"}
cols = ["Place", "Parent", "Type"]
rows = list()

for rbr in tree.xpath(".//tbd:RepBodyGroup/tbd:RepBodyRow", namespaces=ns):
    row = list()

    # Get "Place"
    row.append(rbr.xpath("tbd:CellData[1]", namespaces=ns)[0].get("Value", "N/A"))

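    # Get "Parent"
    # There won't always be a preceding sibling RepBodyRow.
    # Note: ".." is the row's own group, so for a row without IsTotalRow this
    # picks up a row that precedes that group, not the group's own total row.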
    try:
        parent = rbr.xpath("../preceding-sibling::tbd:RepBodyRow/tbd:CellData[1]",
                           namespaces=ns)[0].get("Value", "N/A")
    except IndexError:
        parent = "N/A"
    row.append(parent)

    # Get "Type"
    row.append(rbr.xpath("tbd:CellData[2]", namespaces=ns)[0].get("Value", "N/A"))
    rows.append(row)

print(pd.DataFrame(rows, columns=cols))

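Printed output (several of these values do not appear in the sample XML above, so it was apparently produced from a different version of the input):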

       Place     Parent      Type
0        USA        N/A   Country
1   New York        USA     State
2        NYC   New York      City
3  Manhattan        NYC  Sub City
4   County 1  Manhattan    County
5     City 1  Manhattan      Town
6     City 2  Manhattan      Town
7     City 3  Manhattan      Town
8   County 1  Manhattan    County
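
Both answers derive the parent from preceding-sibling rows. Another option is to walk the nested tbd:RepBodyGroup elements recursively and carry the current parent down the tree. The sketch below is a hedged illustration of that approach, not the method used in either answer. It uses the standard library's xml.etree.ElementTree instead of lxml, SAMPLE is a trimmed stand-in for the question's XML, collect_rows is a hypothetical helper name, and it assumes that the row marked IsTotalRow="true" names its group and that every row has exactly two tbd:CellData cells.

```python
import xml.etree.ElementTree as ET

NS = {"tbd": "abc.com/tbd"}
GROUP = "{abc.com/tbd}RepBodyGroup"
ROW = "{abc.com/tbd}RepBodyRow"

# Trimmed, illustrative sample in the same shape as the question's XML.
SAMPLE = """<tbd:Root xmlns:tbd="abc.com/tbd">
  <tbd:RepBody>
    <tbd:RepBodyGroup GroupDisplayName="">
      <tbd:RepBodyGroup GroupDisplayName="USA">
        <tbd:RepBodyRow IsTotalRow="true">
          <tbd:CellData Value="USA"/><tbd:CellData Value="Country"/>
        </tbd:RepBodyRow>
        <tbd:RepBodyGroup GroupDisplayName="New York">
          <tbd:RepBodyRow IsTotalRow="true">
            <tbd:CellData Value="New York"/><tbd:CellData Value="State"/>
          </tbd:RepBodyRow>
          <tbd:RepBodyGroup/>
          <tbd:RepBodyGroup GroupDisplayName="NYC">
            <tbd:RepBodyRow IsTotalRow="true">
              <tbd:CellData Value="NYC"/><tbd:CellData Value="City"/>
            </tbd:RepBodyRow>
            <tbd:RepBodyRow>
              <tbd:CellData Value="Manhattan"/><tbd:CellData Value="Town"/>
            </tbd:RepBodyRow>
          </tbd:RepBodyGroup>
        </tbd:RepBodyGroup>
      </tbd:RepBodyGroup>
    </tbd:RepBodyGroup>
  </tbd:RepBody>
</tbd:Root>"""


def collect_rows(group, parent, out):
    """Append (place, parent, type) tuples for one group and all of its subgroups."""
    current = parent  # name that detail rows and subgroups of this group belong to
    for child in group:
        if child.tag == ROW:
            place, kind = [c.get("Value") for c in child.findall("tbd:CellData", NS)]
            if child.get("IsTotalRow") == "true":
                # The total row names this group; its parent is the enclosing group.
                out.append((place, parent, kind))
                current = place
            else:
                out.append((place, current, kind))
        elif child.tag == GROUP:
            collect_rows(child, current, out)
    return out


root = ET.fromstring(SAMPLE)
rows = []
# Start from the top-level groups directly under RepBody; they have no parent.
for top in root.findall(".//tbd:RepBody/tbd:RepBodyGroup", NS):
    collect_rows(top, None, rows)
print(rows)
```

The top-level entry gets None as its parent here. To match the table in the question, where USA is listed as its own parent, the None could be replaced with the place name when building the DataFrame.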
