Converting complex XML to CSV using Python

Posted 2024-04-27 16:51:44


<app>
<doc>
<field name="id">013</field>
<field name="groupid">013</field>
<field name="img_url">8b4</field>
<field name="filetype">HTML</field>
<field name="url">https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward-u-s-/</field>
<field name="topic">accurate</field>
<field name="topic">additional</field>
<field name="topic">agriculture</field>
<field name="topic">area</field>
<field name="topic">biggest</field>
</doc>
<doc>
<field name="id">0131</field>
<field name="groupid">013</field>
<field name="img_url">8b</field>
<field name="filetype">HTML</field>
<field name="url">https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward</field>
<field name="topic">accurate</field>
<field name="topic">additional</field>
<field name="topic">agriculture</field>
<field name="topic">area</field>
<field name="topic">biggest1</field>
<field name="topic">biggest2</field>
<field name="topic">biggest3</field>
</doc>
</app>

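I have XML like the sample above and need to convert it to CSV in Python. Does anyone know how to do this? The topic fields differ from document to document. The CSV header should match the field names, and all topics of a document should go into a single cell, separated by commas.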

Expected output: [image not available]


2 answers

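You can use an XML parser and emit element data as it is parsed to build the CSV. On each end tag, you either add a value to the current row or write the row itself. One advantage of iterparse is that the whole document does not need to be loaded into memory before processing starts.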

import csv
import xml.etree.ElementTree as ET

field_names = ["id", "groupid", "img_url", "filetype", "url", "topic"]
field_names_set = set(field_names)

with open("test.csv", "w", newline="") as out_file:
    writer = csv.DictWriter(out_file, field_names)
    writer.writeheader()
    row = {}
    topic = []
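    for event, elem in ET.iterparse("test.xml"):  # default: "end" events only
        if elem.tag == "doc":
            # end of <doc>: write the row to the CSV and reset for the next one
            row["topic"] = ",".join(topic)
            writer.writerow(row)
            row = {}
            topic = []
            elem.clear()  # release the processed <doc> so memory use stays low
        elif elem.tag == "field":
            # end of <field>: add its value to the current row
            name = elem.attrib["name"]
            if name == "topic":
                topic.append(elem.text or "")
            elif name in field_names_set:
                # names missing from the header would make DictWriter raise ValueError
                row[name] = elem.text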
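
The question mentions that field names can differ between documents, while the approach above relies on a fixed header list. One possible way around that, sketched here as an assumption rather than as part of the original answer, is to parse the file twice with iterparse: a first pass collects the field names, a second pass writes the rows. The function names `collect_field_names` and `xml_to_csv` are hypothetical, and every repeated field (not only "topic") is joined into one cell.

```python
import csv
import xml.etree.ElementTree as ET


def collect_field_names(xml_path):
    """First pass: gather every distinct field name, in first-seen order."""
    names = []
    seen = set()
    for _, elem in ET.iterparse(xml_path):  # "end" events only
        if elem.tag == "field":
            name = elem.get("name")
            if name not in seen:
                seen.add(name)
                names.append(name)
        elif elem.tag == "doc":
            elem.clear()  # drop the processed <doc> to keep memory low
    return names


def xml_to_csv(xml_path, csv_path):
    """Second pass: stream each <doc> into one CSV row."""
    field_names = collect_field_names(xml_path)
    with open(csv_path, "w", newline="") as out_file:
        # restval fills cells for fields a document does not have
        writer = csv.DictWriter(out_file, field_names, restval="")
        writer.writeheader()
        values = {}
        for _, elem in ET.iterparse(xml_path):
            if elem.tag == "field":
                # repeated names are collected and joined later
                values.setdefault(elem.get("name"), []).append(elem.text or "")
            elif elem.tag == "doc":
                writer.writerow({k: ",".join(v) for k, v in values.items()})
                values = {}
                elem.clear()
```

Calling `xml_to_csv("test.xml", "test.csv")` would then write one row per doc. Reading the file twice costs extra I/O, but it keeps the low memory footprint that makes iterparse attractive in the first place.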

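The code below produces CSV-like output. Is this what you are looking for?
Note that in this output you cannot tell which values are topics and which are not.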

import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<app>
   <doc>
      <field name="id">013</field>
      <field name="groupid">013</field>
      <field name="img_url">8b4</field>
      <field name="filetype">HTML</field>
      <field name="url">https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward-u-s-/</field>
      <field name="topic">accurate</field>
      <field name="topic">additional</field>
      <field name="topic">agriculture</field>
      <field name="topic">area</field>
      <field name="topic">biggest</field>
   </doc>
   <doc>
      <field name="id">0131</field>
      <field name="groupid">013</field>
      <field name="img_url">8b</field>
      <field name="filetype">HTML</field>
      <field name="url">https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward</field>
      <field name="topic">accurate</field>
      <field name="topic">additional</field>
      <field name="topic">agriculture</field>
      <field name="topic">area</field>
      <field name="topic">biggest1</field>
      <field name="topic">biggest2</field>
      <field name="topic">biggest3</field>
   </doc>
</app>'''
root = ET.fromstring(xml)
first_time = True
headers = set()
for doc in root.findall('.//doc'):
    data = []
    for field in doc.findall('field'):
        if first_time:
            headers.add(field.attrib['name'])
        data.append((field.attrib['name'], field.text))
    if first_time:
        print(','.join(sorted(headers)))
        first_time = False
    print(','.join(y[1] for y in sorted(data, key=lambda x: x[0])))

Output

filetype,groupid,id,img_url,topic,url
HTML,013,013,8b4,accurate,additional,agriculture,area,biggest,https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward-u-s-/
HTML,013,0131,8b,accurate,additional,agriculture,area,biggest1,biggest2,biggest3,https://calgaryherald.com/pmn/business-pmn/sally-rumbles-toward
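
Because the topics are printed with plain commas, they spill across several columns in the output above. A minimal sketch of one way to keep them in a single cell, assuming it is acceptable to let the csv module quote any value that contains a comma, follows. The function name `docs_to_csv_text` is hypothetical and not from the original answer.

```python
import csv
import io
import xml.etree.ElementTree as ET


def docs_to_csv_text(xml_text):
    """Return CSV text with one row per <doc>; repeated fields share one cell."""
    root = ET.fromstring(xml_text)
    rows = []
    for doc in root.findall(".//doc"):
        row = {}
        for field in doc.findall("field"):
            # collect every value under its field name
            row.setdefault(field.attrib["name"], []).append(field.text or "")
        rows.append(row)

    # sorted union of all field names, as in the answer above
    headers = sorted({name for row in rows for name in row})

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(headers)
    for row in rows:
        # csv.writer quotes cells containing commas, so topics stay together
        writer.writerow([",".join(row.get(name, [])) for name in headers])
    return buffer.getvalue()
```

With this version a spreadsheet or `csv.reader` sees the joined topics as one value, so the topic column can be told apart from the others.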
