用Python优化XML解析成CSV

3条回答

网友

1楼 · 编辑于 2024-04-25 22:40:18

所以问题似乎是熊猫数据帧和序列的使用。使用上面的代码，处理一个包含约4000条记录的xml文件需要4-120秒。时间随着程序的运行而增加。你知道吗

使用python列表或numpy矩阵（在csv中工作更方便）显著减少了运行时间-现在每个xml文件处理最多需要0.1-0.5秒。你知道吗

我每次都使用下面的代码附加新的已处理记录

records = np.append(records, new_redocrds, axis=0)

这相当于：

tempdf = tempdf.append(s, ignore_index=True)

但要快得多。你知道吗

希望这能帮助任何可能遇到类似问题的人！你知道吗

网友

2楼 · 编辑于 2024-04-25 22:40:18

实际上考虑一下XSLT，这是一种特殊用途的语言，用于将XML文件转换成其他XML甚至CSV之类的文本文件。唯一需要的第三方库是Python的lxml，它可以运行xslt1.0脚本，省去了诸如Pandas和Numpy等更重、更广泛的分析工具。你知道吗

事实上，因为XSLT是一种独立的行业语言，所以它是可移植的，可以用XSLT库（例如Java、PHP、Perl、C#、VB）或独立的1.0、2.0或3.0processors（例如Xalan、Saxon）在任何语言中运行，所有这些都可以作为命令行调用。你知道吗

XSLT（下面另存为一个.xsl文件，一个特殊的.xml文件）

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                              xmlns:S="http://schemas.xmlsoap.org/soap/envelope/"
                              xmlns:ns3="http://www.siri.org.uk/siri" 
                              xmlns:ns4="http://www.ifopt.org.uk/acsb" 
                              xmlns:ns5="http://www.ifopt.org.uk/ifopt" 
                              xmlns:ns6="http://datex2.eu/schema/1_0/1_0" 
                              xmlns:ns7="http://new.webservice.namespace">

   <xsl:output method="text" indent="yes" omit-xml-declaration="yes"/>
   <xsl:strip-space elements="*"/>

   <xsl:template match ="/S:Envelope/S:Body/ns7:GetStopMonitoringServiceResponse/Answer">
       <xsl:apply-templates select="ns3:StopMonitoringDelivery"/>
   </xsl:template>

   <xsl:template match="ns3:StopMonitoringDelivery">
        <!  HEADERS  >
        <!  <xsl:text>RecordedAtTime,MonitoringRef,LineRef,DirectionRef,PublishedLineName,OperatorRef,DestinationRef,OriginAimedDepartureTime,Longitude,Latitude,VehicleRef,StopPointRef,ExpectedArrivalTime,AimedArrivalTime&#xa;</xsl:text>  >
        <xsl:apply-templates select="ns3:MonitoredStopVisit"/>
        <xsl:text>&#xa;</xsl:text>
   </xsl:template>

   <xsl:template match="ns3:MonitoredStopVisit">
       <xsl:variable name="delim">,</xsl:variable>
       <xsl:variable name="quote">&quot;</xsl:variable>
       <!  DATA ROWS  >
       <xsl:value-of select="concat($quote, ns3:RecordedAtTime, $quote, $delim,
                                    $quote, ns3:MonitoringRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:LineRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:DirectionRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:PublishedLineName, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:OperatorRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:DestinationRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:OriginAimedDepartureTime, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:VehicleLocation/ns3:Longitude, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:VehicleLocation/ns3:Latitude, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:VehicleRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:MonitoredCall/ns3:StopPointRef, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:MonitoredCall/ns3:ExpectedArrivalTime, $quote, $delim,
                                    $quote, ns3:MonitoredVehicleJourney/ns3:MonitoredCall/ns3:AimedArrivalTime, $quote, $delim
                                    )"/>
   </xsl:template>

</xsl:stylesheet>

Online Demo

Python（无附加列表、数组或数据帧）

import glob                 # TO RETRIEVE ALL XML FILES
import lxml.etree as et     # TO PARSE XML AND RUN XSLT

xml_path = "/path/to/xml/files"

# PARSE XSLT
xsl = et.parse('XSLTScript.xsl')

# BUILD CSV
with open("MonitoredStopVisits.csv", 'w') as f:
    # HEADER
    f.write('RecordedAtTime,MonitoringRef,LineRef,DirectionRef,PublishedLineName,'
            'OperatorRef,DestinationRef,OriginAimedDepartureTime,Longitude,Latitude,'
            'VehicleRef,StopPointRef,ExpectedArrivalTime,AimedArrivalTime\n')

    # DATA ROWS
    for f in glob.glob(xml_path + "/**/*.xml", recursive=True):
        # LOAD XML AND XSL SCRIPT
        xml = et.parse(f)

        # TRANSFORM XML TO STRING RESULT TREE
        transform = et.XSLT(xsl)
        result = str(transform(xml))

        # WRITE TO CSV
        f.write(result)

网友

3楼 · 编辑于 2024-04-25 22:40:18

以下是我对熊猫的解决方案：

每个5Mb文件的计算时间约为0.4s

import xml.etree.ElementTree as ET
import re
import pandas as pd
import os



def collect_data(xml_file):
    # create xml object
    root = ET.parse(xml_file).getroot()

    # collect raw data
    out_data = []
    for element in root.iter():
        # get tag name
        tag = re.sub('{.*?}', '', element.tag)
        # add break segment element
        if tag == 'RecordedAtTime':
            out_data.append('break')

        if tag in tag_list:
            out_data.append((tag, element.text))

    # get break indexes
    break_index = [i for i, x in enumerate(out_data) if x == 'break']

    # break list into parts
    list_data = []
    for i in range(len(break_index) - 1):
        list_data.append(out_data[break_index[i]:break_index[i + 1]])

    # check for each value in data
    final_data = []
    for item in list_data:
        # delete bleak element ad convert list into dictionary
        del item[item.index('break')]
        data_dictionary = dict(item)

        if 'RecordedAtTime' in data_dictionary.keys():
            recorded_at_time = data_dictionary.get('RecordedAtTime')
        else:
            recorded_at_time = ''

        if 'MonitoringRef' in data_dictionary.keys():
            monitoring_ref = data_dictionary.get('MonitoringRef')
        else:
            monitoring_ref = ''

        if 'LineRef' in data_dictionary.keys():
            line_ref = data_dictionary.get('LineRef')
        else:
            line_ref = ''

        if 'DirectionRef' in data_dictionary.keys():
            direction_ref = data_dictionary.get('DirectionReff')
        else:
            direction_ref = ''

        if 'PublishedLineName' in data_dictionary.keys():
            published_line_name = data_dictionary.get('PublishedLineName')
        else:
            published_line_name = ''

        if 'OperatorRef' in data_dictionary.keys():
            operator_ref = data_dictionary.get('OperatorRef')
        else:
            operator_ref = ''

        if 'DestinationRef' in data_dictionary.keys():
            destination_ref = data_dictionary.get('DestinationRef')
        else:
            destination_ref = ''

        if 'OriginAimedDepartureTime' in data_dictionary.keys():
            origin_aimed_departure_time = data_dictionary.get('OriginAimedDepartureTime')
        else:
            origin_aimed_departure_time = ''

        if 'Longitude' in data_dictionary.keys():
            longitude = data_dictionary.get('Longitude')
        else:
            longitude = ''

        if 'Latitude' in data_dictionary.keys():
            latitude = data_dictionary.get('Latitude')
        else:
            latitude = ''

        if 'VehicleRef' in data_dictionary.keys():
            vehicle_ref = data_dictionary.get('VehicleRef')
        else:
            vehicle_ref = ''

        if 'StopPointRef' in data_dictionary.keys():
            stop_point_ref = data_dictionary.get('StopPointRef')
        else:
            stop_point_ref = ''

        if 'ExpectedArrivalTime' in data_dictionary.keys():
            expected_arrival_time = data_dictionary.get('ExpectedArrivalTime')
        else:
            expected_arrival_time = ''

        if 'AimedArrivalTime' in data_dictionary.keys():
            aimed_arrival_time = data_dictionary.get('AimedArrivalTime')
        else:
            aimed_arrival_time = ''

        final_data.append((recorded_at_time, monitoring_ref, line_ref, direction_ref, published_line_name, operator_ref,
                       destination_ref, origin_aimed_departure_time, longitude, latitude, vehicle_ref,
                       stop_point_ref,
                       expected_arrival_time, aimed_arrival_time))

     return final_data


# setup tags list for checking
tag_list = ['RecordedAtTime', 'MonitoringRef', 'LineRef', 'DirectionRef', 'PublishedLineName', 'OperatorRef',
            'DestinationRef', 'OriginAimedDepartureTime', 'Longitude', 'Latitude', 'VehicleRef', 'StopPointRef',
            'ExpectedArrivalTime', 'AimedArrivalTime']

# collect data from each file
save_data = []
for file_name in os.listdir(os.getcwd()):
    if file_name.endswith('.xml'):
        save_data.append(collect_data(file_name))
    else:
        pass

# merge list of lists
flat_list = []
for sublist in save_data:
    for item in sublist:
        flat_list.append(item)

# load data into data frame
data = pd.DataFrame(flat_list, columns=tag_list)

# save data to file
data.to_csv('data.csv', index=False)

相关问题更多 >

编程相关推荐

热门问题

热门文章