Converting multiple XML files to CSV with Python

Posted 2024-05-12 20:50:46


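I am trying to extract specific tags from XML and write them to a CSV file. I can already extract all the identifier tags from a single XML file. My questions are: 1) how do I extract from multiple XML files into a single CSV file, and 2) the tag I need appears several times in a given XML file, so how do I extract only the first identifier tag from each record?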

I am using Python 3.7.

The desired output is:

<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>

Note: I am not a programmer! Thanks for your help.

from bs4 import BeautifulSoup as b
import itertools
import os
import csv
import pandas as pd


os.chdir(r"C:*test")

with open("aaaaahbc.xml", "r", encoding="utf-8") as f: # opening xml file
    content = f.read()

soup = b(content, 'lxml')
identifier =  [ values.text for values in soup.findAll("identifier")]

# On Python 3.x use `zip_longest`
# On Python 2.x use `izip_longest`

data = [item for item in itertools.zip_longest(identifier)] 
df  = pd.DataFrame(data=data)
df.to_csv("aaaaahbc.csv",index=True, header=False)

Sample XML file:

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
         http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
 <responseDate>2020-06-12T05:26:49Z</responseDate>
 <request verb="ListRecords" resumptionToken="2020-05-23T03:32:50Z!2037-01-01T00:00:00Z!!oai_dc!7334186!7353566!oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31648">
    http://union.ndltd.org:8080/union.OAI-PMH/</request>
 <ListRecords>
  <record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Influencia de la grasa en las propiedades físicas y sensoriales de galletas. Alternativas para la mejora del perfil de acidos grasos</dc:title>
<dc:creator>Tarancón Serrano, Paula Isabel</dc:creator>
<dc:contributor>Salvador Alcaraz, Ana</dc:contributor>
<dc:contributor>Sanz Taberner, Teresa</dc:contributor>
<dc:contributor>Tarrega Guillem, Amparo</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Escuela Técnica Superior del Medio Rural y Enología - Escola Tècnica Superior del Medi Rural i Enologia</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Instituto Universitario de Ingeniería de Alimentos para el Desarrollo - Institut Universitari d'Enginyeria d'Aliments per al Desenvolupament</dc:contributor>
<dc:subject>Galletas</dc:subject>
<dc:subject>Grasa</dc:subject>
<dc:subject>Propiedades sensoriales</dc:subject>
<dc:subject>Propiedades físicas</dc:subject>
<dc:subject>Mejora del perfil de ácidos grasos</dc:subject>
<dc:date>2013-09-02</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:type>info:eu-repo/semantics/acceptedVersion</dc:type>
<dc:identifier>http://hdl.handle.net/10251/31652</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/31652</dc:identifier>
<dc:language>spa</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>

</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance 
http://www.openarchives.org/OAI/2.0/provenance.xsd">
 <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
  <baseURL>https://riunet.upv.es/oai/request</baseURL>
  <identifier>oai:riunet.upv.es:10251/31652</identifier>
  <datestamp>2020-05-22T09:32:33Z</datestamp>
  <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
 </originDescription>
</provenance>

</about></record>
  <record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Sensores químicos cromogénicos y fluorogénicos para la detección de cationes y aniones</dc:title>
<dc:creator>Ábalos Aguado, Tatiana</dc:creator>
<dc:contributor>Martínez Mañez, Ramón</dc:contributor>
<dc:contributor>Sancenón Galarza, Félix</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Departamento de Química - Departament de Química</dc:contributor>
<dc:subject>Sensores cromogénicos</dc:subject>
<dc:subject>Sensores fluorogénicos</dc:subject>
<dc:subject>Cationes</dc:subject>
<dc:subject>Aniones</dc:subject>
<dc:subject>Química supramolecular</dc:subject>
<dc:subject>QUIMICA INORGANICA</dc:subject>
<dc:subject>QUIMICA ORGANICA</dc:subject>
<dc:date>2013-10-07</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:type>info:eu-repo/semantics/acceptedVersion</dc:type>
<dc:identifier>http://hdl.handle.net/10251/32667</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/32667</dc:identifier>
<dc:language>spa</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>

</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance 
http://www.openarchives.org/OAI/2.0/provenance.xsd">
 <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
  <baseURL>https://riunet.upv.es/oai/request</baseURL>
  <identifier>oai:riunet.upv.es:10251/32667</identifier>
  <datestamp>2020-05-22T10:52:59Z</datestamp>
  <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
 </originDescription>
</provenance>

</about></record>
  <record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Comparison of vacuum treatments and traditional cooking in vegetables using instrumental and sensory analysis</dc:title>
<dc:creator>Iborra Bernad, María del Consuelo</dc:creator>
<dc:contributor>García Segovia, Purificación</dc:contributor>
<dc:contributor>Martínez Monzó, Javier</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Departamento de Tecnología de Alimentos - Departament de Tecnologia d'Aliments</dc:contributor>
<dc:subject>Instrumental texture</dc:subject>
<dc:subject>Puncture test</dc:subject>
<dc:subject>Kramer cell test</dc:subject>
<dc:subject>Texture Profile Analysis</dc:subject>
<dc:subject>Color</dc:subject>
<dc:subject>Antioxidants</dc:subject>
<dc:subject>Anthocyanins</dc:subject>
<dc:subject>Carotenes</dc:subject>
<dc:subject>Ascorbic acid</dc:subject>
<dc:subject>Microstructure</dc:subject>
<dc:subject>Cooking treatment</dc:subject>
<dc:subject>Response Surface Methodology</dc:subject>
<dc:subject>Optimization</dc:subject>
<dc:subject>Sensory Analysis</dc:subject>
<dc:subject>Ranking test</dc:subject>
<dc:subject>Paired test</dc:subject>
<dc:subject>Just About Right</dc:subject>
<dc:subject>Flash Profile</dc:subject>
<dc:subject>Vacuum cooking</dc:subject>
<dc:subject>Sous-vide</dc:subject>
<dc:subject>Cook-vide</dc:subject>
<dc:subject>Vegetables</dc:subject>
<dc:subject>Purple-flesh potatoes</dc:subject>
<dc:subject>Carrots</dc:subject>
<dc:subject>Green beans</dc:subject>
<dc:subject>Red cabbage.</dc:subject>
<dc:subject>TECNOLOGIA DE ALIMENTOS</dc:subject>
<dc:description>Alfresco</dc:description>
<dc:date>2013-10-21</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:type>info:eu-repo/semantics/acceptedVersion</dc:type>
<dc:identifier>http://hdl.handle.net/10251/32953</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/32953</dc:identifier>
<dc:language>eng</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>

</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance 
http://www.openarchives.org/OAI/2.0/provenance.xsd">
 <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
  <baseURL>https://riunet.upv.es/oai/request</baseURL>
  <identifier>oai:riunet.upv.es:10251/32953</identifier>
  <datestamp>2020-05-22T09:18:49Z</datestamp>
  <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
 </originDescription>
</provenance>

</about></record>
  <record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Anàlisi del discurs de la informàtica: aplicació a l'estudi de la descripció</dc:title>
<dc:creator>Montesinos López, Anna Isabel</dc:creator>
<dc:contributor>SALVADOR LIERN, VICENT MANUEL</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Departamento de Lingüística Aplicada - Departament de Lingüística Aplicada</dc:contributor>
<dc:subject>Discurso</dc:subject>
<dc:subject>Informática</dc:subject>
<dc:subject>FILOLOGIA CATALANA</dc:subject>
<dc:date>2015-11-03</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:identifier>http://hdl.handle.net/10251/56906</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/56906</dc:identifier>
<dc:language>cat</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>

</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance 
http://www.openarchives.org/OAI/2.0/provenance.xsd">
 <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
  <baseURL>https://riunet.upv.es/oai/request</baseURL>
  <identifier>oai:riunet.upv.es:10251/56906</identifier>
  <datestamp>2020-05-22T07:41:11Z</datestamp>
  <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
 </originDescription>
</provenance>

</about></record>
  <record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Herramientas para la generación y evaluación ex-ante de modelos de negocio.</dc:title>
<dc:creator>Mateu Céspedes, José María</dc:creator>
<dc:contributor>March Chordà, Isidre</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Departamento de Ingeniería e Infraestructura de los Transportes - Departament d'Enginyeria i Infraestructura dels Transports</dc:contributor>
<dc:subject>Modelos de negocio</dc:subject>
<dc:subject>Evaluación ex-ante</dc:subject>
<dc:subject>INGENIERIA E INFRAESTRUCTURA DE LOS TRANSPORTES</dc:subject>
<dc:date>2015-11-10</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:identifier>http://hdl.handle.net/10251/57282</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/57282</dc:identifier>
<dc:language>spa</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>

</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance 
http://www.openarchives.org/OAI/2.0/provenance.xsd">
 <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
  <baseURL>https://riunet.upv.es/oai/request</baseURL>
  <identifier>oai:riunet.upv.es:10251/57282</identifier>
  <datestamp>2020-05-22T10:29:52Z</datestamp>
  <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
 </originDescription>
</provenance>

</about></record>
<resumptionToken completeListSize="7353566" cursor="7334186">2020-05-29T15:07:21Z!2037-01-01T00:00:00Z!!oai_dc!7335298!7353566!oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:34876</resumptionToken> </ListRecords>
</OAI-PMH>

1 answer
User
Answer 1 · posted 2024-05-12 20:50:46

This script iterates over every XML file (*.xml) in the current directory and extracts the first <identifier> under each <record> tag:

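import csv
import glob
from bs4 import BeautifulSoup

all_data = []
for filename in glob.glob(r'*.xml'):
    # the sample file declares UTF-8, so read it as UTF-8 regardless of the OS default
    with open(filename, 'r', encoding='utf-8') as f_in:
        soup = BeautifulSoup(f_in.read(), 'html.parser')
    print(filename)
    # match an <identifier> inside <record> that is the first child of its parent,
    # i.e. the one in <header>; the one in <originDescription> comes after <baseURL>
    for i in soup.select('record identifier:nth-child(1)'):
        print(i)
        all_data.append([filename, i.get_text(strip=True)])

# write to csv file: one row per identifier, with the source file name first
with open('data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        csv_writer.writerow(row)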
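
If installing BeautifulSoup is not an option, roughly the same result can be had with the standard library alone. The following is only a sketch, assuming every input file follows the OAI-PMH layout of the sample in the question (records in the http://www.openarchives.org/OAI/2.0/ namespace, with the wanted identifier inside <header>). The function names header_identifiers and xml_files_to_csv are made up for illustration.

```python
import csv
import glob
import xml.etree.ElementTree as ET

# Default namespace declared on the <OAI-PMH> root element of the sample file.
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def header_identifiers(xml_data):
    """Return the <header>/<identifier> text of every <record>, in document order."""
    root = ET.fromstring(xml_data)
    found = []
    for record in root.iterfind(".//oai:record", NS):
        # Only the identifier directly under <header> is looked up; the ones in
        # <metadata> and <about> live in other namespaces and are never matched.
        ident = record.find("oai:header/oai:identifier", NS)
        if ident is not None and ident.text:
            found.append(ident.text.strip())
    return found


def xml_files_to_csv(pattern, out_path):
    """Write one (filename, identifier) row per record for all files matching pattern."""
    with open(out_path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        for filename in sorted(glob.glob(pattern)):
            # Read bytes so the parser honours the encoding in the XML declaration.
            with open(filename, "rb") as f_in:
                for ident in header_identifiers(f_in.read()):
                    writer.writerow([filename, ident])
```

Calling xml_files_to_csv("*.xml", "data.csv") in the folder that holds the XML files should produce the same two-column layout as the script above. Because the lookup names the path explicitly, it does not depend on <identifier> being the first child of its parent.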

Prints (for example):

a1.xml
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>
a2.xml
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652xxx</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667xxx</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>

It also saves data.csv, with the file name in the first column and the identifier text in the second.
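
To check the saved file without opening a spreadsheet, the CSV can be read back and summarised. This is a small sketch, assuming data.csv has the two-column layout written by the script above (file name, then identifier) and is readable as UTF-8. The helper name rows_per_file is hypothetical.

```python
import csv
from collections import Counter


def rows_per_file(csv_path):
    """Count how many identifier rows each source XML file contributed."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as csvfile:
        for row in csv.reader(csvfile):
            if row:  # ignore completely empty lines
                counts[row[0]] += 1
    return counts
```

For the run printed above, rows_per_file("data.csv") would be expected to report five rows each for a1.xml and a2.xml, one per <record>.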
