Trouble using Python to parse and manipulate an XML file that pulls in the contents of external text files

Posted 2024-04-29 14:58:06


I've spent months trying to accomplish the following task.

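I have an Excel-generated XML file that captures a database I've been building a website around. My dream is to import this XML file, in this form or some manipulated form, into WordPress so that I no longer have to manually edit each post or page one at a time (especially when I make a change that affects the content of several or all of the pages/posts on the site).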

The XML file from Excel (which I'm calling "test_of_2016-09-19.xml") looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Row>
        <Entry_No>1</Entry_No>
        <Waterfall_Name>Bridalveil Fall</Waterfall_Name>
        <Continent___Super_Region>North America</Continent___Super_Region>
        <Country>USA</Country>
        <State___Province>California</State___Province>
        <Subregion>Southern and Central Sierras</Subregion>
        <locale___political_or_official>Mariposa County</locale___political_or_official>
        <alt__locale__unofficial_or_more_recognized>Yosemite National Park</alt__locale__unofficial_or_more_recognized>
        <Misc__Tags>oakhurst, el portal, mariposa, yosemite, yosemite valley, sierra, california, waterfall, fresno, modesto, wawona, tunnel, merced, pohono, wheelchair</Misc__Tags>
        <scenic_rating>4.5</scenic_rating>
        <difficulty_rating>1</difficulty_rating>
        <distance>roadside; 1/2 mile round trip to base; wheelchair</distance>
        <time_commitment>20 minutes</time_commitment>
        <GPS_Coordinates>37.71736, -119.64901</GPS_Coordinates>
        <date_first_visited>1999-09-04</date_first_visited>
        <date_last_visited>2011-06-04</date_last_visited>
        <Old_Web_Address>http://www.world-of-waterfalls.com/yosemite-bridalveil-fall.html</Old_Web_Address>
        <Post_Slug>yosemite-bridalveil-fall.html</Post_Slug>
    </Row>
    <Row>
        <Entry_No>52</Entry_No>
        <Waterfall_Name>Switzer Falls</Waterfall_Name>
        <Continent___Super_Region>North America</Continent___Super_Region>
        <Country>USA</Country>
        <State___Province>California</State___Province>
        <Subregion>Southern California</Subregion>
        <locale___political_or_official>Los Angeles County</locale___political_or_official>
        <alt__locale__unofficial_or_more_recognized>Angeles National Forest, La Canada Flintridge</alt__locale__unofficial_or_more_recognized>
        <Misc__Tags>la canada, flintridge, altadena, pasadena, san gabriel, angeles national forest, angeles crest, los angeles, southern california, california, waterfall, arroyo seco, gabrielino trail, clear creek station, adventure pass, picnic</Misc__Tags>
        <scenic_rating>2</scenic_rating>
        <difficulty_rating>3.5</difficulty_rating>
        <distance>4.6 miles round trip (to base of main drop)</distance>
        <time_commitment>3.5 hours (to base of main drop)</time_commitment>
        <GPS_Coordinates>34.25828, -118.15474</GPS_Coordinates>
        <date_first_visited>2003-02-02</date_first_visited>
        <date_last_visited>2016-04-23</date_last_visited>
        <Old_Web_Address>http://www.world-of-waterfalls.com/california-switzer-falls.html</Old_Web_Address>
        <File_directory>./waterfall_writeups/52_Switzer_Falls/</File_directory>
        <Introduction>introduction-switzer-falls.html</Introduction>
        <Directions>directions-switzer-falls.html</Directions>
        <Nearby_Waterfalls_Tags>southern california, pasadena, angeles crest, waterfall</Nearby_Waterfalls_Tags>
        <Itinerary_Tags>itinerary, switzer falls</Itinerary_Tags>
        <Trip_Report_Tags>trip report, switzer falls</Trip_Report_Tags>
        <Trip_Planning_Article_Tags>featured article, switzer falls</Trip_Planning_Article_Tags>
        <Post_Slug>california-switzer-falls.html</Post_Slug>
    </Row>
    <Row>
        <Entry_No>657</Entry_No>
        <Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
        <Continent___Super_Region>Asia</Continent___Super_Region>
        <Country>China</Country>
        <locale___political_or_official>Guangxi</locale___political_or_official>
        <alt__locale__unofficial_or_more_recognized>Daxin County</alt__locale__unofficial_or_more_recognized>
        <Misc__Tags>daxin, guichin river, guangxi, vietnam, china, waterfall, ban gioc, transnational, border</Misc__Tags>
        <scenic_rating>4</scenic_rating>
        <difficulty_rating>1.5</difficulty_rating>
        <distance>1km round trip</distance>
        <time_commitment>30-45 minutes</time_commitment>
        <GPS_Coordinates>22.85577, 106.72273</GPS_Coordinates>
        <date_first_visited>2009-04-23</date_first_visited>
        <date_last_visited>2009-04-23</date_last_visited>
        <Old_Web_Address>http://www.world-of-waterfalls.com/asia-detian-waterfall.html</Old_Web_Address>
        <Post_Slug>asia-detian-waterfall.html</Post_Slug>
    </Row>
    <Row>
        <Entry_No>1125</Entry_No>
    </Row>
    <Row>
        <Entry_No>1126</Entry_No>
    </Row>
    <Row>
        <Entry_No>1127</Entry_No>
    </Row>
</Root>

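What I'd like to do is this: if specific elements or tags exist (specifically File_directory, Introduction, and Directions), open the files they point to, grab their text contents, place that text in new elements or tags (such as Introduction_Body and Directions_Body), and then write out the new, modified XML file.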
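As a rough sketch of the "if specific elements exist" check, something like the following might work. The names `POINTER_TAGS` and `linked_files` are made up for illustration, and the sketch assumes the file name in Introduction or Directions is always relative to File_directory:

```python
import os
import xml.etree.ElementTree as ET

# Hypothetical name: the elements whose text is a file name.
POINTER_TAGS = ('Introduction', 'Directions')


def linked_files(row):
    """Map each pointer tag present in the row to the path it refers to.

    Returns an empty dict when <File_directory> is missing or empty, so
    rows without write-ups (such as Entry_No 1 or 1125) are skipped.
    """
    directory = row.findtext('File_directory')
    if not directory:
        return {}
    paths = {}
    for tag in POINTER_TAGS:
        filename = row.findtext(tag)
        if filename:
            # Assumption: the file name is relative to File_directory.
            paths[tag] = os.path.join(directory, filename)
    return paths
```

`findtext` returns None when the element is absent and an empty string when it has no text, so a single `if not ...` test covers both cases.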

The new XML file would look like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Row>
        <Entry_No>1</Entry_No>
        <Waterfall_Name>Bridalveil Fall</Waterfall_Name>
        <Continent___Super_Region>North America</Continent___Super_Region>
        <Country>USA</Country>
        <State___Province>California</State___Province>
        <Subregion>Southern and Central Sierras</Subregion>
        <locale___political_or_official>Mariposa County</locale___political_or_official>
        <alt__locale__unofficial_or_more_recognized>Yosemite National Park</alt__locale__unofficial_or_more_recognized>
        <Misc__Tags>oakhurst, el portal, mariposa, yosemite, yosemite valley, sierra, california, waterfall, fresno, modesto, wawona, tunnel, merced, pohono, wheelchair</Misc__Tags>
        <scenic_rating>4.5</scenic_rating>
        <difficulty_rating>1</difficulty_rating>
        <distance>roadside; 1/2 mile round trip to base; wheelchair</distance>
        <time_commitment>20 minutes</time_commitment>
        <GPS_Coordinates>37.71736, -119.64901</GPS_Coordinates>
        <date_first_visited>1999-09-04</date_first_visited>
        <date_last_visited>2011-06-04</date_last_visited>
        <Old_Web_Address>http://www.world-of-waterfalls.com/yosemite-bridalveil-fall.html</Old_Web_Address>
        <Post_Slug>yosemite-bridalveil-fall.html</Post_Slug>
    </Row>
    <Row>
        <Entry_No>52</Entry_No>
        <Waterfall_Name>Switzer Falls</Waterfall_Name>
        <Continent___Super_Region>North America</Continent___Super_Region>
        <Country>USA</Country>
        <State___Province>California</State___Province>
        <Subregion>Southern California</Subregion>
        <locale___political_or_official>Los Angeles County</locale___political_or_official>
        <alt__locale__unofficial_or_more_recognized>Angeles National Forest, La Canada Flintridge</alt__locale__unofficial_or_more_recognized>
        <Misc__Tags>la canada, flintridge, altadena, pasadena, san gabriel, angeles national forest, angeles crest, los angeles, southern california, california, waterfall, arroyo seco, gabrielino trail, clear creek station, adventure pass, picnic</Misc__Tags>
        <scenic_rating>2</scenic_rating>
        <difficulty_rating>3.5</difficulty_rating>
        <distance>4.6 miles round trip (to base of main drop)</distance>
        <time_commitment>3.5 hours (to base of main drop)</time_commitment>
        <GPS_Coordinates>34.25828, -118.15474</GPS_Coordinates>
        <date_first_visited>2003-02-02</date_first_visited>
        <date_last_visited>2016-04-23</date_last_visited>
        <Old_Web_Address>http://www.world-of-waterfalls.com/california-switzer-falls.html</Old_Web_Address>
        <File_directory>./waterfall_writeups/52_Switzer_Falls/</File_directory>
        <Introduction>introduction-switzer-falls.html</Introduction>
        <Introduction_Body>This would be text from the file ./waterfall_writeups/52_Switzer_Falls/introduction-switzer-falls.html complete with links, img tags, and other lorem ipsum; would I need to do anything special for special characters like Chinese and Japanese Characters, accent markings, etc?</Introduction_Body>
        <Directions>directions-switzer-falls.html</Directions>
        <Directions_Body>This would be text from the file ./waterfall_writeups/52_Switzer_Falls/directions-switzer-falls.html complete with links, img tags, and other lorem ipsum; would I need to do anything special for special characters like Chinese and Japanese Characters, accent markings, etc?</Directions_Body>
        <Nearby_Waterfalls_Tags>southern california, pasadena, angeles crest, waterfall</Nearby_Waterfalls_Tags>
        <Itinerary_Tags>itinerary, switzer falls</Itinerary_Tags>
        <Trip_Report_Tags>trip report, switzer falls</Trip_Report_Tags>
        <Trip_Planning_Article_Tags>featured article, switzer falls</Trip_Planning_Article_Tags>
        <Post_Slug>california-switzer-falls.html</Post_Slug>
    </Row>
    <Row>
        <Entry_No>657</Entry_No>
        <Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
        <Continent___Super_Region>Asia</Continent___Super_Region>
        <Country>China</Country>
        <locale___political_or_official>Guangxi</locale___political_or_official>
        <alt__locale__unofficial_or_more_recognized>Daxin County</alt__locale__unofficial_or_more_recognized>
        <Misc__Tags>daxin, guichin river, guangxi, vietnam, china, waterfall, ban gioc, transnational, border</Misc__Tags>
        <scenic_rating>4</scenic_rating>
        <difficulty_rating>1.5</difficulty_rating>
        <distance>1km round trip</distance>
        <time_commitment>30-45 minutes</time_commitment>
        <GPS_Coordinates>22.85577, 106.72273</GPS_Coordinates>
        <date_first_visited>2009-04-23</date_first_visited>
        <date_last_visited>2009-04-23</date_last_visited>
        <Old_Web_Address>http://www.world-of-waterfalls.com/asia-detian-waterfall.html</Old_Web_Address>
        <Post_Slug>asia-detian-waterfall.html</Post_Slug>
    </Row>
    <Row>
        <Entry_No>1125</Entry_No>
    </Row>
    <Row>
        <Entry_No>1126</Entry_No>
    </Row>
    <Row>
        <Entry_No>1127</Entry_No>
    </Row>
</Root>
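On the special-characters question raised inside Introduction_Body and Directions_Body above, a small experiment suggests nothing special is needed when the file contents are assigned as plain element text. The sketch below uses ElementTree; the sample string and the `x.html` link are placeholders, not real content from the write-ups:

```python
import xml.etree.ElementTree as ET

row = ET.Element('Row')
body = ET.SubElement(row, 'Introduction_Body')
# Raw HTML and non-ASCII characters, assigned as ordinary element text.
body.text = '<a href="x.html">德天瀑布 Détiān</a> & more'

# Serialising escapes the markup characters (<, >, &) automatically;
# encoding='unicode' returns a str instead of bytes.
serialised = ET.tostring(row, encoding='unicode')

# Parsing the serialised form gives back exactly the text that was assigned.
restored = ET.fromstring(serialised).findtext('Introduction_Body')
```

One thing to watch when writing the file: `tree.write()` defaults to US-ASCII and emits numeric character references for anything outside that range. Passing `encoding='utf-8'` and `xml_declaration=True` keeps the Chinese and accented characters readable in the output.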

With guidance from some folks on this forum, I was at least able to get as far as the following code:

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
import os

data_file = 'test_of_2016-09-19.xml'

tree = ET.ElementTree(file=data_file)
root = tree.getroot()

for element in root:
    if element.find('File_directory') is not None: 
        directory = element.find('File_directory').text
    if element.find('Introduction') is not None:
        introduction = element.find('Introduction').text
    if element.find('Directions') is not None:
        directions = element.find('Directions').text

    #The following code was suggested to me, but I'm having trouble getting them to work and understanding what each line is doing
    intro_tree = ET.ElementTree(directory+introduction) #throws NameError: name 'ET' is not defined
    intro_text = intro_tree.find('body').text #won't work since intro_tree not defined, but even then, I'm not sure what this line is trying to do
    intro = SubElement(element,'Introduction') #throws NameError: name 'SubElement' is not defined
    intro.text = intro_text #didn't get this far, but what is the intent of this line?
    # Do the same for Directions
    directions_tree = ET.ElementTree(directory+directions)
    directions_text = directions_tree.find('body').text
    directions = SubElement(element,'Direction')

# After the loop, write the file back with new elements added
tree.write('new_' + data_file)
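For comparison, here is one way the loop above might be made to work. This is a sketch under several assumptions, not a definitive fix: `add_bodies` and `convert` are hypothetical names, the write-up files are assumed to be UTF-8, and their contents are read as raw text rather than parsed. It avoids three likely problems in the suggested lines. First, `ET.ElementTree` takes an element as its first positional argument, so a path has to go through `file=` or `ET.parse`, and HTML fragments are often not well-formed XML anyway. Second, `SubElement` has to be called as `ET.SubElement` unless it is imported by name, and it always appends at the end of the row. Third, `directory` is never set for rows that lack File_directory, so it is either undefined or left over from an earlier row.

```python
import os
import xml.etree.ElementTree as ET


def add_bodies(row, tags=('Introduction', 'Directions')):
    """Insert <tag>_Body, holding the referenced file's text, directly
    after each pointer element that is present in the row."""
    directory = row.findtext('File_directory')
    if not directory:
        return  # rows without write-ups are left untouched
    for tag in tags:
        source = row.find(tag)
        if source is None or not source.text:
            continue
        path = os.path.join(directory, source.text)
        # Assumption: files are UTF-8; read as raw text, not parsed as XML.
        with open(path, encoding='utf-8') as handle:
            body = ET.Element(tag + '_Body')
            body.text = handle.read()
        # Reuse the pointer element's trailing whitespace to keep the indentation.
        body.tail = source.tail
        row.insert(list(row).index(source) + 1, body)


def convert(data_file, out_file):
    """Read the Excel export, add the body elements, write a new file."""
    tree = ET.parse(data_file)
    for row in tree.getroot():
        add_bodies(row)
    tree.write(out_file, encoding='utf-8', xml_declaration=True)
```

Calling `convert('test_of_2016-09-19.xml', 'new_test_of_2016-09-19.xml')` from the folder that contains `waterfall_writeups` should then produce the modified file, since the File_directory values are relative paths.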

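Since I'm new to Python, I'm having a lot of difficulty with this seemingly simple task, and I feel clueless about the syntax and about which keywords, methods, or even libraries to use. Is there an easier way? Am I right to use Python with XML, and the ElementTree library, for this job? Or would lxml or minidom be better? I really don't know, and given all the options and my lack of background in Python, all the literature is confusing.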

I'd greatly appreciate help from anyone who can get me past this impasse.

Thanks,
Johnny

