How to remove some XML data with Python and write the result to a new file

Posted 2024-05-23 17:09:22


I have the data structure shown below. The input file is fairly large, so I am trying to find an efficient way to process it.

<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus>

I am also given an input file that contains several names, one per line, for example:

1
3

The segments with those names should be removed. For example, given 1 and 3, the segments named 1 and 3 are deleted:

<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
  </recording>
</corpus>

The code I have so far:

from lxml import etree

with open("g.xml", "r") as xml_file:
    xml_data = xml_file.read()

with open('del_names.txt', 'r') as file:
    list_of_names = file.read().split("\n")

new_xml = xml_data
for each_name in list_of_names:
    print(each_name)
    tree = etree.XML(new_xml.encode())
    find_segments = tree.xpath("*//segment[@name='{}']".format(each_name))
    for each_segment in find_segments:
        each_segment.getparent().remove(each_segment)
    new_xml = str(etree.tostring(tree, pretty_print=True, xml_declaration=True), encoding="utf-8")

print(new_xml)

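The problem is that the code has now been running for two hours without producing any output. I am not sure what an efficient way to do this would be.

How can I achieve this? I also suspect that having two for loops is unnecessary, is that right?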


2 answers

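If your code takes longer than expected, you can always start by adding a few print statements to get a better idea of where the time is going.

For your task, a single loop is enough. Iterate over all the segment elements in the XML file and remove a segment whenever its name appears in del_names.txt. This parses the document once, instead of re-parsing and re-serializing it for every name.

To make the name lookups faster, I converted the list of names to a set: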

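from lxml import etree

# Read the whole XML document into memory.
with open("g.xml", "r") as xml_file:
    xml_data = xml_file.read()
print("read xml data")

# A set gives fast membership tests for the names to delete.
with open('del_names.txt', 'r') as file:
    names_to_delete = set(file.read().split("\n"))
print("read text data")

# Parse once. etree.XML() needs bytes here because the text
# carries an encoding declaration.
tree = etree.XML(xml_data.encode())

# Single pass over the segments, removing those whose name is listed.
for segment in tree.xpath("*//segment"):
    name = segment.attrib.get("name")
    if name in names_to_delete:
        print(f"will delete segment '{name}'")
        segment.getparent().remove(segment)

print(" result ".center(80, "="))

# encoding="unicode" already returns a str, so no str() wrapper is needed.
new_xml = etree.tostring(tree, encoding="unicode", pretty_print=True)
print(new_xml)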

Output:

read xml data
read text data
will delete segment '1'
will delete segment '3'
==================================== result ====================================
<?xml version='1.0' encoding='ASCII'?>
<corpus name="corpus">
    <recording audio="audio.wav" name="first audio">
        <segment name="2" start="2" end="4">
            <orth>some text 2</orth>
        </segment>
    </recording>
</corpus>
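
The code above only prints the result, while the question also asks for it to be written to a new file. The following is a minimal sketch of that step, not the answerer's code: it assumes the standard library's xml.etree.ElementTree in place of lxml, and the function name remove_segments and its three path parameters are hypothetical names chosen for illustration. Because ElementTree elements have no getparent(), the sketch selects the parents of the segments and removes the matching children from each one.

```python
import xml.etree.ElementTree as ET


def remove_segments(xml_path, names_path, out_path):
    """Drop every <segment> whose name is listed in names_path,
    write the remaining document to out_path, and return the
    number of segments removed.

    Hypothetical helper for illustration only.
    """
    with open(names_path, encoding="utf-8") as f:
        # One name per line; blank lines are ignored.
        names_to_delete = {line.strip() for line in f if line.strip()}

    tree = ET.parse(xml_path)
    removed = 0

    # ".//segment/.." selects each element that has a <segment> child.
    # findall() returns a list, so removing children while looping is safe.
    for parent in tree.findall(".//segment/.."):
        for segment in parent.findall("segment"):
            if segment.get("name") in names_to_delete:
                parent.remove(segment)
                removed += 1

    # Write the filtered document, including an XML declaration.
    tree.write(out_path, encoding="utf-8", xml_declaration=True)
    return removed
```

With the files from the question, it could be called as remove_segments("g.xml", "del_names.txt", "g_filtered.xml"), where g_filtered.xml is an assumed output name. The document is still parsed only once and the names are still looked up in a set, so the cost stays roughly proportional to the size of the input.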

You can also use BeautifulSoup:

from bs4 import BeautifulSoup

my_string = """ <?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus> """

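# html.parser is Python's built-in HTML parser; it lowercases tag names,
# which is harmless here because every tag is already lowercase.
soup = BeautifulSoup(my_string, 'html.parser')
names = [1, 3]  # names of the segments to delete

for name in names:
    # Attribute values are strings, so convert before matching.
    elements = soup.find_all("segment", attrs={"name": str(name)})
    for element in elements:
        element.decompose()

print(soup.prettify())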
