Processing child tags in an XML file, or making the number of child tags match

Posted 2024-05-29 11:15:31


The XML file to be processed looks like this:

<dblp>
<incollection> 
<author>Philippe Balbiani</author> 
<author>Valentin Goranko</author> 
<author>Ruaan Kellerman</author> 
<booktitle>Handbook of Spatial Logics</booktitle> 
</incollection>
<incollection> 
<author>Jochen Renz</author> 
<author>Bernhard Nebel</author> 
<booktitle>Handbook of AI</booktitle> 
</incollection>
...
</dblp>

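The format is as shown above. I want to extract the content of the "author" tags and the "booktitle" tags. Both sit inside "incollection" tags. Each "incollection" tag holds several "author" tags and one "booktitle" tag, and I want to iterate over every "incollection" and build the corresponding (author, booktitle) tuples.

My code: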

soup = BeautifulSoup(str(getfile()), 'lxml')
res = soup.find_all('incollection')
author = []
booktitle =[]

for each in res:
    for child in each.children:
          if child.name == 'author':
                author.append(child.text)
          elif child.name == 'booktitle': 
                booktitle.append(child.text)
elem_dic = tuple(zip(author, booktitle))

The result I get is:

('Philippe Balbiani', 'Handbook of Spatial Logics')
('Valentin Goranko', 'Handbook of Spatial Logics')
('Ruaan Kellerman', 'Handbook of Spatial Logics')

How can I modify it to get the desired result, shown below?

('Philippe Balbiani', 'Handbook of Spatial Logics')
('Valentin Goranko', 'Handbook of Spatial Logics')
('Ruaan Kellerman', 'Handbook of Spatial Logics')
('Jochen Renz', 'Handbook of AI')
('Bernhard Nebel', 'Handbook of AI')

Alternatively, is there a way to repeat the "booktitle" tag inside each "incollection" tag so that there are as many of them as there are "author" tags?


1 answer
User
#1 · posted 2024-05-29 11:15:31

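Assuming Beautiful Soup 4.7+

This is actually quite easy to do. In this example I use selectors (I know selectors are usually associated with HTML, but you can use them with XML for tasks like this). Here we say we want every tag that is a direct child (>) of an incollection tag and is either author or booktitle: incollection > :is(author, booktitle). This gives us only the tags we are interested in. Then we simply collect authors until we see a book title, and create the entries for that book. After that, we reset and collect the information for the next book: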

from bs4 import BeautifulSoup

markup = """
<dblp>
<incollection> 
<author>Philippe Balbiani</author> 
<author>Valentin Goranko</author> 
<author>Ruaan Kellerman</author> 
<booktitle>Handbook of Spatial Logics</booktitle> 
</incollection>
<incollection> 
<author>Jochen Renz</author> 
<author>Bernhard Nebel</author> 
<booktitle>Handbook of AI</booktitle> 
</incollection>
</dblp>
"""

author = []
elem_dic = []

soup = BeautifulSoup(markup, 'xml')
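# Select only the author/booktitle tags that are direct children of incollection.
for child in soup.select('incollection > :is(author,booktitle)'):
    if child.name == 'author':
        author.append(child.text)
    else:
        # A booktitle closes the current book: pair it with every collected author.
        elem_dic.extend(zip(author, [child.text] * len(author)))
        # Reset for the next incollection.
        author = []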

print(tuple(elem_dic))

Output

(('Philippe Balbiani', 'Handbook of Spatial Logics'), ('Valentin Goranko', 'Handbook of Spatial Logics'), ('Ruaan Kellerman', 'Handbook of Spatial Logics'), ('Jochen Renz', 'Handbook of AI'), ('Bernhard Nebel', 'Handbook of AI'))
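
The step that makes the counts match is `zip(author, [child.text] * len(author))`: the single book title is repeated once per collected author before zipping. A minimal sketch of just that step with plain lists (the values are copied from the sample XML; the variable names `authors`, `title`, `pairs` and `truncated` are illustrative only):

```python
# Authors collected from one <incollection>, plus its single book title
# (values copied from the sample XML above; names are illustrative).
authors = ['Jochen Renz', 'Bernhard Nebel']
title = 'Handbook of AI'

# Repeat the title once per author so both sequences have equal length.
titles = [title] * len(authors)
pairs = list(zip(authors, titles))
print(pairs)

# By contrast, zip stops at the shorter input, so zipping against a
# one-element list keeps only the first author.
truncated = list(zip(authors, [title]))
print(truncated)
```

Because `zip` stops at the shorter input, zipping a flat list of all authors against a flat list of all book titles cannot yield one tuple per author, which is why the pairing has to happen per book.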

You don't have to use selectors, though:

from bs4 import BeautifulSoup, Tag

markup = """
<dblp>
<incollection> 
<author>Philippe Balbiani</author> 
<author>Valentin Goranko</author> 
<author>Ruaan Kellerman</author> 
<booktitle>Handbook of Spatial Logics</booktitle> 
</incollection>
<incollection> 
<author>Jochen Renz</author> 
<author>Bernhard Nebel</author> 
<booktitle>Handbook of AI</booktitle> 
</incollection>
</dblp>
"""

author = []
elem_dic = []

soup = BeautifulSoup(markup, 'xml')
res = soup.find_all('incollection')
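for each in res:
    for child in each.children:
        # Skip whitespace and other text nodes between the tags.
        if not isinstance(child, Tag):
            continue
        if child.name == 'author':
            author.append(child.text)
        else:
            # A booktitle closes the current book: pair it with every collected author.
            elem_dic.extend(zip(author, [child.text] * len(author)))
            author = []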

print(tuple(elem_dic))
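
If Beautiful Soup is not a requirement, the same pairing can probably be done with the standard library's `xml.etree.ElementTree`. This is a different technique from the two shown above and is not part of the original answer. It is a sketch that assumes every incollection has exactly one booktitle, as in the sample, and it looks the title up directly rather than collecting authors until a title appears:

```python
import xml.etree.ElementTree as ET

# Same sample data as above.
markup = """<dblp>
<incollection>
<author>Philippe Balbiani</author>
<author>Valentin Goranko</author>
<author>Ruaan Kellerman</author>
<booktitle>Handbook of Spatial Logics</booktitle>
</incollection>
<incollection>
<author>Jochen Renz</author>
<author>Bernhard Nebel</author>
<booktitle>Handbook of AI</booktitle>
</incollection>
</dblp>"""

root = ET.fromstring(markup)
elem_dic = []

for incollection in root.iter('incollection'):
    # Assumption: one booktitle per incollection; findtext returns the
    # text of the first direct child named 'booktitle'.
    title = incollection.findtext('booktitle')
    # Pair every direct-child author with that title.
    for author in incollection.findall('author'):
        elem_dic.append((author.text, title))

print(tuple(elem_dic))
```

Since the title is looked up per incollection, this variant does not depend on booktitle appearing after the author tags.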
