Processing child tags in an XML file, or making the number of child tags consistent

<dblp>
<incollection>
<author>Philippe Balbiani</author>
<author>Valentin Goranko</author>
<author>Ruaan Kellerman</author>
<booktitle>Handbook of Spatial Logics</booktitle>
</incollection>
<incollection>
<author>Jochen Renz</author>
<author>Bernhard Nebel</author>
<booktitle>Handbook of AI</booktitle>
</incollection>
...
</dblp>

soup = BeautifulSoup(str(getfile()), 'lxml')
res = soup.find_all('incollection')
author = []
booktitle = []
for each in res:
    for child in each.children:
        if child.name == 'author':
            author.append(child.text)
        elif child.name == 'booktitle':
            booktitle.append(child.text)
elem_dic = tuple(zip(author, booktitle))

('Philippe Balbiani', 'Handbook of Spatial Logics')
('Valentin Goranko', 'Handbook of Spatial Logics')
('Ruaan Kellerman', 'Handbook of Spatial Logics')
('Jochen Renz', 'Handbook of AI')
('Bernhard Nebel', 'Handbook of AI')
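
For context, here is a small sketch of why the snippet above cannot produce that output. The two lists are written out by hand as the loop would build them for the sample XML (an assumption, since `getfile()` is not shown). `zip` pairs items by position and stops at the shorter list, so only two pairs come out and the second one is wrong:

```python
# Lists as the question's loop would build them for the sample XML
# (hand-written here; getfile() from the question is not available).
author = [
    "Philippe Balbiani", "Valentin Goranko", "Ruaan Kellerman",
    "Jochen Renz", "Bernhard Nebel",
]
booktitle = ["Handbook of Spatial Logics", "Handbook of AI"]

# zip pairs by position and truncates to the shorter input.
pairs = tuple(zip(author, booktitle))
print(pairs)
# (('Philippe Balbiani', 'Handbook of Spatial Logics'),
#  ('Valentin Goranko', 'Handbook of AI'))
```

Each booktitle therefore has to be repeated once per author of the same entry before pairing.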

1 Answer

User

#1 · Posted 2024-05-29 11:15:31

Assuming Beautiful Soup 4.7+

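This is actually quite easy to do. In this example I use selectors (I know selectors are usually associated with HTML, but you can use them on XML for tasks like this). The selector matches every tag that is a direct child (>) of an incollection tag and is either author or booktitle (:is(author, booktitle)). That gives us only the tags we care about. We then collect authors until we reach a booktitle, at which point we create the entries for that book. After that we reset and collect the information for the next book: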

from bs4 import BeautifulSoup

markup = """
<dblp>
<incollection> 
<author>Philippe Balbiani</author> 
<author>Valentin Goranko</author> 
<author>Ruaan Kellerman</author> 
<booktitle>Handbook of Spatial Logics</booktitle> 
</incollection>
<incollection> 
<author>Jochen Renz</author> 
<author>Bernhard Nebel</author> 
<booktitle>Handbook of AI</booktitle> 
</incollection>
</dblp>
"""

author = []
elem_dic = []

soup = BeautifulSoup(markup, 'xml')
for child in soup.select('incollection > :is(author,booktitle)'):
    if child.name == 'author':
        author.append(child.text)
    else:
        elem_dic.extend(zip(author, [child.text] * len(author)))
        author = []

print(tuple(elem_dic))

Output

(('Philippe Balbiani', 'Handbook of Spatial Logics'), ('Valentin Goranko', 'Handbook of Spatial Logics'), ('Ruaan Kellerman', 'Handbook of Spatial Logics'), ('Jochen Renz', 'Handbook of AI'), ('Bernhard Nebel', 'Handbook of AI'))

You don't have to use selectors, though:

from bs4 import BeautifulSoup, Tag

markup = """
<dblp>
<incollection> 
<author>Philippe Balbiani</author> 
<author>Valentin Goranko</author> 
<author>Ruaan Kellerman</author> 
<booktitle>Handbook of Spatial Logics</booktitle> 
</incollection>
<incollection> 
<author>Jochen Renz</author> 
<author>Bernhard Nebel</author> 
<booktitle>Handbook of AI</booktitle> 
</incollection>
</dblp>
"""

author = []
elem_dic = []

soup = BeautifulSoup(markup, 'xml')
res = soup.find_all('incollection')
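for each in res:
    for child in each.children:
        # Skip whitespace and other non-tag nodes between elements.
        if not isinstance(child, Tag):
            continue
        if child.name == 'author':
            author.append(child.text)
        elif child.name == 'booktitle':
            # Pair every collected author with this title, then reset.
            # Other child tags (the "..." in the question) are ignored.
            elem_dic.extend(zip(author, [child.text] * len(author)))
            author = []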

print(tuple(elem_dic))
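
Both versions above carry the `author` list from one loop iteration to the next, so an entry with no booktitle would leak its authors into the following entry. As a hedged alternative, the sketch below handles each `incollection` on its own. It swaps in the standard library's `xml.etree.ElementTree` instead of Beautiful Soup, and the function name `author_title_pairs` is made up for illustration:

```python
import xml.etree.ElementTree as ET

markup = """
<dblp>
<incollection>
<author>Philippe Balbiani</author>
<author>Valentin Goranko</author>
<author>Ruaan Kellerman</author>
<booktitle>Handbook of Spatial Logics</booktitle>
</incollection>
<incollection>
<author>Jochen Renz</author>
<author>Bernhard Nebel</author>
<booktitle>Handbook of AI</booktitle>
</incollection>
</dblp>
"""


def author_title_pairs(xml_text):
    """Return one (author, booktitle) tuple per author of each entry."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    for entry in root.iter("incollection"):
        title = entry.findtext("booktitle")
        if title is None:
            continue  # assumption: entries without a booktitle are skipped
        # findall with a plain tag name matches direct children only.
        for author in entry.findall("author"):
            pairs.append((author.text, title))
    return tuple(pairs)


pairs = author_title_pairs(markup)
print(pairs)
```

Because the title is looked up per entry, the order of author and booktitle tags inside an entry no longer matters.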
