Finding text in namespaced XML elements with lxml

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2002-06-01T19:20:30Z</responseDate>
  <request verb="ListRecords" from="1998-01-15" set="physics:hep" metadataPrefix="oai_rfc1807">
    http://an.oa.org/OAI-script</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:arXiv.org:hep-th/9901001</identifier>
        <datestamp>1999-12-25</datestamp>
        <setSpec>physics:hep</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata>
        <rfc1807 xmlns="http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt"
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xsi:schemaLocation="http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt http://www.openarchives.org/OAI/1.1/rfc1807.xsd">
          <bib-version>v2</bib-version>
          <id>hep-th/9901001</id>
          <entry>January 1, 1999</entry>
          <title>Investigations of Radioactivity</title>
          <author>Ernest Rutherford</author>
          <date>March 30, 1999</date>
        </rfc1807>
      </metadata>
      <about>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/"
                   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                   xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
          <dc:publisher>Los Alamos arXiv</dc:publisher>
          <dc:rights>Metadata may be used without restrictions as long as the oai identifier remains attached to it.</dc:rights>
        </oai_dc:dc>
      </about>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:arXiv.org:hep-th/9901007</identifier>
        <datestamp>1999-12-21</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>

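3 answers

User

Answer 1 · edited 2024-05-23 18:42:32

Disclaimer: I am using the standard-library xml.etree.ElementTree module rather than the lxml library (although, as far as I know, it is a subset of lxml). I am sure there is a much simpler answer than mine that uses lxml and XPath, but I don't know it.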

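The namespace problem

You are right that the namespace is probably the problem. There is no record element in your XML file, but there are two {http://www.openarchives.org/OAI/2.0/}record elements. The following session demonstrates this: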

>>> import xml.etree.ElementTree as etree

>>> xml_string = ...Your XML to parse...
>>> e = etree.fromstring(xml_string)

# Let's see what the root element is
>>> e
<Element {http://www.openarchives.org/OAI/2.0/}OAI-PMH at 7f39ebf54f80>

# Let's see what children there are of the root element
>>> for child in e:
...     print(child)
...
<Element {http://www.openarchives.org/OAI/2.0/}responseDate at 7f39ebf54fc8>
<Element {http://www.openarchives.org/OAI/2.0/}request at 7f39ebf58050>
<Element {http://www.openarchives.org/OAI/2.0/}ListRecords at 7f39ebf58098>

# Finally, let's get the children of the `ListRecords` element
>>> for child in e[-1]:
...     print(child)
... 
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf580e0>
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf58908>

So, for example,

>>> e.find('ListRecords')
returns None, whereas
>>> e.find('{http://www.openarchives.org/OAI/2.0/}ListRecords')
<Element {http://www.openarchives.org/OAI/2.0/}ListRecords at 7f39ebf58098>
returns the ListRecords element.
Note that I am using the find method because the standard-library ElementTree has no xpath method.
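As a side note to the comparison above, find and findall also accept a prefix-to-URI mapping, which avoids spelling out the full {uri}tag form. The following is only a minimal sketch: the XML is a shortened stand-in for the document in the question, and the prefix "oai" and the names XML and NS are illustrative assumptions, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the OAI-PMH response in the question (assumption).
XML = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:arXiv.org:hep-th/9901001</identifier></header></record>
    <record><header status="deleted"><identifier>oai:arXiv.org:hep-th/9901007</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

# The prefix "oai" is an arbitrary local choice; only the URI has to match.
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

root = ET.fromstring(XML)

# An unqualified tag matches nothing, because every element is namespaced.
unqualified = root.find("ListRecords")

# The same lookups with a prefix that is resolved through the NS mapping.
list_records = root.find("oai:ListRecords", NS)
records = list_records.findall("oai:record", NS)

print(unqualified)
print(list_records.tag)
print(len(records))
```

The prefix used in the query does not need to match any prefix in the document; the lookup is done on the namespace URI.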
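Possible solution
One way to solve this is to get the namespace prefix and prepend it to the tag you are looking for. You can use
>>> namespace = e.tag[:e.tag.index('}')+1]
>>> namespace
'{http://www.openarchives.org/OAI/2.0/}'
on the root element e to find the namespace, although I am sure there is a better way to do this.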
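Now we can define functions that extract the tags we want, taking an optional namespace prefix:
def findallNS(element, tag, namespace=None):
    # Prepend the '{uri}' prefix when one is given
    if namespace is not None:
        return element.findall(namespace + tag)
    else:
        return element.findall(tag)

def findNS(element, tag, namespace=None):
    # Same as findallNS, but returns only the first match (or None)
    if namespace is not None:
        return element.find(namespace + tag)
    else:
        return element.find(tag)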
So now we can write:
>>> list_records = findNS(e, 'ListRecords', namespace)
>>> findallNS(list_records, 'record', namespace)
[<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf580e0>,
 <Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf58908>]
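Alternative solution
Another solution might be to write a function that searches for all child tags ending with the tag you are interested in, for example:
def find_child_tags(element, tag):
    return [child for child in element if child.tag.endswith(tag)]
Here you do not need to deal with the namespace at all.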
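One caveat with the endswith approach is that it also matches longer names that merely end in the same letters. A possible variant, sketched here under the assumption that tags are in the usual {uri}localname form, compares the whole local name instead. The helpers local_name and find_child_tags_exact and the urn:example:a namespace are hypothetical names made up for this illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{uri}' from a Clark-notation tag, if present."""
    return tag.rsplit("}", 1)[-1]


def find_child_tags_exact(element, name):
    # Hypothetical variant of find_child_tags: compares the whole local name.
    return [child for child in element if local_name(child.tag) == name]


# Tiny made-up document with a tag that only *ends* in "record".
root = ET.fromstring(
    '<r xmlns="urn:example:a"><record/><subrecord/><record/></r>'
)

exact = find_child_tags_exact(root, "record")
suffix = [child for child in root if child.tag.endswith("record")]

print(len(exact))   # the two <record> children
print(len(suffix))  # also counts <subrecord>
```

Like the original function, this ignores the namespace entirely, so it treats same-named elements from different namespaces as equal.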

User
Answer 2 · edited 2024-05-23 18:42:32

@Chris's answer is very good, and it works with lxml too. Here is another way using lxml (it works the same way with xpath instead of find):
In [37]: xml.find('.//n:record', namespaces={'n': 'http://www.openarchives.org/OAI/2.0/'})
Out[37]: <Element {http://www.openarchives.org/OAI/2.0/}record at 0x2a451e0>
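Since the question is about getting at the text, the same namespaces argument can be combined with findtext. The sketch below uses the standard-library xml.etree.ElementTree so that it runs without extra packages; lxml's find, findall and findtext accept the same namespaces keyword. The XML is a shortened stand-in for the document in the question, and the prefix "rfc" and the names XML, NS and rows are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the OAI-PMH response in the question (assumption).
XML = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:arXiv.org:hep-th/9901001</identifier></header>
      <metadata>
        <rfc1807 xmlns="http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt">
          <title>Investigations of Radioactivity</title>
        </rfc1807>
      </metadata>
    </record>
    <record>
      <header status="deleted"><identifier>oai:arXiv.org:hep-th/9901007</identifier></header>
    </record>
  </ListRecords>
</OAI-PMH>"""

# One prefix per namespace URI; the prefixes themselves are arbitrary.
NS = {
    "n": "http://www.openarchives.org/OAI/2.0/",
    "rfc": "http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt",
}

root = ET.fromstring(XML)
rows = []
for record in root.findall(".//n:record", NS):
    identifier = record.findtext("n:header/n:identifier", namespaces=NS)
    # findtext returns the default (None here) when nothing matches,
    # which is the case for the deleted record without metadata.
    title = record.findtext(".//rfc:title", default=None, namespaces=NS)
    rows.append((identifier, title))

for row in rows:
    print(row)
```

The title element lives in a different namespace from record, which is why the mapping needs a second prefix.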

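User
Answer 3 · edited 2024-05-23 18:42:32

As Chris already mentioned, you can also use lxml and xpath. Since xpath does not allow you to write the namespace name in full, as in {http://www.openarchives.org/OAI/2.0/}record (the so-called "James Clark notation" *), you have to use prefixes and give the xpath engine a prefix-to-namespace-URI mapping.

Example with lxml (assuming you already have the tree object you need):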

nsmap = {'oa':'http://www.openarchives.org/OAI/2.0/', 
         'dc':'http://purl.org/dc/elements/1.1/'}
tree.xpath('//oa:record[descendant::dc:publisher[contains(., "Alamos")]]',
            namespaces=nsmap)

This selects all {http://www.openarchives.org/OAI/2.0/}record elements that have a descendant {http://purl.org/dc/elements/1.1/}publisher element containing the word "Alamos".
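For readers without lxml, the same selection can be approximated with the standard library, whose XPath subset has neither the descendant:: axis nor contains(). The sketch below is an assumed equivalent, not the answer's method: it moves the predicate into Python. The XML is a shortened stand-in for the document in the question, and the names XML, matches and ids are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the OAI-PMH response in the question (assumption).
XML = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:arXiv.org:hep-th/9901001</identifier></header>
      <about>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:publisher>Los Alamos arXiv</dc:publisher>
        </oai_dc:dc>
      </about>
    </record>
    <record>
      <header status="deleted"><identifier>oai:arXiv.org:hep-th/9901007</identifier></header>
    </record>
  </ListRecords>
</OAI-PMH>"""

nsmap = {"oa": "http://www.openarchives.org/OAI/2.0/",
         "dc": "http://purl.org/dc/elements/1.1/"}

root = ET.fromstring(XML)

# Keep each oa:record that has some dc:publisher descendant whose
# concatenated text contains "Alamos" (the role of contains(., "Alamos")).
matches = [
    record
    for record in root.iterfind(".//oa:record", nsmap)
    if any("Alamos" in "".join(pub.itertext())
           for pub in record.iterfind(".//dc:publisher", nsmap))
]

ids = [m.findtext("oa:header/oa:identifier", namespaces=nsmap) for m in matches]
print(ids)
```

With lxml installed, the single xpath call is shorter and is evaluated by the XPath engine itself; this version only trades that for having no third-party dependency.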

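[*] This comes from an article in which James Clark explains XML namespaces. Everyone who is not familiar with namespaces should read it, even though it was written a long time ago.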
