Finding text in namespaced XML elements with lxml

<p>I am trying to parse an XML file with lxml.etree and find text inside its elements.</p> <p>The XML file can look like this:</p> <pre><code><?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
         http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2002-06-01T19:20:30Z</responseDate>
  <request verb="ListRecords" from="1998-01-15"
           set="physics:hep"
           metadataPrefix="oai_rfc1807">
           http://an.oa.org/OAI-script</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:arXiv.org:hep-th/9901001</identifier>
        <datestamp>1999-12-25</datestamp>
        <setSpec>physics:hep</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata>
        <rfc1807 xmlns=
            "http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation=
            "http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1807.txt
            http://www.openarchives.org/OAI/1.1/rfc1807.xsd">
          <bib-version>v2</bib-version>
          <id>hep-th/9901001</id>
          <entry>January 1, 1999</entry>
          <title>Investigations of Radioactivity</title>
          <author>Ernest Rutherford</author>
          <date>March 30, 1999</date>
        </rfc1807>
      </metadata>
      <about>
        <oai_dc:dc
            xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
            http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
          <dc:publisher>Los Alamos arXiv</dc:publisher>
          <dc:rights>Metadata may be used without restrictions as long as the oai identifier remains attached to it.</dc:rights>
        </oai_dc:dc>
      </about>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:arXiv.org:hep-th/9901007</identifier>
        <datestamp>1999-12-21</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>
</code></pre> <p>For what follows, assume <code>doc = etree.parse("/tmp/test.xml")</code>, where test.xml contains the XML pasted above.</p> 
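<p>First, I tried to find all the <code><record></code> elements with <code>doc.findall(".//record")</code>, but it returns an empty list.</p> <p>Second, for a given word, I want to check whether it occurs in <code><dc:publisher></code>. To do that I first tried the same approach as before, <code>doc.findall(".//publisher")</code>, and ran into the same problem. I am fairly sure all of this is related to namespaces, but I do not know how to handle them.</p> <p>I have already read the lxml <a href="http://lxml.de/tutorial.html" rel="nofollow">tutorial</a> and tried the <code>findall</code> examples on a basic XML file (one without any namespaces), and there they worked.</p>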

1 Answer

Anonymous, 1 day ago

<p><em>Disclaimer: I am using the standard library xml.etree.ElementTree module rather than lxml (as far as I know, its API is a subset of what lxml offers). I am sure there is a much simpler answer that uses lxml and XPath, but I do not know it.</em></p> <h2>The namespace problem</h2> <p>You are right that namespaces are the cause. The XML file contains no <code>record</code> elements; it contains two elements whose tag is <code>{http://www.openarchives.org/OAI/2.0/}record</code>, as the following session shows:</p> <pre><code>>>> import xml.etree.ElementTree as etree
>>> xml_string = ...Your XML to parse...
>>> e = etree.fromstring(xml_string)

# Let's see what the root element is
>>> e
<Element {http://www.openarchives.org/OAI/2.0/}OAI-PMH at 7f39ebf54f80>

# Let's see what children there are of the root element
>>> for child in e:
...     print(child)
...
<Element {http://www.openarchives.org/OAI/2.0/}responseDate at 7f39ebf54fc8>
<Element {http://www.openarchives.org/OAI/2.0/}request at 7f39ebf58050>
<Element {http://www.openarchives.org/OAI/2.0/}ListRecords at 7f39ebf58098>

# Finally, let's get the children of the `ListRecords` element
>>> for child in e[-1]:
...     print(child)
...
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf580e0>
<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf58908>
</code></pre> <p>So, for example,</p> <pre><code>>>> e.find('ListRecords')
</code></pre> <p>returns <code>None</code>, whereas</p> <pre><code>>>> e.find('{http://www.openarchives.org/OAI/2.0/}ListRecords')
<Element {http://www.openarchives.org/OAI/2.0/}ListRecords at 7f39ebf58098>
</code></pre> <p>returns the <code>ListRecords</code> element.</p> <p>Note that I am using the <code>find</code> method because the standard library ElementTree has no <code>xpath</code> method.</p> <h2>A possible solution</h2> <p>One way around this is to extract the namespace prefix (the <code>{uri}</code> part of a tag) and prepend it to the tag you are looking for. You can use</p> <pre><code>>>> e.tag[:e.tag.index('}')+1]
'{http://www.openarchives.org/OAI/2.0/}'
</code></pre> <p>on the root element <code>e</code> to get the namespace, although I am sure there are better ways to do this.</p> <p>Now we can define functions that look up the tags we want, with an optional namespace prefix:</p> <pre><code>def findallNS(element, tag, namespace=None):
    # Prepend the '{uri}' prefix when one is given
    if namespace is not None:
        return element.findall(namespace + tag)
    else:
        return element.findall(tag)

def findNS(element, tag, namespace=None):
    # Same as findallNS, but returns only the first match (or None)
    if namespace is not None:
        return element.find(namespace + tag)
    else:
        return element.find(tag)
</code></pre> <p>So now we can write:</p> 
<pre><code>>>> namespace = e.tag[:e.tag.index('}')+1]
>>> list_records = findNS(e, 'ListRecords', namespace)
>>> findallNS(list_records, 'record', namespace)
[<Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf580e0>, <Element {http://www.openarchives.org/OAI/2.0/}record at 7f39ebf58908>]
</code></pre> <h2>An alternative</h2> <p>Another option is to write a function that searches for all child tags ending with the tag name you are interested in, for example:</p> <pre><code>def find_child_tags(element, tag):
    return [child for child in element if child.tag.endswith(tag)]
</code></pre> <p>This way you do not have to deal with namespaces at all.</p>
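<p>One caveat with a suffix match: <code>endswith</code> also accepts any tag whose name merely ends in the same letters. The sketch below is one possible stricter variant, not part of the original answer. The helper names <code>local_name</code> and <code>find_child_tags_exact</code> and the <code>http://example.org/ns</code> namespace are made up for illustration.</p>

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{uri}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def find_child_tags_exact(element, name):
    """Direct children whose local name equals `name`, in any namespace."""
    return [child for child in element if local_name(child.tag) == name]


# Tiny made-up document: two <record> children and one <subrecord>.
root = ET.fromstring(
    '<root xmlns="http://example.org/ns">'
    "<record/><subrecord/><record/>"
    "</root>"
)

# The suffix match also picks up <subrecord>; the exact comparison does not.
loose = [child for child in root if child.tag.endswith("record")]
exact = find_child_tags_exact(root, "record")
print(len(loose), len(exact))  # prints: 3 2
```

<p>Comparing the local name exactly keeps the "ignore namespaces" convenience while avoiding accidental matches on longer tag names.</p>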
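<p>For completeness, the <code>find</code>, <code>findall</code> and <code>iterfind</code> methods of the standard library's ElementTree also accept a prefix-to-URI mapping, which avoids building <code>{uri}tag</code> strings by hand. The following is a minimal sketch using a shortened copy of the XML from the question. The prefixes <code>oai</code> and <code>dc</code> in the mapping and the helper <code>publisher_contains</code> are names chosen here for illustration. lxml's <code>findall</code> takes a <code>namespaces</code> argument as well, so the same calls should carry over to lxml, though this sketch was written against the standard library only.</p>

```python
import xml.etree.ElementTree as ET

# Shortened version of the OAI-PMH sample from the question.
XML = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <about>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:publisher>Los Alamos arXiv</dc:publisher>
        </oai_dc:dc>
      </about>
    </record>
    <record/>
  </ListRecords>
</OAI-PMH>
"""

# The prefixes are local to this mapping (chosen for this example);
# only the URIs have to match the ones declared in the document.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def publisher_contains(root, word):
    """Return True if the text of any dc:publisher element contains `word`."""
    return any(
        word in (elem.text or "")
        for elem in root.iterfind(".//dc:publisher", NS)
    )


root = ET.fromstring(XML)

# All <record> elements in the OAI-PMH namespace, at any depth.
records = root.findall(".//oai:record", NS)
print(len(records))

print(publisher_contains(root, "Alamos"))
print(publisher_contains(root, "CERN"))
```

<p>Note that the default namespace of the document still needs a prefix in the mapping (<code>oai</code> here): an unprefixed <code>record</code> in the path keeps looking for an element in no namespace, which is why the original <code>findall(".//record")</code> came back empty.</p>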
