Proper XPath syntax for non-standard XML using Python

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23">
  <applicants>
    <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
      <addressbook>
        <last-name>Beyer</last-name>
        <first-name>Daniel Lee</first-name>
        <address><city>Franklin</city> <state>TN</state> <country>US</country></address>
      </addressbook>
      <nationality><country>omitted</country></nationality>
      <residence><country>US</country></residence>
    </applicant>
    <applicant sequence="002" app-type="applicant-inventor" designation="us-only">
      <addressbook>
        <last-name>Friedland</last-name>
        <first-name>Jason Michael</first-name>
        <address><city>Franklin</city> <state>TN</state> <country>US</country></address>
      </addressbook>
      <nationality><country>omitted</country></nationality>
      <residence><country>US</country></residence>
    </applicant>
  </applicants>
</us-patent-grant>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>

import urllib2, os, zipfile
from lxml import etree

# xmlSplitter, zf, xml_file, first and outFile are not shown in the question.
count = 0
for item in xmlSplitter(zf.open(xml_file)):
    count += 1
    if count > 1:
        break  # stop after the first document
    doc = etree.XML(item)
    docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
    title = first(doc.xpath('//invention-title/text()'))
    applicant = "-".join(doc.xpath('//applicants/applicant/*/text()'))
    print "DocID: {0}\nTitle: {1}\nApplicant: {2}\n".format(docID, title, applicant)
    # One pipe-delimited record per document
    outFile.write(str(docID) + "|" + str(title) + "|" + str(applicant) + "\n")

1 answer

User

#1 · Posted on 2024-05-23 22:26:50

This question is very similar to this other question of yours.

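There are two problems here:

1. How do you get from "non-standard XML" to "standard XML"?
2. How do you use XPath to get the text values of child elements and join them?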

You need to solve 1 before you tackle 2. If you need help with that, ask a separate question.

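"Non-standard XML" is the same as "not XML at all". You cannot parse it as XML, and you cannot use XPath on it. But the way you phrased the question suggests that you are trying to do exactly that anyway.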
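
As a rough illustration of problem 1, here is a minimal Python 3 sketch. It assumes the "non-standard XML" is really several complete XML documents concatenated in one file, each beginning with its own `<?xml ...?>` declaration at the start of a line, as the sample in the question suggests. `split_documents` is a hypothetical helper (not the asker's `xmlSplitter`), the titles are made-up sample data, and the standard library's ElementTree stands in for lxml so the sketch runs without extra packages.

```python
import xml.etree.ElementTree as ET


def split_documents(lines):
    """Yield one string per XML document from an iterable of text lines.

    Assumption: every document starts with an XML declaration that sits
    at the very beginning of a line.
    """
    chunk = []
    for line in lines:
        # A new declaration marks the start of the next document.
        if line.startswith("<?xml") and chunk:
            yield "".join(chunk)
            chunk = []
        chunk.append(line)
    if chunk:
        yield "".join(chunk)


# Two made-up documents glued together, mimicking the layout in the question.
concatenated = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>\n'
    '<us-patent-grant lang="EN"><invention-title>First</invention-title></us-patent-grant>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>\n'
    '<us-patent-grant lang="EN"><invention-title>Second</invention-title></us-patent-grant>\n'
)

# Each chunk is now a single well-formed document and can be parsed on its own.
docs = [
    ET.fromstring(chunk)
    for chunk in split_documents(concatenated.splitlines(keepends=True))
]
titles = [doc.findtext("invention-title") for doc in docs]
print(titles)
```

Only once each chunk parses cleanly like this does it make sense to run XPath queries against it.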

Assuming your question is actually about working with "standard XML", how about using the same approach as in my answer to your other question?
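
The linked answer is not reproduced here, so the following is only one possible sketch of problem 2, not necessarily that approach. In the sample from the question, the direct children of `<applicant>` (`addressbook`, `nationality`, `residence`) hold only nested elements and whitespace, so `//applicants/applicant/*/text()` finds no names; the query has to reach down to `first-name` and `last-name`. The sketch uses Python 3 and the standard library's ElementTree (an assumption, chosen so it runs without lxml) on a trimmed copy of the sample. The name order and the `-` separator are illustrative choices.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document from the question.
xml_text = """<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23">
<applicants>
<applicant sequence="001" app-type="applicant-inventor" designation="us-only">
<addressbook><last-name>Beyer</last-name> <first-name>Daniel Lee</first-name></addressbook>
<residence><country>US</country></residence>
</applicant>
<applicant sequence="002" app-type="applicant-inventor" designation="us-only">
<addressbook><last-name>Friedland</last-name> <first-name>Jason Michael</first-name></addressbook>
<residence><country>US</country></residence>
</applicant>
</applicants>
</us-patent-grant>"""

doc = ET.fromstring(xml_text)

names = []
for book in doc.findall(".//applicants/applicant/addressbook"):
    # Pick the name parts explicitly instead of relying on */text().
    parts = [book.findtext("first-name"), book.findtext("last-name")]
    names.append(" ".join(p.strip() for p in parts if p))

# Join the applicants the same way the question's code joins its results.
applicants = "-".join(names)
print(applicants)
```

Selecting the elements first and joining their text in Python keeps each applicant's first and last name together, which a single flat `text()` query would not guarantee.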
