ElementTree findall returns an empty list

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.7/ http://www.mediawiki.org/xml/export-0.7.xsd" version="0.7" xml:lang="en"> <siteinfo> <sitename>Wikipedia</sitename> <base>http://en.wikipedia.org/wiki/Main_Page</base> <generator>MediaWiki 1.20wmf9</generator> <case>first-letter</case> <namespaces> <namespace key="-2" case="first-letter">Media</namespace> <namespace key="-1" case="first-letter">Special</namespace> <namespace key="0" case="first-letter" /> <namespace key="1" case="first-letter">Talk</namespace> <namespace key="2" case="first-letter">User</namespace> <namespace key="3" case="first-letter">User talk</namespace> <namespace key="4" case="first-letter">Wikipedia</namespace> <namespace key="5" case="first-letter">Wikipedia talk</namespace> <namespace key="6" case="first-letter">File</namespace> <namespace key="7" case="first-letter">File talk</namespace> <namespace key="8" case="first-letter">MediaWiki</namespace> <namespace key="9" case="first-letter">MediaWiki talk</namespace> <namespace key="10" case="first-letter">Template</namespace> <namespace key="11" case="first-letter">Template talk</namespace> <namespace key="12" case="first-letter">Help</namespace> <namespace key="13" case="first-letter">Help talk</namespace> <namespace key="14" case="first-letter">Category</namespace> <namespace key="15" case="first-letter">Category talk</namespace> <namespace key="100" case="first-letter">Portal</namespace> <namespace key="101" case="first-letter">Portal talk</namespace> <namespace key="108" case="first-letter">Book</namespace> <namespace key="109" case="first-letter">Book talk</namespace> </namespaces> </siteinfo> <page> <title>Aratrum</title> <ns>0</ns> <id>65741</id> <revision> <id>349931990</id> <parentid>225434394</parentid> <timestamp>2010-03-15T02:55:02Z</timestamp> <contributor> <ip>143.105.193.119</ip> </contributor> <comment>/* 
Sources */</comment> <sha1>2zkdnl9nsd1fbopv0fpwu2j5gdf0haw</sha1> <text xml:space="preserve" bytes="1436">'''Aratrum''' is the Latin word for [[plough]], and "arotron" (αροτρον) is the [[Greek language|Greek]] word. The [[Ancient Greece|Greeks]] appear to have had diverse kinds of plough from the earliest historical records. [[Hesiod]] advised the farmer to have always two ploughs, so that if one broke the other might be ready for use. These ploughs should be of two kinds, the one called "autoguos" (αυτογυος, "self-limbed"), in which the plough-tail was of the same piece of timber as the share-beam and the pole; and the other called "pekton" (πηκτον, "fixed"), because in it, three parts, which were of three kinds of timber, were adjusted to one another, and fastened together by nails. The ''autoguos'' plough was made from a [[sapling]] with two branches growing from its trunk in opposite directions. In ploughing, the trunk served as the pole, one of the two branches stood upwards and became the tail, and the other penetrated the ground and, sometimes shod with bronze or iron, acted as the [[ploughshare]]. ==Sources== Based on an article from ''A Dictionary of Greek and Roman Antiquities,'' John Murray, London, 1875. ἄρατρον ==External links== *[http://penelope.uchicago.edu/Thayer/E/Roman/Texts/secondary/SMIGRA*/Aratrum.html Smith's Dictionary article], with diagrams, further details, sources. [[Category:Agricultural machinery]] [[Category:Ancient Greece]] [[Category:Animal equipment]]</text> </revision> </page>

2 answers

User

Answer 1 · edited 2024-05-14 17:23:07

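The problem is that you are not taking XML namespaces into account. The XML document (and every element in it) is in the http://www.mediawiki.org/xml/export-0.7/ namespace. To make it work, you need to change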

titles = document.findall('.//title')

to

titles = document.findall('.//{http://www.mediawiki.org/xml/export-0.7/}title')

You can also supply a prefix-to-namespace mapping through the namespaces parameter:

NSMAP = {'mw':'http://www.mediawiki.org/xml/export-0.7/'}
titles = document.findall('.//mw:title', namespaces=NSMAP)
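To see the three lookups side by side, here is a minimal sketch. It assumes a trimmed-down stand-in for the export file (the SAMPLE string is made up for illustration; only the namespace URI and the Aratrum title come from the question), and it uses only the standard library.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the export file in the question
# (assumption: same default namespace, far fewer elements).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/">
  <page>
    <title>Aratrum</title>
    <ns>0</ns>
  </page>
</mediawiki>"""

MW_NS = "http://www.mediawiki.org/xml/export-0.7/"
NSMAP = {"mw": MW_NS}

root = ET.fromstring(SAMPLE)

# Unqualified name: the element is really "{uri}title", so nothing matches.
plain = root.findall(".//title")

# Fully qualified "{uri}tag" form.
qualified = root.findall(".//{%s}title" % MW_NS)

# Prefix form, resolved through the namespaces mapping.
prefixed = root.findall(".//mw:title", namespaces=NSMAP)

print(len(plain), [t.text for t in qualified], [t.text for t in prefixed])
```

The "mw" prefix is arbitrary; it only has to match the key used in the mapping, not anything in the XML file itself.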

This works in Python 2.7, but it is not explained in the Python 2.7 documentation (the Python 3.3 documentation is better).

See also http://effbot.org/zone/element-namespaces.htm and this SO question with answers: Parsing XML with namespace in Python via 'ElementTree'.

The problem with iterparse is that the function yields (event, element) tuples, not bare elements. To get the tag name, change

for e in etree.iterparse(file_name):
    print(e.tag)

to this:

for event, elem in etree.iterparse(file_name):
    print(elem.tag)
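The same namespace rule applies when streaming with iterparse: the tag on each element is the fully qualified "{uri}name" form. A minimal sketch follows, assuming an in-memory sample in place of the real file; iter_titles and SAMPLE are hypothetical names introduced here, not part of the question.

```python
import io
import xml.etree.ElementTree as ET

# In-memory stand-in for the export file (assumption: same namespace).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/">
  <page><title>Aratrum</title><ns>0</ns></page>
</mediawiki>"""

TITLE_TAG = "{http://www.mediawiki.org/xml/export-0.7/}title"


def iter_titles(source):
    """Yield the text of each <title> element as parsing proceeds."""
    # By default iterparse reports only "end" events, so each element
    # is fully parsed by the time it is yielded here.
    for event, elem in ET.iterparse(source):
        if elem.tag == TITLE_TAG:
            yield elem.text


titles = list(iter_titles(io.BytesIO(SAMPLE)))
print(titles)
```

iterparse accepts either a filename or a file object, so passing file_name instead of the BytesIO object should work the same way.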

User

Answer 2 · edited 2024-05-14 17:23:07

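First, you need to find the parent element, page. I don't know how many levels deep it is nested, but once you have found it you can get the title tag directly: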

>>> page_tag = ET.fromstring(xdata)
>>> title_tag = page_tag.find('title')
>>> title_tag.text
'Aratrum'

Given the additional information you have posted, you can do the following:

def parser(file_name):
    document = etree.parse(file_name)
    titles = []
    for page_tag in document.findall('page'):
        titles.append(page_tag.find('title').text)
    return titles
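Because the export in the question declares a default namespace, a plain 'page' or 'title' lookup would probably come back empty on that file. Below is a sketch of the same loop with a prefix mapping, assuming the export-0.7 namespace from the question; parse_titles, the sample data and the second page title are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

NSMAP = {"mw": "http://www.mediawiki.org/xml/export-0.7/"}


def parse_titles(source):
    """Return the title text of every <page> directly under the root."""
    document = ET.parse(source)
    titles = []
    for page_tag in document.findall("mw:page", namespaces=NSMAP):
        title_tag = page_tag.find("mw:title", namespaces=NSMAP)
        # find() returns None when a page has no title; skip those.
        if title_tag is not None:
            titles.append(title_tag.text)
    return titles


# Hypothetical two-page sample; only "Aratrum" comes from the question.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/">
  <page><title>Aratrum</title></page>
  <page><title>Example</title></page>
</mediawiki>"""

result = parse_titles(io.BytesIO(SAMPLE))
print(result)
```

ET.parse takes a filename as well, so parse_titles(file_name) should behave the same on a file on disk.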

Hope this helps!
