Finding elements by attribute with lxml

Asked 2025-04-16 12:23

I need to parse an XML file to extract some data.
I only need the elements that carry a certain attribute. Here is a sample of the document:

<root>
    <articles>
        <article type="news">
             <content>some text</content>
        </article>
        <article type="info">
             <content>some text</content>
        </article>
        <article type="news">
             <content>some text</content>
        </article>
    </articles>
</root>

Here, I only want to retrieve the articles whose type is "news".
What is the most efficient and elegant way to do this with lxml?

I tried using the find method, but the result isn't very nice:

from lxml import etree
f = etree.parse("myfile")
root = f.getroot()
articles = root.getchildren()[0]
article_list = articles.findall('article')
for article in article_list:
    if "type" in article.keys():
        if article.attrib['type'] == 'news':
            content = article.find('content')
            content = content.text

2 Answers

Just for reference, you can achieve the same result with findall:

from lxml import etree

root = etree.fromstring("""
<root>
    <articles>
        <article type="news">
             <content>some text</content>
        </article>
        <article type="info">
             <content>some text</content>
        </article>
        <article type="news">
             <content>some text</content>
        </article>
    </articles>
</root>
""")

# Narrow to <articles>, then match only <article type="news"> and step down to <content>
articles = root.find("articles")
article_list = articles.findall("article[@type='news']/content")
for a in article_list:
    print(a.text)
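
As a variation on the snippet above, here is a minimal sketch that keeps the matching <article> elements themselves instead of stepping down to <content>. It assumes the same document shape as in the question; the sample texts ("first news", "some info", "second news") are changed from the original only so the results are distinguishable, and the variable names are illustrative.

```python
from lxml import etree

# Sample document modeled on the question; the texts are made up for illustration.
xml = """
<root>
    <articles>
        <article type="news">
             <content>first news</content>
        </article>
        <article type="info">
             <content>some info</content>
        </article>
        <article type="news">
             <content>second news</content>
        </article>
    </articles>
</root>
"""
root = etree.fromstring(xml)

# ".//" searches all descendants, so there is no need to locate <articles> first.
news_articles = root.findall(".//article[@type='news']")

# findtext() returns the text of the first matching child, or None if it is absent.
contents = [article.findtext("content") for article in news_articles]
print(contents)
```

Keeping the <article> elements around is handy if you also need other attributes or children later; if you only ever need the text, the path ending in /content is shorter.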

Answered 2025-04-16 by Python大师

You can use XPath, for example root.xpath("//article[@type='news']")

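This XPath expression returns a list of all <article/> elements that have a "type" attribute with the value "news". You can then iterate over that list to do whatever you need, or pass it on elsewhere.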
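
A minimal sketch of iterating over that list, assuming the same document shape as in the question (the sample texts are made up so the output is distinguishable, and the names news and summary are illustrative):

```python
from lxml import etree

# Sample document modeled on the question; the texts are made up for illustration.
root = etree.fromstring("""
<root>
    <articles>
        <article type="news">
             <content>first news</content>
        </article>
        <article type="info">
             <content>some info</content>
        </article>
        <article type="news">
             <content>second news</content>
        </article>
    </articles>
</root>
""")

# xpath() returns a plain Python list of matching elements.
news = root.xpath("//article[@type='news']")

summary = []
for article in news:
    # Each item is an ordinary element, so the usual element API applies.
    summary.append((article.tag, article.get("type"), article.findtext("content")))

print(summary)
```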

If you only want the text content, you can extend the XPath like this:

from lxml import etree

root = etree.fromstring("""
<root>
    <articles>
        <article type="news">
             <content>some text</content>
        </article>
        <article type="info">
             <content>some text</content>
        </article>
        <article type="news">
             <content>some text</content>
        </article>
    </articles>
</root>
""")

print(root.xpath("//article[@type='news']/content/text()"))

This prints ['some text', 'some text']. Or, if you only want the content elements, use "//article[@type='news']/content", and so on.
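
If the attribute value comes from a variable rather than a literal, one option is lxml's XPath variable support, which avoids building the expression by string formatting. The following is only a sketch under the same document assumptions as in the question; the variable name $kind, the helper name find_text_by_type, and the sample texts are illustrative and not part of the original answer.

```python
from lxml import etree

# Sample document modeled on the question; the texts are made up for illustration.
root = etree.fromstring("""
<root>
    <articles>
        <article type="news">
             <content>first news</content>
        </article>
        <article type="info">
             <content>some info</content>
        </article>
        <article type="news">
             <content>second news</content>
        </article>
    </articles>
</root>
""")

# Compile the expression once; $kind is bound through a keyword argument at call time.
find_text_by_type = etree.XPath("//article[@type=$kind]/content/text()")
news_texts = find_text_by_type(root, kind="news")

# The same variable mechanism also works with the xpath() method directly.
info_contents = root.xpath("//article[@type=$kind]/content", kind="info")

print(news_texts)
print([c.text for c in info_contents])
```

Compiling with etree.XPath mainly pays off when the same expression is evaluated many times; for a one-off query the plain xpath() call is simpler.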

Answered 2025-04-16 by Python大师