Find all tags with a specific attribute value

2 votes
1 answer
1346 views
Asked 2025-04-16 04:30

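I would like to know how to iterate over all tags that carry a specific attribute with a specific value. For example, suppose we only need data1, data2, and so on.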

<html>
    <body>
        <invalid html here/>
        <dont care> ... </dont care>
        <invalid html here too/>
        <interesting attrib1="naah, it is not this"> ... </interesting tag>
        <interesting attrib1="yes, this is what we want">
            <group>
                <line>
                    data
                </line>
            </group>
            <group>
                <line>
                    data1
                <line>
            </group>
            <group>
                <line>
                    data2
                <line>
            </group>
        </interesting>
    </body>
</html>

I tried BeautifulSoup, but it could not parse this file. lxml's parser, however, seems to work:

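from io import StringIO

from lxml import etree

# get_sanitized_data and SITE are my own helper and constant (not shown here)
broken_html = get_sanitized_data(SITE)

# HTMLParser recovers from malformed markup instead of raising
parser = etree.HTMLParser()
tree = etree.parse(StringIO(broken_html), parser)

result = etree.tostring(tree.getroot(), pretty_print=True, method="html")

print(result)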

I am not very familiar with its API, and I don't know how to use getiterator or xpath.

1 Answer

3

Here is one way to do it, using lxml and XPath. The XPath expression is 'descendant::*[@attrib1="yes, this is what we want"]'. It tells lxml to look at every descendant of the current node and return those whose attrib1 attribute equals "yes, this is what we want".

import io

import lxml.html as lh

content = '''
<html>
    <body>
        <invalid html here/>
        <dont care> ... </dont care>
        <invalid html here too/>
        <interesting attrib1="naah, it is not this"> ... </interesting tag>
        <interesting attrib1="yes, this is what we want">
            <group>
                <line>
                    data
                </line>
            </group>
            <group>
                <line>
                    data1
                <line>
            </group>
            <group>
                <line>
                    data2
                <line>
            </group>
        </interesting>
    </body>
</html>
'''
# io.StringIO replaces the Python 2-only cStringIO module
doc = lh.parse(io.StringIO(content))
# Select every descendant element whose attrib1 has exactly this value
tags = doc.xpath('descendant::*[@attrib1="yes, this is what we want"]')
print(tags)
# [<Element interesting at b767e14c>]
for tag in tags:
    # encoding="unicode" returns str rather than bytes on Python 3
    print(lh.tostring(tag, encoding="unicode"))
# <interesting attrib1="yes, this is what we want"><group><line>
#                     data
#                 </line></group><group><line>
#                     data1
#                 <line></line></line></group><group><line>
#                     data2
#                 <line></line></line></group></interesting>
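
To get at the text inside the matching tags (the data1, data2 part of the question) rather than the serialized element, something along these lines might work. This is only a sketch: the helper names line_texts_xpath and line_texts_iter are made up for illustration, the sample markup is a cleaned-up version of the one above, and iter() is used as the current replacement for the deprecated getiterator.

```python
import lxml.html as lh

# Simplified, well-nested sample (an assumption; the real page is messier)
content = '''
<html>
    <body>
        <interesting attrib1="naah, it is not this"><group><line>skip</line></group></interesting>
        <interesting attrib1="yes, this is what we want">
            <group><line> data1 </line></group>
            <group><line> data2 </line></group>
        </interesting>
    </body>
</html>
'''

WANTED = "yes, this is what we want"


def line_texts_xpath(doc, value):
    """Collect <line> text under elements whose attrib1 equals value, via XPath."""
    # $v is an XPath variable, which avoids quoting problems in the expression
    found = doc.xpath('//*[@attrib1=$v]//line/text()', v=value)
    # Drop whitespace-only strings left behind by unclosed or empty tags
    return [t.strip() for t in found if t.strip()]


def line_texts_iter(doc, value):
    """Same result with plain element iteration instead of XPath."""
    texts = []
    for el in doc.iter("interesting"):
        if el.get("attrib1") != value:
            continue
        for line in el.iter("line"):
            text = (line.text or "").strip()
            if text:
                texts.append(text)
    return texts


doc = lh.fromstring(content)
xpath_result = line_texts_xpath(doc, WANTED)
iter_result = line_texts_iter(doc, WANTED)
print(xpath_result)
print(iter_result)
```

The XPath version is shorter; the iter() version may be easier to extend if the filtering logic becomes more involved than a single attribute comparison.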
