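How can I flatten a DOM most efficiently with lxml?

Context

The following Python 2.7 function uses etree and XPath to walk the DOM and build a flat list representation of it. At each node it checks whether the current element has a class that should be ignored; if so, it skips the element and its children.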

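import re
from lxml import etree

ignore_classes = ['ignore']

def flatten_tree(element):
    # Direct child elements only; descendants are reached by recursion.
    children = element.findall('*')
    elements = []
    for child in children:
        # Skip an ignored element together with its whole subtree.
        if child.attrib.get('class') in ignore_classes:
            continue
        # Recurse (the original called an undefined get_children here).
        for el in flatten_tree(child):
            elements.append(el)
    # The element itself goes first, ahead of its descendants.
    elements.insert(0, element)
    return elements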

Example

This document:

<html>
    <body>
        <header class="ignore">
            <h1>Gerbils</h1>
        </header>
        <main>
            <p>They like almonds. That's pretty much all I know.</p>
        </main>
    </body>
</html>

would become:

[ <html>, <body>, <main>, <p> ]

Thanks in advance!

2 Answers

User

#1 · edited 2024-06-02 07:26:31

You can use DOMImplementation.createDocument with arguments.

User

#2 · edited 2024-06-02 07:26:31

You can use XPath, for example:

In [24]: root.xpath('descendant-or-self::*[not(ancestor-or-self::*[@class="ignore"])]')
Out[24]: 
[<Element html at 0x7f4d5e1c1548>,
 <Element body at 0x7f4d5e1dba48>,
 <Element main at 0x7f4d5024e6d8>,
 <Element p at 0x7f4d5024e728>]

The XPath descendant-or-self::*[not(ancestor-or-self::*[@class="ignore"])] means

descendant-or-self::*          select the current node and all its descendants
  [                            such that
   not(                        it is not true that
     ancestor-or-self::*       it itself or an ancestor
       [@class="ignore"]       has an attribute, class, equal to "ignore"
   )]
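
Note that @class="ignore" compares the whole attribute value, so an element written as class="ignore hidden" would not be skipped. If token matching is what you want (an assumption that goes beyond the question), a minimal sketch using the usual XPath 1.0 idiom with concat and normalize-space might look like this. The names doc, exact_xpath and token_xpath are made up for illustration.

```python
from lxml import etree

# Hypothetical sample: the header carries two classes, "ignore" and "hidden".
doc = etree.fromstring(
    '<html><body>'
    '<header class="ignore hidden"><h1>Gerbils</h1></header>'
    '<main><p>Almonds</p></main>'
    '</body></html>'
)

# Exact comparison: only matches when the attribute is exactly "ignore".
exact_xpath = 'descendant-or-self::*[not(ancestor-or-self::*[@class="ignore"])]'

# Token comparison: pad the normalized class string with spaces and look
# for " ignore " so that "ignore" is matched as a whole word.
token_xpath = (
    'descendant-or-self::*[not(ancestor-or-self::*['
    'contains(concat(" ", normalize-space(@class), " "), " ignore ")])]'
)

exact_tags = [el.tag for el in doc.xpath(exact_xpath)]
token_tags = [el.tag for el in doc.xpath(token_xpath)]
print(exact_tags)  # header and h1 are still present
print(token_tags)  # header subtree is skipped
```

With the exact comparison the header subtree survives, because "ignore hidden" is not equal to "ignore"; the token version drops it.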

To handle a list of class names to ignore, you can build the XPath with a bit of code. For example, if ignore_classes = ['A', 'B'], then you could define

conditions = ' or '.join([
    'ancestor-or-self::*[@class="{}"]'.format(cls) for cls in ignore_classes])
xpath = 'descendant-or-self::*[not({})]'.format(conditions)

so that xpath equals

'descendant-or-self::*[not(ancestor-or-self::*[@class="A"] or ancestor-or-self::*[@class="B"])]'

Although this looks verbose, using lxml's XPath engine should be significantly faster than traversing the tree in Python.

import lxml.html as LH

html = """
<html>
    <body>
        <header class="ignore">
            <h1>Gerbils</h1>
        </header>
        <main class="ignore2">
            <p>They like almonds. That's pretty much all I know.</p>
        </main>
    </body>
</html>"""

def flatten_element(element, ignore_classes):
    conditions = ' or '.join([
        'ancestor-or-self::*[@class="{}"]'.format(cls) for cls in ignore_classes])
    xpath = 'descendant-or-self::*[not({})]'.format(conditions)
    return element.xpath(xpath)

root = LH.fromstring(html)
ignore_classes = ['ignore']
flattened = flatten_element(root, ignore_classes)
print(flattened)

yields

[<Element html at 0x7f30af3459a8>, <Element body at 0x7f30af367ea8>, <Element main at 0x7f30af2fbdb8>, <Element p at 0x7f30af2fbae8>]
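
If you want to check the two approaches against each other on your own documents, a rough sketch could compare a plain recursive traversal with the XPath version and time both. The function names flatten_python and flatten_xpath, the sample document, and the repetition count are all assumptions made for illustration; actual timings will depend on the document and the machine.

```python
import timeit
from lxml import etree

IGNORE = ['ignore']  # hypothetical list of class values to skip

def flatten_python(element, ignore_classes):
    """Pre-order traversal in Python, skipping ignored subtrees.

    Unlike the XPath version, this never tests the starting element
    itself or its ancestors, only its descendants.
    """
    result = [element]
    for child in element.findall('*'):
        if child.get('class') in ignore_classes:
            continue
        result.extend(flatten_python(child, ignore_classes))
    return result

def flatten_xpath(element, ignore_classes):
    """Same selection expressed as a single XPath query."""
    conditions = ' or '.join(
        'ancestor-or-self::*[@class="{}"]'.format(cls)
        for cls in ignore_classes)
    return element.xpath('descendant-or-self::*[not({})]'.format(conditions))

root = etree.fromstring(
    '<html><body>'
    '<header class="ignore"><h1>Gerbils</h1></header>'
    '<main><p>They like almonds.</p></main>'
    '</body></html>'
)

python_tags = [el.tag for el in flatten_python(root, IGNORE)]
xpath_tags = [el.tag for el in flatten_xpath(root, IGNORE)]
print(python_tags)
print(xpath_tags)

# Rough timing only; the numbers are not meaningful for a document this small.
print(timeit.timeit(lambda: flatten_python(root, IGNORE), number=200))
print(timeit.timeit(lambda: flatten_xpath(root, IGNORE), number=200))
```

Both functions return elements in document order, so the two tag lists should match whenever the starting element is the document root.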
