lxml: filtering HTML tags that have no text between their child tags

Asked 2025-04-17 21:12

I have some documents like this:

....
  <tag1>
     <tag2>Foo</tag2>
     <tag3>Bar</tag3>
  </tag1>

  <tag1>
     <tag2>Foo</tag2>
     <tag3>Bar</tag3>
     Foo
  </tag1>

  <tag1>
     <tag2>Foo</tag2>     
     Foo
     <tag3>Bar</tag3>
  </tag1>

  <tag1>
     Foo
  </tag1>
 ....

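I want to select the tags that contain only child tags, that is, tags with no text between their children. In the example above, that should return the first <tag1>.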

My first attempt was:

from lxml import html

html_content = html.fromstring(content)
tag1 = html_content.xpath('//tag1')
tags = []
for tag in tag1:
   exists = False
   for child in tag.getchildren():
      exists = exists or (len(child.tag) == 0)
   if (not exists):
      tags.append(tag)

But it turns out that getchildren() does not return text that sits outside any child tag. How can I do this?

2 Answers

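What getchildren() does

This method returns all direct child elements, in document order.

So getchildren() returns element nodes, not text. Each node has several attributes:

tag,
tail,
text, and
others, which you can look up in the documentation.

The answer to your question is tail, which gives you

the text after this element's end tag but before the next sibling element's start tag. This is either a string or None, meaning there was no text.

Text that comes before the first child is stored on the parent's text attribute instead.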
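To make the text/tail split concrete, here is a minimal sketch. The fragment and the words "lead" and "middle" are made up for illustration and are not from the question; only the tag names are borrowed from it.

```python
from lxml import html

# Hypothetical fragment: text before the first child and between the children.
fragment = "<tag1>lead<tag2>Foo</tag2>middle<tag3>Bar</tag3></tag1>"
root = html.fromstring(fragment)

# '//tag1' searches from the document root, so this works whatever
# element fromstring() happened to return.
tag1 = root.xpath('//tag1')[0]
first, second = list(tag1)

print(repr(tag1.text))    # text before the first child, stored on the parent
print(repr(first.text))   # text inside <tag2>
print(repr(first.tail))   # text after </tag2>, stored on <tag2> itself
print(repr(second.tail))  # nothing follows </tag3>, so there is no tail text
```

The point to notice is that "middle" does not belong to tag1 in lxml's model; it hangs off the child that precedes it.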

Answered 2025-04-17 by Python大师


Use the element's .tail attribute:

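for tag in tag1:
    # exists becomes True once any child is followed by non-whitespace text
    exists = False
    for child in tag.getchildren():
        # child.tail may be None; the updated version further down guards against that
        exists = exists or bool(child.tail.strip())
    if not exists:
        tags.append(tag)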

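Depending on what you mean by "only child tags", this can be written more compactly as follows (this form also requires at least one child):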

for tag in tag1:
  children = tag.getchildren()
  no_extra_text = not any(child.tail.strip() for child in children)
  if children and no_extra_text:
    tags.append(tag)

Here is an update that also checks the leading text and avoids the error when the text is None (I had assumed it would always be a string):

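for tag in tag1:
  children = tag.getchildren()
  # a tail is the text between a child's end tag and the next tag; it may be None
  no_extra_text = not any(child.tail and child.tail.strip() for child in children)
  # tag.text is the text before the first child; None or whitespace counts as no text
  no_text = not (tag.text and tag.text.strip())
  if children and no_extra_text and no_text:
    tags.append(tag)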
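For anyone who wants to try this end to end, below is a self-contained sketch of the same check run against the question's sample. The wrapping <div> and the helper names has_text and only_child_tags are my own additions for illustration, not part of lxml. It iterates with list(element), which the lxml documentation recommends over the deprecated getchildren(). The last line shows an XPath 1.0 expression that I believe expresses the same filter; treat it as an alternative, not as what the code above does internally.

```python
from lxml import html

# The question's sample, wrapped in a hypothetical <div> so it parses as one tree.
content = """
<div>
  <tag1>
     <tag2>Foo</tag2>
     <tag3>Bar</tag3>
  </tag1>
  <tag1>
     <tag2>Foo</tag2>
     <tag3>Bar</tag3>
     Foo
  </tag1>
  <tag1>
     <tag2>Foo</tag2>
     Foo
     <tag3>Bar</tag3>
  </tag1>
  <tag1>
     Foo
  </tag1>
</div>
"""


def has_text(value):
    """True when value is a string containing non-whitespace characters."""
    return bool(value and value.strip())


def only_child_tags(element):
    """True when element has child elements and no stray text around them."""
    children = list(element)
    if not children:
        return False
    # element.text is the leading text; each child's tail is the text after it.
    return not has_text(element.text) and not any(has_text(c.tail) for c in children)


root = html.fromstring(content)
by_loop = [t for t in root.xpath('//tag1') if only_child_tags(t)]

# XPath alternative: at least one child element, and no direct text node
# that is non-empty after whitespace normalisation.
by_xpath = root.xpath('//tag1[*][not(text()[normalize-space()])]')

print(len(by_loop), len(by_xpath))
```

On the sample, both approaches should keep only the first <tag1>. The XPath version works because, in the XPath data model, leading text and tails all show up as text() children of the parent.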

Answered 2025-04-17 by Python大师
