How to exclude text anchored by specific tags from lxml text_content()

normal = """ <a href='link1'> Forget me </a> I need this one <a href='link2'> Forget me too </a> Forget me not even when you go to sleep <a href='link3'> Forget me three </a> Foremost on your mind """

1 answer

User

#1 · Posted 2024-06-09 01:46:03

I think you are overcomplicating this. There is no need to create a tree_struct object and use getpath(). Here is a suggestion:

from lxml import html

normal = """
  <p>
    <b>
      <a href='link1'>        Forget me  </a>
    </b>     I need this one      <br>
    <b>
     <a href='link2'>  Forget me too  </a>
    </b> Forget me not <i>even when</i> you go to sleep <br>
    <b>  <a href='link3'>  Forget me three  </a>
    </b>  Foremost on your mind <br>
   </p>
"""

target = html.fromstring(normal)

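# Walk every element in document order, skipping the <a> elements themselves
for e in target.iter():
    if e.tag != "a":
        # Print the element's own text unless it is only whitespace
        if e.text and e.text.strip():
            print(e.text.strip())
        # Print the tail (text after the closing tag) unless it is only whitespace
        if e.tail and e.tail.strip():
            print(e.tail.strip())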

Output:

I need this one
Forget me not
even when
you go to sleep
Foremost on your mind
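
Note that the loop skips both the text and the tail of each `<a>`. That works for the sample because every link is wrapped in `<b>`, but in the flat markup from the question the wanted text directly follows `</a>`, so it is the tail of the link. For that case an XPath query over text nodes may be a simpler fit. The following is a minimal sketch and not part of the original answer: wrapping the question's markup in a `<p>` is an assumption, and `text_outside_anchors` is a hypothetical helper name.

```python
from lxml import html

# Assumption: the flat markup from the question, wrapped in a <p> so that
# html.fromstring() returns a single, predictable root element.
flat = (
    "<p><a href='link1'> Forget me </a> I need this one "
    "<a href='link2'> Forget me too </a> Forget me not even when you go to sleep "
    "<a href='link3'> Forget me three </a> Foremost on your mind </p>"
)


def text_outside_anchors(root):
    """Return the stripped text chunks that are not inside any <a> element."""
    # text() selects every text node below root (what lxml exposes as .text
    # and .tail); the predicate drops nodes that have an <a> ancestor.
    chunks = root.xpath(".//text()[not(ancestor::a)]")
    # Discard chunks that consist only of whitespace
    return [c.strip() for c in chunks if c.strip()]


root = html.fromstring(flat)
for chunk in text_outside_anchors(root):
    print(chunk)
```

Text that follows a closing `</a>` is a sibling of the link, not a descendant, so it has no `<a>` ancestor and is kept.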
