lxml: How do I discard all <li> elements that contain a link with a particular class?

1 vote

2 answers

516 views

Asked 2025-04-16 22:32

As usual, I'm struggling to find my way around the lxml documentation (note to self: write a good lxml tutorial and attract lots of traffic!).

I want to find all <li> items that do not contain an <a> tag with a particular class.

For example:

<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>

I only want the <li> elements that do not contain a link with the class new, and from those I want the text inside the <small> tag. In other words, 'pudding'.

Can anyone help?

Thanks!

Tags: lxml, web-scraping, html-parsing, text-extraction, element-selection, class-filtering

2 Answers

I quickly put this code together:

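from lxml import etree
from lxml.cssselect import CSSSelector  # needs the separate cssselect package

# Named "markup" rather than "str" to avoid shadowing the built-in str type
markup = r"""
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>"""

html = etree.HTML(markup)

# <a class="new"> links that are direct children of an <li>
bad_sel = CSSSelector('li > a.new')
# <small> elements that are direct children of an <li>
good_sel = CSSSelector('li > small')

# The <li> parents that must be excluded
bad = [item.getparent() for item in bad_sel(html)]
# Keep only the <small> elements whose parent <li> is not excluded
good = [item for item in good_sel(html) if item.getparent() not in bad]

for item in good:
    print(item.text)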

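This code first builds a list of the <li> elements you don't want (the parents of every a.new link), then keeps only the <small> elements whose parent <li> is not in that list.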
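If you would rather not depend on cssselect, the same "exclude the bad ones" idea can be sketched with plain element iteration. This is only a minimal sketch, not a recommended API: the function name small_texts_without_class and its cls parameter are made up for illustration, and it assumes the <small> element is a direct child of the <li>, as in the question's markup.

```python
import lxml.html as lh


def small_texts_without_class(markup, cls="new"):
    """Return the <small> text of every <li> that has no <a> carrying class `cls`.

    Hypothetical helper for illustration; assumes <small> is a direct child of <li>.
    """
    tree = lh.fromstring(markup)
    results = []
    for li in tree.iter("li"):
        # Split the class attribute on whitespace so class="new external" also counts
        flagged = any(
            cls in (a.get("class") or "").split() for a in li.iter("a")
        )
        if flagged:
            continue
        small = li.find("small")
        if small is not None:
            results.append(small.text)
    return results


sample = """
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>
"""

print(small_texts_without_class(sample))  # ['pudding']
```

Unlike the 'li > a.new' selector, this also skips an <li> whose flagged link is nested deeper than a direct child.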

Answered 2025-04-16 by Python大师


import lxml.html as lh

content = '''\
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>
'''

tree = lh.fromstring(content)
# Text of <small> inside every <li> that has no descendant <a class="new">
for elt in tree.xpath('//li[not(descendant::a[@class="new"])]/small/text()'):
    print(elt)

# Output:
# pudding

This XPath means:

//                        # starting from the root, look at all descendants
li[                       # select <li> elements that
    not(descendant::a[    # do not have a descendant <a>
        @class="new"])]   # whose class attribute is exactly "new"
    /small                # then select their <small> children
    /text()               # and return the text of those nodes
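One thing to watch: @class="new" compares the whole attribute value, so a link with class="new external" would not be excluded. A common XPath 1.0 idiom for matching a single class token is to pad the normalized attribute with spaces and search for " new ". The sketch below shows that variant; the function name small_texts_token_match and the extra sample rows are assumptions added for illustration, not part of the original answer.

```python
import lxml.html as lh

# Pad the class list with spaces so " new " matches only a whole class token
TOKEN_XPATH = (
    '//li[not(descendant::a['
    'contains(concat(" ", normalize-space(@class), " "), " new ")'
    '])]/small/text()'
)


def small_texts_token_match(markup):
    """Hypothetical helper: <small> text of each <li> with no <a> having class token "new"."""
    tree = lh.fromstring(markup)
    return [str(text) for text in tree.xpath(TOKEN_XPATH)]


sample_multi = """
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new external">St Marcellin</a></li>
<li><small>bread</small>: rye and <a href="/sourdough" class="renewed">sourdough</a></li>
</ul>
"""

print(small_texts_token_match(sample_multi))  # ['pudding', 'bread']
```

Here "cheese" is dropped even though its link carries two classes, while "bread" is kept because "renewed" merely contains the substring "new".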

Answered 2025-04-16 by Python大师
