How do I correctly extract li elements from a ul with BeautifulSoup?

Posted 2024-06-06 08:51:07


I have some HTML that looks like this:

<div class="content-container">
<h2>Description</h2>
<pre>Manage the wine production and review the production pipeline and volumes.</pre>
<h2>Alternative label</h2>
<ul>
<li><p>managing production of wine</p></li>
<li><p>supervising wine production</p></li>
<li><p>wine production managing</p></li>
<li><p>supervising production of wine</p></li>
<li><p>supervise wine production</p></li>
<li><p>wine production supervising</p></li>
<li><p>managing wine production</p></li>
</ul>
<h2>Skill type</h2>
<ul>

What I want to do is collect all the li elements listed under <h2>Alternative label</h2>. This is the code I have so far:

somehtmlContent =BeautifulSoup(somehtml.content,"lxml")
for item in somehtmlContent.find_all("div", {"class": "content-container"}):
         try: 
            altlabel =  item.find(text="Alternative label")
            h2tag = altlabel.parent
            ultag = h2tag.findNext('ul')
            litags = []
            for litag in ultag:
                litags.append(litag.findNext('p').text)
            for tag in litags:
                print(tag)
         except:
            pass

However, when I print the contents of the litags list, every entry is printed twice, as shown below:

managing production of wine
managing production of wine
supervising wine production
supervising wine production
wine production managing
wine production managing
supervising production of wine
supervising production of wine
supervise wine production
supervise wine production
wine production supervising
wine production supervising
managing wine production
managing wine production

Can someone help me understand why this happens? I would appreciate any help you can offer.


2 answers
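  • There is a small mistake in the line for litag in ultag. Use for litag in ultag.find_all('li') instead.
  • With for litag in ultag, the loop also yields the whitespace (newline) text nodes between the <li> tags. For each of those, findNext('p') returns the next <p> in the document, which is the one inside the following <li>, so every entry is appended twice. That is why the output contains duplicates.
  • The following code works as expected: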
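from bs4 import BeautifulSoup as bsp  # bsp is assumed to be an alias for BeautifulSoup

# s is assumed to hold the HTML string from the question
somehtmlContent = bsp(s, "html.parser")
for item in somehtmlContent.find_all("div", {"class": "content-container"}):
    print('-' * 100)
    litags = []  # reset per container so earlier results are not printed again
    try:
        altlabel = item.find(text="Alternative label")
        h2tag = altlabel.parent
        ultag = h2tag.findNext('ul')
        # find_all('li') skips the newline text nodes between the <li> tags
        for litag in ultag.find_all('li'):
            litags.append(litag.findNext('p').text)
        for tag in litags:
            print(tag)
    except AttributeError:
        # no "Alternative label" heading, <ul>, or <p> found in this container
        pass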
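
To make the explanation in this answer concrete, here is a minimal sketch of where the duplicates come from. The shortened HTML and the trailing <p> are assumptions made for illustration (they stand in for whatever follows the list on the real page), and "html.parser" is used only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical HTML: two list items plus a later <p> on the page.
html = """<ul>
<li><p>managing production of wine</p></li>
<li><p>supervising wine production</p></li>
</ul>
<p>trailing paragraph</p>"""

soup = BeautifulSoup(html, "html.parser")
ul = soup.find("ul")

# Iterating over a Tag yields its direct children: <li> tags AND the
# newline text nodes that sit between them.
children = list(ul)
kinds = [type(child).__name__ for child in children]
print(kinds)

# From a newline text node, find_next("p") returns the next <p> in document
# order, i.e. the one inside the following <li>. From the <li> itself it
# returns that same <p> again, hence the duplicates.
naive = [child.find_next("p").get_text() for child in children]
print(naive)

# Restricting the loop to the <li> tags gives one entry per item.
fixed = [li.find_next("p").get_text() for li in ul.find_all("li")]
print(fixed)
```

Note that the last newline inside the <ul> picks up the unrelated paragraph after the list, so the naive loop can also append content that does not belong to the list at all.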

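Run against the HTML snippet as given in the question, the current code prints nothing: it ends up in the exception handler. The problem is here: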

        for litag in ultag:
            litags.append(litag.findNext('p').text)

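You are effectively iterating over ultag.contents, which contains all the child tags as well as the NavigableStrings (the text nodes) between them. To fix this, iterate over the <p> tags only: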

data = '''<div class="content-container">
<h2>Description</h2>
<pre>Manage the wine production and review the production pipeline and volumes.</pre>
<h2>Alternative label</h2>
<ul>
<li><p>managing production of wine</p></li>
<li><p>supervising wine production</p></li>
<li><p>wine production managing</p></li>
<li><p>supervising production of wine</p></li>
<li><p>supervise wine production</p></li>
<li><p>wine production supervising</p></li>
<li><p>managing wine production</p></li>
</ul>
<h2>Skill type</h2>
<ul>'''

from bs4 import BeautifulSoup

somehtmlContent =BeautifulSoup(data,"lxml")

for item in somehtmlContent.find_all("div", {"class": "content-container"}):
    try:
        altlabel =  item.find(text="Alternative label")
        h2tag = altlabel.parent
        ultag = h2tag.findNext('ul')
        litags = []
        for p in ultag.find_all('p'):
            litags.append(p.text)
        for tag in litags:
            print(tag)
    except:
        pass

Prints:

managing production of wine
supervising wine production
wine production managing
supervising production of wine
supervise wine production
wine production supervising
managing wine production
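
As a rough illustration of why the question's loop prints nothing for this snippet, the sketch below reproduces the failure on a shortened copy of the HTML (the shortening is an assumption for brevity). The last child of the <ul> is a newline text node with no <p> after it, so find_next returns None and accessing .text raises AttributeError, which the bare except in the question silently swallows before anything is printed.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's snippet; like the original, it ends
# with an unclosed <ul> and has no <p> after the first list.
html = """<div class="content-container">
<h2>Alternative label</h2>
<ul>
<li><p>managing wine production</p></li>
</ul>
<h2>Skill type</h2>
<ul>"""

soup = BeautifulSoup(html, "html.parser")
ul = soup.find("ul")

# The final child of the list is the newline just before </ul>.
last_child = list(ul)[-1]
missing = last_child.find_next("p")  # nothing follows, so this is None

labels = []
error = None
try:
    for child in ul:
        labels.append(child.find_next("p").text)
except AttributeError as exc:
    # The question's code hides this with a bare "except: pass".
    error = exc

print(repr(last_child), missing)
print(labels, type(error).__name__)
```

Catching a specific exception, or at least logging it, would have made the failure visible instead of producing empty output.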

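Edit: a shorter way to get the content is soup.select('h2:contains("Alternative label") + ul p'), where soup is the parsed BeautifulSoup object (somehtmlContent in the code above). It selects the <h2> that contains the text "Alternative label", then the <ul> that is its immediately following sibling, and finally every <p> inside that <ul>: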

for p in soup.select('h2:contains("Alternative label") + ul p'):
    print(p.text)
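
A self-contained sketch of the selector approach follows. It assumes a reasonably recent soupsieve (2.1 or later), where the :contains() pseudo-class is deprecated in favour of :-soup-contains(); on older installations the original :contains() spelling is the one to use. The second list and its content are made up for illustration, to show that only the list directly after the matching heading is selected.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with two headed lists.
html = """<div class="content-container">
<h2>Alternative label</h2>
<ul>
<li><p>managing production of wine</p></li>
<li><p>supervising wine production</p></li>
</ul>
<h2>Skill type</h2>
<ul>
<li><p>skill</p></li>
</ul>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# "h2 + ul" matches a <ul> that is the next sibling element of the <h2>;
# whitespace text nodes between them are ignored by the combinator.
selector = 'h2:-soup-contains("Alternative label") + ul p'
labels = [p.get_text(strip=True) for p in soup.select(selector)]

for label in labels:
    print(label)
```

Because the selector works on elements only, it avoids the text-node problem from the original loop entirely.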
