How do I correctly extract the li elements from a ul with BeautifulSoup?

<div class="content-container">
<h2>Description</h2>
<pre>Manage the wine production and review the production pipeline and volumes.</pre>
<h2>Alternative label</h2>
<ul>
<li><p>managing production of wine</p></li>
<li><p>supervising wine production</p></li>
<li><p>wine production managing</p></li>
<li><p>supervising production of wine</p></li>
<li><p>supervise wine production</p></li>
<li><p>wine production supervising</p></li>
<li><p>managing wine production</p></li>
</ul>
<h2>Skill type</h2>
<ul>

somehtmlContent = BeautifulSoup(somehtml.content, "lxml")
for item in somehtmlContent.find_all("div", {"class": "content-container"}):
    try:
        altlabel = item.find(text="Alternative label")
        h2tag = altlabel.parent
        ultag = h2tag.findNext('ul')
        litags = []
        for litag in ultag:
            litags.append(litag.findNext('p').text)
        for tag in litags:
            print(tag)
    except:
        pass

managing production of wine
managing production of wine
supervising wine production
supervising wine production
wine production managing
wine production managing
supervising production of wine
supervising production of wine
supervise wine production
supervise wine production
wine production supervising
wine production supervising
managing wine production
managing wine production

2 answers

User

Answer 1 · edited 2024-06-06 08:51:07

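There is a small mistake in the line for litag in ultag. Use for litag in ultag.find_all('li') instead.
Iterating with for litag in ultag also yields the whitespace text nodes that sit between the <li> tags. For each of those nodes, findNext('p') returns the next <p> tag, so that tag is appended a second time. That is why your output contains duplicates.
The following code works as expected: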

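from bs4 import BeautifulSoup as bsp  # bsp is assumed to be an alias for BeautifulSoup

# s is assumed to hold the HTML source as a string
somehtmlContent = bsp(s, "html")
for item in somehtmlContent.find_all("div", {"class": "content-container"}):
    print('-' * 100)
    try:
        altlabel = item.find(text="Alternative label")
        h2tag = altlabel.parent
        ultag = h2tag.findNext('ul')
        litags = []  # reset for each container so labels do not accumulate
        # iterate only over <li> tags, skipping the whitespace text nodes
        for litag in ultag.find_all('li'):
            litags.append(litag.findNext('p').text)
        for tag in litags:
            print(tag)
    except AttributeError:
        # a lookup returned None (heading, <ul> or <p> missing); skip this container
        pass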
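
To see where the duplicates come from, it may help to look at what iterating over a tag actually yields. The following is a minimal sketch, assuming bs4 is installed and using the built-in html.parser; the two-item list is made up for illustration and is not the page from the question.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical two-item list, laid out with newlines like the page in the question.
html = """<ul>
<li><p>first</p></li>
<li><p>second</p></li>
</ul>"""

ul = BeautifulSoup(html, "html.parser").find("ul")

# Iterating over a Tag walks its direct children, whitespace text nodes included.
kinds = ["text" if isinstance(child, NavigableString) else child.name for child in ul]
print(kinds)

# find_next('p') on a whitespace node and on the <li> after it returns the same <p>,
# so every label is collected twice (the trailing node has no following <p>).
dupes = [child.find_next("p").get_text() for child in ul
         if child.find_next("p") is not None]
print(dupes)

# Restricting the loop to <li> tags visits each item exactly once.
labels = [li.find("p").get_text() for li in ul.find_all("li")]
print(labels)
```

The first list shows text nodes alternating with the li tags, which is the source of the repeated entries.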

User

Answer 2 · edited 2024-06-06 08:51:07

With the HTML snippet from the question, the current code prints nothing; it ends up in the exception handler. The problem is here:

        for litag in ultag:
            litags.append(litag.findNext('p').text)

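You are effectively iterating over ultag.contents, which holds every child node: tags as well as NavigableStrings. In the snippet, the whitespace node after the last <li> has no <p> following it, so findNext('p') returns None, .text raises an AttributeError, and the bare except swallows it before anything is printed. To fix this, iterate only over the <p> tags: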

data = '''<div class="content-container">
<h2>Description</h2>
<pre>Manage the wine production and review the production pipeline and volumes.</pre>
<h2>Alternative label</h2>
<ul>
<li><p>managing production of wine</p></li>
<li><p>supervising wine production</p></li>
<li><p>wine production managing</p></li>
<li><p>supervising production of wine</p></li>
<li><p>supervise wine production</p></li>
<li><p>wine production supervising</p></li>
<li><p>managing wine production</p></li>
</ul>
<h2>Skill type</h2>
<ul>'''

from bs4 import BeautifulSoup

somehtmlContent =BeautifulSoup(data,"lxml")

for item in somehtmlContent.find_all("div", {"class": "content-container"}):
    try:
        altlabel =  item.find(text="Alternative label")
        h2tag = altlabel.parent
        ultag = h2tag.findNext('ul')
        litags = []
        for p in ultag.find_all('p'):
            litags.append(p.text)
        for tag in litags:
            print(tag)
    except:
        pass

Prints:

managing production of wine
supervising wine production
wine production managing
supervising production of wine
supervise wine production
wine production supervising
managing wine production
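
The bare except in the loop above also hides the reason whenever nothing is printed. One possible variant checks for missing elements explicitly instead. This is only a sketch: the function name alt_labels and the shortened sample markup are hypothetical, and it assumes a bs4 version that accepts the string= argument.

```python
from bs4 import BeautifulSoup


def alt_labels(html, heading="Alternative label"):
    """Return the <p> texts of the first <ul> that follows the given <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    h2 = soup.find("h2", string=heading)
    if h2 is None:  # heading not present on this page
        return []
    ul = h2.find_next("ul")
    if ul is None:  # heading present, but no list after it
        return []
    return [p.get_text(strip=True) for p in ul.find_all("p")]


# Shortened, hypothetical version of the markup from the question.
sample = """<div class="content-container">
<h2>Alternative label</h2>
<ul>
<li><p>managing production of wine</p></li>
<li><p>supervising wine production</p></li>
</ul>
<h2>Skill type</h2>
</div>"""

print(alt_labels(sample))
```

Returning an empty list for a missing heading keeps the caller simple while still making each failure case visible in the code.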

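Edit: a shorter way to get the contents is soup.select('h2:contains("Alternative label") + ul p'), where soup is the parsed document (somehtmlContent above). The selector picks the <h2> that contains "Alternative label", the <ul> that is its immediately following sibling, and every <p> inside that <ul>: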

for p in soup.select('h2:contains("Alternative label") + ul p'):
    print(p.text)
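
Newer releases of soupsieve, the selector engine behind select(), deprecate :contains() in favour of :-soup-contains(). Here is a sketch of the same selector with the newer spelling, assuming an installed soupsieve that supports it; the markup is a shortened, partly made-up version of the page in the question.

```python
from bs4 import BeautifulSoup

# Shortened sample; the "Skill type" list item is hypothetical filler.
html = """<h2>Description</h2>
<pre>Manage the wine production.</pre>
<h2>Alternative label</h2>
<ul>
<li><p>managing production of wine</p></li>
<li><p>supervising wine production</p></li>
</ul>
<h2>Skill type</h2>
<ul>
<li><p>skill</p></li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")

# "+" matches only the <ul> directly following the matching <h2>,
# so the list under "Skill type" is left out.
selector = 'h2:-soup-contains("Alternative label") + ul p'
labels = [p.get_text() for p in soup.select(selector)]
print(labels)
```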
