I have some HTML that looks like this:
<div class="content-container">
<h2>Description</h2>
<pre>Manage the wine production and review the production pipeline and volumes.</pre>
<h2>Alternative label</h2>
<ul>
<li><p>managing production of wine</p></li>
<li><p>supervising wine production</p></li>
<li><p>wine production managing</p></li>
<li><p>supervising production of wine</p></li>
<li><p>supervise wine production</p></li>
<li><p>wine production supervising</p></li>
<li><p>managing wine production</p></li>
</ul>
<h2>Skill type</h2>
<ul>
What I want to do is collect all the li elements that appear under <h2>Alternative label</h2>. This is the snippet I have so far:
somehtmlContent = BeautifulSoup(somehtml.content, "lxml")
for item in somehtmlContent.find_all("div", {"class": "content-container"}):
    try:
        altlabel = item.find(text="Alternative label")
        h2tag = altlabel.parent
        ultag = h2tag.findNext('ul')
        litags = []
        for litag in ultag:
            litags.append(litag.findNext('p').text)
        for tag in litags:
            print(tag)
    except:
        pass
However, when I print the contents of the litags list, every entry is printed twice, like this:
managing production of wine
managing production of wine
supervising wine production
supervising wine production
wine production managing
wine production managing
supervising production of wine
supervising production of wine
supervise wine production
supervise wine production
wine production supervising
wine production supervising
managing wine production
managing wine production
Can someone help me understand why this happens? I'd appreciate any help you can offer.
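Don't use for litag in ultag. Use for litag in ultag.find_all('li') instead.

Iterating with for litag in ultag also yields the whitespace-only strings that sit between the <li> tags. For each of those strings, findNext('p') returns the next <p> tag, so that tag is appended a second time. That is why the output contains duplicates.

With the HTML snippet in the question, the current code does not print anything; it ends up in the exception handler. The problem is that you are effectively iterating over ultag.contents, which contains all the tags and NavigableStrings. To fix this, iterate only over the <p> tags (for example, ultag.find_all('p')), which gives each label once.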
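The explanation above can be checked with a small sketch. It uses a trimmed, closed-off copy of the question's markup and the built-in "html.parser" rather than "lxml" (an assumption made only so the example needs no extra parser). The names html, kinds, and labels are illustrative and do not come from the original answer.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Trimmed copy of the question's markup, with closing tags added so it parses cleanly.
html = """
<div class="content-container">
<h2>Alternative label</h2>
<ul>
<li><p>managing production of wine</p></li>
<li><p>supervising wine production</p></li>
<li><p>wine production managing</p></li>
</ul>
<h2>Skill type</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
ultag = soup.find("h2", string="Alternative label").find_next("ul")

# Iterating over the <ul> directly walks ultag.contents, which mixes the
# whitespace-only strings between the tags with the <li> tags themselves.
kinds = ["string" if isinstance(child, NavigableString) else child.name
         for child in ultag]
print(kinds)

# Restricting the loop to the <p> tags yields one entry per label.
labels = [p.get_text() for p in ultag.find_all("p")]
for label in labels:
    print(label)
```

The kinds list makes the stray whitespace strings visible, and the second loop shows that they disappear once the search is limited to tags.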
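Edit: a shorter way to get the contents is soup.select('h2:contains("Alternative label") + ul p'). It selects the <h2> that contains "Alternative label", then the <ul> that is its immediately following sibling, and then all the <p> tags inside that list.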
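As a rough sketch of the selector approach, the snippet below runs it against a trimmed copy of the question's markup. It assumes soupsieve 2.1 or later, where :contains() is deprecated in favour of the equivalent :-soup-contains(), so the newer spelling is used here. The names selector and alt_labels are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup, with closing tags added so it parses cleanly.
html = """
<div class="content-container">
<h2>Description</h2>
<pre>Manage the wine production and review the production pipeline and volumes.</pre>
<h2>Alternative label</h2>
<ul>
<li><p>managing production of wine</p></li>
<li><p>supervising wine production</p></li>
<li><p>wine production managing</p></li>
</ul>
<h2>Skill type</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# h2:-soup-contains(...) matches the heading by its text, "+ ul" takes the
# <ul> element that immediately follows it, and " p" every <p> inside that list.
selector = 'h2:-soup-contains("Alternative label") + ul p'
alt_labels = [p.get_text() for p in soup.select(selector)]
print(alt_labels)
```

Because the sibling combinator only looks at elements, the newline between the heading and the list does not get in the way, and no manual loop over the children is needed.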