I'm trying to extract a bulleted list from this page: http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/
Specifically, I want the bullets highlighted in yellow in the screenshot.
First, I use Beautiful Soup to select every <ul> tag that has no attributes:
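import requests
from bs4 import BeautifulSoup
text = BeautifulSoup(requests.get('http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/', timeout=7.00).text, 'html.parser')  # name the parser explicitly
bullets = text.find_all(lambda tag: tag.name == 'ul' and not tag.attrs)  # <ul> tags with no attributes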
Here are the two <ul> tags it returns:
<ul>
<li>You are experiencing a decrease in sales and customers</li>
<li>If your brand design does not reflect what you deliver</li>
<li>If you want to attract a new target audience</li>
<li>Management change</li>
<li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
</ul>
<ul><li class="share-item share-fb" data-title="What is Causing your Headaches?- Startup Pain Points" data-type="facebook" data-url="http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/" title="Facebook"></li><li class="share-item share-tw" data-title="What is Causing your Headaches?- Startup Pain Points" data-type="twitter" data-url="http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/" title="Twitter"></li><li class="share-item share-gp" data-lang="en-US" data-title="What is Causing your Headaches?- Startup Pain Points" data-type="googlePlus" data-url="http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/" title="Google+"></li><li class="share-item share-pn" data-media="http://bodetree.com/wp-content/uploads/2015/04/pain-points.png" data-title="What is Causing your Headaches?- Startup Pain Points" data-type="pinterest" data-url="http://bodetree.com/what-is-causing-your-headaches-startup-pain-points/" title="Pinterest"></li></ul>
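I only want to extract the <ul> tags that appear in the body of the page, so I want to filter out the second <ul> tag and the junk inside it. It looks like the <ul> tags outside the page body contain <li> tags that have attributes, so we could filter on those attributes. Basically, I only need tags with the structure <ul><li>string</li></ul>. So in this example, I only want to get back this <ul>: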
<ul>
<li>You are experiencing a decrease in sales and customers</li>
<li>If your brand design does not reflect what you deliver</li>
<li>If you want to attract a new target audience</li>
<li>Management change</li>
<li>19 Questions to Ask Yourself Before You Start Rebranding</li>
</ul>
Is there a way to do this with find_all()?
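The attribute-based filter described in the question can be written as a predicate function passed to find_all(). The following is a minimal sketch, not a definitive solution: SAMPLE_HTML is a trimmed stand-in for the real page, and is_plain_list, plain_lists, and bullet_texts are illustrative names.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the page (assumption: the real markup is larger).
SAMPLE_HTML = """
<div class="entry-content">
<ul>
<li>Management change</li>
<li><a href="http://example.com/">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
</ul>
</div>
<ul><li class="share-item share-fb" title="Facebook"></li><li class="share-item share-tw" title="Twitter"></li></ul>
"""


def is_plain_list(tag):
    """True for a <ul> with no attributes whose direct <li> children have none either."""
    if tag.name != "ul" or tag.attrs:
        return False
    items = tag.find_all("li", recursive=False)
    return bool(items) and all(not li.attrs for li in items)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all() accepts a function that receives each tag and returns True or False.
plain_lists = soup.find_all(is_plain_list)

# get_text() flattens nested tags such as <a>, leaving only the link text.
bullet_texts = [
    li.get_text(strip=True)
    for ul in plain_lists
    for li in ul.find_all("li", recursive=False)
]
print(bullet_texts)
```

This keeps the first list and drops the share-button list, because every <li> in the latter carries class and title attributes. It relies on that pattern holding for the whole page, which is an assumption based on the two lists shown above.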
Search for the ul inside the article body, which is a div with class="entry-content".
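A minimal sketch of scoping the search to the article container, run against a trimmed stand-in for the page instead of the live site. SAMPLE_HTML and the variable names are illustrative assumptions; only the entry-content class comes from the text above.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the page (assumption: the real markup is larger).
SAMPLE_HTML = """
<div class="entry-content">
<p>Some article text.</p>
<ul>
<li>You are experiencing a decrease in sales and customers</li>
<li>Management change</li>
</ul>
</div>
<ul><li class="share-item share-fb" title="Facebook"></li></ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Locate the article container first, then search only inside it.
article = soup.find("div", class_="entry-content")
article_lists = article.find_all("ul") if article is not None else []

article_bullets = [
    li.get_text(strip=True)
    for ul in article_lists
    for li in ul.find_all("li")
]
for bullet in article_bullets:
    print(bullet)
```

Scoping the search this way avoids inspecting attributes on each <li>: the share-button list sits outside the container, so it is never visited.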