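For the past few months I have been learning Python and the BeautifulSoup functions, trying to scrape online news articles, mainly for my own research.
However, I have had a hard time getting the content of Chinese websites to print properly.
Which tag should I use to get the content of the article?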
<div class="w980 wbnav clear"><a
href="http://english.peopledaily.com.cn/"
target="_blank">English</a>>></div>
<div class="w980 wb_10 clear">
<h1>DPRK launches ballistic missile 'capable of hitting US
mainland'</h1>
<div> (<a
</div>
<div class="wb_12 clear">
<p style="text-align: center;">
<img alt="" src="/NMediaFile/2017/1129/FOREIGN201711291331000220555852915.jpg" style="width: 900px; height: 783px;" /></p>
<p>
The Democratic People's Republic of Korea (DPRK) has confirmed that it successfully tested a 'Hwasong 15' intercontinental ballistic missile (ICBM) on Wednesday.</p>
<p>
A Korean Central News Agency (KCNA) statement, which confirms earlier assessments from the United States and the Republic of Korea (ROK), claims the new type of ICBM "is capable of striking the whole mainland of the US."
</p>
<p>
It was Pyongyang's first test launch since a missile was fired in mid-September, days after its sixth-nuclear test.</p>
<p>
The ICBM was launched at 02:48 local time on Wednesday, according to the KCNA statement, and flew to an altitude of 4,475 km and then a distance of 950 km.</p>
<p>
It was launched from Sain Ni in the DPRK and flew for 53 minutes before splashing down into the Sea of Japan, said Pentagon spokesman Robert Manning.</p>
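I opened the site (http://en.people.cn/index.html) and looked through the articles.
If you only want the data from one specific article, for example http://en.people.cn/n3/2017/1220/c90000-9306707.html,
then you can use the following code: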
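The snippet below is not the answerer's original code. It is a minimal sketch of one way to do this, assuming the article body sits in a div carrying the classes `d2p3_left wb_left fl` and that the headline is an `h1`. The function names `parse_article`, `save_article`, and `scrape_article` are hypothetical, chosen only for illustration.

```python
import requests
from bs4 import BeautifulSoup

# Class string assumed to mark the div that wraps the article body.
ARTICLE_CLASS = "d2p3_left wb_left fl"


def parse_article(html):
    """Return {'heading': ..., 'content': ...} for one article page."""
    soup = BeautifulSoup(html, "html.parser")
    # Build "div.d2p3_left.wb_left.fl" so all three classes must be present.
    body = soup.select_one("div." + ".".join(ARTICLE_CLASS.split()))
    if body is None:
        # Assumed fallback: search the whole page if the layout differs.
        body = soup
    h1 = body.find("h1") or soup.find("h1")
    # split()/join collapses line breaks and repeated spaces inside the text.
    heading = " ".join(h1.get_text().split()) if h1 else ""
    paragraphs = [" ".join(p.get_text().split()) for p in body.find_all("p")]
    content = "\n".join(text for text in paragraphs if text)
    return {"heading": heading, "content": content}


def save_article(article, path):
    """Write the heading and content to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(article["heading"] + "\n\n" + article["content"] + "\n")


def scrape_article(url, path):
    """Download one article, parse it, and save it as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # Pass raw bytes so BeautifulSoup can detect the page encoding itself.
    article = parse_article(response.content)
    save_article(article, path)
    return article
```

Calling `scrape_article` with the article URL above and an output path such as `article.txt` (a placeholder name) would write the headline and paragraphs to that file. Handing BeautifulSoup the raw bytes rather than already-decoded text is one way to avoid garbled characters when the server does not declare its encoding.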
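I also checked other articles, and they all appear to use the
d2p3_left wb_left fl
class on the HTML div tag that holds the actual article content. So I extracted the content from that particular tag and put it in a dictionary with "heading" and "content" keys, so that it can be formatted later if needed.
I then exported all the values of the dictionary to a text file.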
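If you want to scrape every article on the home page, you just need to collect the links in a list and then loop over the list items, passing each one as the argument to
requests.get()
Hope this helps.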
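The link-collecting step can be sketched roughly as follows. This is only an illustration: it assumes article URLs on the site contain `/n3/` and end in `.html`, as the example article URL does, and the names `extract_article_links` and `fetch_articles` are hypothetical.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

HOME_URL = "http://en.people.cn/index.html"


def extract_article_links(html, base_url=HOME_URL):
    """Return absolute article URLs found in the page, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        # Resolve relative hrefs against the page they were found on.
        url = urljoin(base_url, a["href"])
        # Assumption: article pages look like .../n3/.../<id>.html
        if "/n3/" in url and url.endswith(".html") and url not in links:
            links.append(url)
    return links


def fetch_articles(home_url=HOME_URL):
    """Download the home page, then every article it links to."""
    home = requests.get(home_url, timeout=10)
    home.raise_for_status()
    pages = {}
    for url in extract_article_links(home.content, home_url):
        response = requests.get(url, timeout=10)
        if response.ok:  # skip pages that fail to load
            pages[url] = response.content  # raw bytes, ready for parsing
    return pages
```

`fetch_articles()` returns a dictionary mapping each article URL to its raw HTML, which can then be parsed with BeautifulSoup one page at a time. When crawling many pages, pausing briefly between requests is a reasonable courtesy to the server.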