Scraping new articles from a web page

Posted 2024-04-26 05:32:37


Over the past few months I have been learning Python and BeautifulSoup, trying to scrape web news articles, mainly for my own research purposes.

However, I keep having trouble getting the article content from this news site to print out cleanly.

Which tag should I use to get the content of the article?

<div class="w980 wbnav clear"><a 
href="http://english.peopledaily.com.cn/" 
target="_blank">English</a>&gt;&gt;</div>
<div class="w980 wb_10 clear">
<h1>DPRK launches ballistic missile 'capable of hitting US 
mainland'</h1>
<div> (<a 



</div>
<div class="wb_12 clear">
<p style="text-align: center;">
<img alt="" src="/NMediaFile/2017/1129/FOREIGN201711291331000220555852915.jpg" style="width: 900px; height: 783px;" /></p>
<p>
The Democratic People's Republic of Korea (DPRK) has confirmed that it successfully tested a "Hwasong 15" intercontinental ballistic missile (ICBM) on Wednesday.</p>
<p>
A Korean Central News Agency (KCNA) statement, which confirms earlier assessments from the United States and the Republic of Korea (ROK), claims the new type of ICBM "is capable of striking the whole mainland of the US."
</p>
<p>
It was Pyongyang's first test launch since a missile was fired in mid-September, days after its sixth nuclear test.</p>
<p>
The ICBM was launched at 02:48 local time on Wednesday, according to the KCNA statement, and flew to an altitude of 4,475 km and then a distance of 950 km.</p>
<p>
It was launched from Sain Ni in the DPRK and flew for 53 minutes before splashing down into the Sea of Japan, said Pentagon spokesman Robert Manning.</p>

1 Answer

#1 · Posted 2024-04-26 05:32:37

I opened the site you linked (http://en.people.cn/index.html) and looked at the articles.

If you just want to scrape the data from one particular article, for example http://en.people.cn/n3/2017/1220/c90000-9306707.html,

then you can use the following code:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://en.people.cn/n3/2017/1220/c90000-9306707.html')
soup = BeautifulSoup(r.content, 'html.parser')

# The div with this class holds the actual article content
article = soup.find("div", {"class": "d2p3_left wb_left fl"})

d = {}
d["heading"] = article.find("h2").text

# Concatenate the text of every paragraph in the article body
content = ''
for item in article.find_all("p"):
    content += item.text

# str.replace returns a new string, so the result must be reassigned
d["content"] = content.replace("\t", "")

with open('article1.txt', 'w') as f:
    for value in d.values():
        f.write(value)

I also checked several other articles, and they all seem to use the class d2p3_left wb_left fl on the HTML div tag that contains the actual article content.

So I extracted the content from that particular tag and put it into a dictionary with "heading" and "content" keys, so it can be formatted later if needed.

Then I wrote all the values of the dictionary out to a text file.

If you want to scrape all the articles from the home page, you just need to collect the links into a list and then iterate over that list, passing each item as the argument to requests.get().
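As a minimal sketch of that link-collecting step (assuming article URLs contain the /n3/ path segment, as the example article above does; the HTML string below is a hypothetical stand-in for the real home page, which you would fetch with requests.get(BASE).content):

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = 'http://en.people.cn/index.html'

# Hypothetical stand-in for the home page HTML; replace this with
# requests.get(BASE).content in real use.
html = '''
<ul>
  <li><a href="/n3/2017/1220/c90000-9306707.html">Article</a></li>
  <li><a href="/about.html">About</a></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')

# Keep only links matching the article URL pattern, resolved to
# absolute URLs so each one can be passed straight to requests.get()
links = []
for a in soup.find_all('a', href=True):
    url = urljoin(BASE, a['href'])
    if '/n3/' in url and url not in links:
        links.append(url)

print(links)
```

Each URL in links can then be fetched and parsed with the same find("div", {"class": "d2p3_left wb_left fl"}) logic shown above.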

Hope this helps.
