Scraping new articles from a web page

Posted 2024-04-26 05:32:37


Over the past few months I have been learning Python and BeautifulSoup, trying to scrape web news articles, mainly for my own research purposes.

However, I keep having trouble getting the article content from this news site to print out cleanly.

Which tag should I use to get the content of the article?

<div class="w980 wbnav clear"><a 
href="http://english.peopledaily.com.cn/" 
target="_blank">English</a>&gt;&gt;</div>
<div class="w980 wb_10 clear">
<h1>DPRK launches ballistic missile 'capable of hitting US 
mainland'</h1>
<div> (<a 



</div>
<div class="wb_12 clear">
<p style="text-align: center;">
<img alt="" src="/NMediaFile/2017/1129/FOREIGN201711291331000220555852915.jpg" style="width: 900px; height: 783px;" /></p>
<p>
The Democratic People's Republic of Korea (DPRK) has confirmed that it successfully tested a "Hwasong 15" intercontinental ballistic missile (ICBM) on Wednesday.</p>
<p>
A Korean Central News Agency (KCNA) statement, which confirms earlier assessments from the United States and the Republic of Korea (ROK), claims the new type of ICBM "is capable of striking the whole mainland of the US."
</p>
<p>
It was Pyongyang's first test launch since a missile was fired in mid-September, days after its sixth nuclear test.</p>
<p>
The ICBM was launched at 02:48 local time on Wednesday, according to the KCNA statement, and flew to an altitude of 4,475 km and then a distance of 950 km.</p>
<p>
It was launched from Sain Ni in the DPRK and flew for 53 minutes before splashing down into the Sea of Japan, said Pentagon spokesman Robert Manning.</p>

1 Answer

#1 · Posted 2024-04-26 05:32:37

I opened the site you linked (http://en.people.cn/index.html) and looked at the articles.

If you just want to scrape the data from one particular article, for example http://en.people.cn/n3/2017/1220/c90000-9306707.html,

then you can use the following code:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://en.people.cn/n3/2017/1220/c90000-9306707.html')
soup = BeautifulSoup(r.content, 'html.parser')

# The div with this class holds the actual article content
article = soup.find("div", {"class": "d2p3_left wb_left fl"})

d = {}
d["heading"] = article.find("h2").text

# Concatenate the text of every paragraph in the article body
content = ''
for item in article.find_all("p"):
    content += item.text

# str.replace returns a new string, so the result must be reassigned
d["content"] = content.replace("\t", "")

with open('article1.txt', 'w') as f:
    for value in d.values():
        f.write(value)

I also checked several other articles, and they all seem to use the class d2p3_left wb_left fl on the HTML div tag that contains the actual article content.

So I extracted the content from that particular tag and put it into a dictionary with "heading" and "content" keys, so it can be formatted later if needed.

Then I wrote all the values of the dictionary out to a text file.

If you want to scrape all the articles from the home page, you just need to collect the links into a list and then iterate over that list, passing each item as the argument to requests.get().
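As a minimal sketch of that link-collecting step (assuming article URLs contain the /n3/ path segment, as the example article above does; the HTML string below is a hypothetical stand-in for the real home page, which you would fetch with requests.get(BASE).content):

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = 'http://en.people.cn/index.html'

# Hypothetical stand-in for the home page HTML; replace this with
# requests.get(BASE).content in real use.
html = '''
<ul>
  <li><a href="/n3/2017/1220/c90000-9306707.html">Article</a></li>
  <li><a href="/about.html">About</a></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')

# Keep only links matching the article URL pattern, resolved to
# absolute URLs so each one can be passed straight to requests.get()
links = []
for a in soup.find_all('a', href=True):
    url = urljoin(BASE, a['href'])
    if '/n3/' in url and url not in links:
        links.append(url)

print(links)
```

Each URL in links can then be fetched and parsed with the same find("div", {"class": "d2p3_left wb_left fl"}) logic shown above.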

Hope this helps.
