Scraping news articles from scraped links on a news website with BeautifulSoup (Python)

Posted 2024-04-25 17:55:00


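I want to scrape some Indonesian news websites. What I collect is the site's most popular recent news. For each item, the output currently contains the rank, headline, link, and read count.

Here is my code: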

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Fetch the home page and parse it with an explicit parser
kompas = requests.get('https://www.kompas.com/')
beautify = BeautifulSoup(kompas.content, 'html.parser')

# Each "most popular" entry is a div whose class attribute is "most__list clearfix"
news = beautify.find_all('div', class_='most__list clearfix')
arti = []
for each in news:
  nu = each.find('div', class_='most__count').text      # rank
  title = each.find('h4', class_='most__title').text    # headline
  lnk = each.a.get('href')                              # article URL
  rcount = each.find('div', class_='most__read').text   # read count
  print(nu)
  print(title)
  print(lnk)
  print(rcount)

  arti.append({
    'Top Number': nu,
    'Headline': title,
    'Link': lnk,
    'Most Read': rcount
  })

# Write the collected rows to a CSV file
df = pd.DataFrame(arti)
df.to_csv('kompas.csv', index=False)

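Actually, I want more than the headline, link, and read count in the output: I also want the article text. The article content is not on the main page, though, so I have to follow each link to get it. Any help would be appreciated.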


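1 answer
User
#1 · Posted 2024-04-25 17:55:00

This should help. Request each link inside the loop and read the article body from the div with class read__content: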

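import requests
from bs4 import BeautifulSoup
import pandas as pd

# Fetch and parse the home page (html5lib is a separate package: pip install html5lib)
kompas = requests.get('https://www.kompas.com/')
beautify = BeautifulSoup(kompas.content, 'html5lib')

# Each "most popular" entry is a div whose class attribute is "most__list clearfix"
news = beautify.find_all('div', class_='most__list clearfix')
arti = []
for each in news:
  nu = each.find('div', class_='most__count').text      # rank
  title = each.find('h4', class_='most__title').text    # headline
  lnk = each.a.get('href')                              # article URL
  rcount = each.find('div', class_='most__read').text   # read count

  # Follow the link and pull the article body from the article page
  r = requests.get(lnk)
  soup = BeautifulSoup(r.text, 'html5lib')
  body = soup.find('div', class_='read__content')
  # Some pages may lack this container; store an empty string instead of crashing
  content = body.text.strip() if body else ''
  print(nu)
  print(title)
  print(lnk)
  print(rcount)

  arti.append({
    'Top Number': nu,
    'Headline': title,
    'Link': lnk,
    'Most Read': rcount,
    'Content': content
  })

# Write the collected rows, now including the article text, to a CSV file
df = pd.DataFrame(arti)
df.to_csv('kompas.csv', index=False)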
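
If it helps, the extraction step can be pulled out into a small function so it can be checked on a saved HTML string without hitting the site. This is only a sketch: the name extract_article_text and the sample markup are made up for illustration, the read__content class is taken from the code above, and it uses Python's built-in html.parser instead of html5lib so that nothing extra needs installing.

```python
from bs4 import BeautifulSoup


def extract_article_text(html, container_class="read__content"):
    """Return the article text from an HTML string, or '' if the container is missing.

    extract_article_text is a hypothetical helper name; container_class defaults
    to the class used in the answer above.
    """
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", class_=container_class)
    if body is None:
        return ""
    # Prefer one line per <p>; skip paragraphs that are empty after stripping.
    paragraphs = [p.get_text().strip() for p in body.find_all("p")]
    paragraphs = [text for text in paragraphs if text]
    if paragraphs:
        return "\n".join(paragraphs)
    # No <p> tags inside the container: fall back to its full text.
    return body.get_text().strip()


# Made-up sample markup, only for trying the function offline.
sample = """
<html><body>
  <div class="read__content"><p>First paragraph.</p><p>Second <b>one</b>.</p></div>
</body></html>
"""
print(extract_article_text(sample))
```

Inside the loop you could then write content = extract_article_text(r.text) in place of the soup.find(...) line.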

