Parsing Google News with BeautifulSoup

3 Answers

User

#1 · edited 2024-05-18 23:27:40

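The site you are trying to parse is dynamic, meaning JavaScript has to run in a browser before the HTML you see is rendered.

So fetching the page with requests only returns the page source as it is before any JavaScript has run.

To parse a dynamic site, you therefore need something like selenium to run the JavaScript in a browser. You can then take the rendered HTML from it and parse that with BeautifulSoup.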
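A minimal sketch of that flow, assuming Selenium 4 with a local Chrome install. The function names and the `https://news.google.com/` URL are placeholders of mine, not something from the question, and a heavily scripted page may also need an explicit wait before the HTML is read.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url):
    """Load url in headless Chrome and return the HTML after JavaScript has run."""
    # Imported here so the parsing helper below still works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()


def extract_link_texts(html):
    """Return (text, href) pairs for every anchor that has visible text."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for a in soup.find_all("a", href=True):
        text = a.get_text(strip=True)
        if text:
            pairs.append((text, a["href"]))
    return pairs


if __name__ == "__main__":
    # Placeholder URL; substitute the page you actually want to scrape.
    html = fetch_rendered_html("https://news.google.com/")
    for text, href in extract_link_texts(html):
        print(text, href)
```

Keeping the fetch and the parse in separate functions means the BeautifulSoup part can be tried on a saved HTML file without starting a browser.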

User

#2 · edited 2024-05-18 23:27:40

I answered a similar question here.

Code (I added two extra lines here to extract the article summary):

from bs4 import BeautifulSoup
import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get(
    'https://www.google.com/search?hl=en-US&q=best+coockie&tbm=nws&sxsrf=ALeKk009n7GZbzUhUpsMTt89rigSAluBsQ%3A1616683043826&ei=I6BcYP_OMeGlrgTAwLpA&oq=best+coockie&gs_l=psy-ab.3...325216.326993.0.327292.12.12.0.0.0.0.163.1250.2j9.11.0....0...1c.1.64.psy-ab..1.0.0....0.305S8ngx0uo',
    headers=headers)

html = response.text
soup = BeautifulSoup(html, 'lxml')

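# Each news card is a div with class "dbsr"; these class names are tied
# to the page markup at the time of writing and may change.
for heading in soup.find_all('div', class_='dbsr'):
    title = heading.find('div', class_='JheGif nDgy9d').text
    summary = heading.find('div', class_='Y3v8qd').text
    link = heading.a['href']
    print(title)
    print(summary)
    print(link)
    print()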
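The loop above raises AttributeError as soon as one of the `find` calls returns None, which happens whenever a card lacks a title or summary or the class names no longer match. A more defensive variant could look like the sketch below. The function name `extract_news` and its parameters are my own, and the default class names are simply the ones used above, so treat them as assumptions about the markup.

```python
from bs4 import BeautifulSoup


def extract_news(html, card_class="dbsr", title_class="JheGif", summary_class="Y3v8qd"):
    """Return a list of dicts with title, summary and link for each news card."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for card in soup.find_all("div", class_=card_class):
        title = card.find("div", class_=title_class)
        summary = card.find("div", class_=summary_class)
        link = card.find("a", href=True)
        # Skip cards that lack the essentials instead of crashing on None.
        if title is None or link is None:
            continue
        results.append({
            "title": title.get_text(strip=True),
            "summary": summary.get_text(strip=True) if summary else "",
            "link": link["href"],
        })
    return results
```

If this returns an empty list for a live page, the class names have most likely changed and need to be looked up again in the browser's developer tools.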

Alternatively, you can use the Google News Result API from SerpApi.

Part of the JSON:

"news_results": [
  {
    "position": 1,
    "link": "https://abc7chicago.com/eisenhower-expressway-crash-wrong-way-chicago-traffic/10456033/",
    "title": "Eisenhower Expressway crashes: 5 killed in separate I-290 wrong-way crashes in Chicago, Forest Park",
    "source": "WLS-TV",
    "date": "16 hours ago",
    "thumbnail": "https://serpapi.com/searches/606340870574f50571da7bfd/images/2f5ade266f837059c67526895fb3916f7518aefbb5215951bb79d83871345dedc741519fefe9c85a8abb834360552c65898af6461c5709de.jpeg"
  }
]

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "chicago",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for news_result in results["news_results"]:
    print(f"Article summary: {news_result['snippet']}\n")

Output:

Article summary: A Chicago-based marijuana cultivator and dispenser that has rapidly grown 
into one of the nation's biggest pot firms is under federal ...

Article summary: With 2021 being a pivotal season for the Chicago Cubs and the direction of 
the franchise, here are three bold predictions you may see play out ...

Article summary: The Chicago Blackhawks have lacked puck management. With a team with high 
offensive upside in the Carolina Hurricanes, this cannot ...

Article summary: Chicago, IL - Lírica, Chicagos New Latin-American Inspired Restaurant and 
Bar,

Article summary: A father of three is lucky to be alive after what he describes as a failed 
carjacking that left him running for his life, and his car riddled with 
bullets, ...

Article summary: Robservations on the media beat: VSiN, the Las Vegas-based sports 
information network founded by a group of Chicago entrepreneurs in ...

Article summary: In the day's first reported shooting a man was shot about 2 a.m. in the 
2700 block of South Karlov Avenue.

Article summary: Cameo, the Chicago-based startup that lets users buy video shout-outs from 
celebrities, has banked $100 million in Series C funding   which ...

Article summary: CHICAGO (CBS) — Although much of the contemporary discussion of COVID-19 
center around rolling out the vaccines, there are still people ...
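Note that the JSON fragment shown earlier has no `snippet` field, so indexing `news_result['snippet']` directly can raise KeyError for such an entry. A small sketch that tolerates missing fields, working on the dict returned by `search.get_dict()`. The helper name `format_news_results` and the fallback strings are my own choices, not part of the SerpApi client.

```python
def format_news_results(results):
    """Return one printable line per news result, tolerating missing fields."""
    lines = []
    # "news_results" may be absent entirely, e.g. when the search returned nothing.
    for item in results.get("news_results", []):
        title = item.get("title", "(untitled)")
        source = item.get("source", "unknown source")
        snippet = item.get("snippet", "(no summary available)")
        lines.append(f"{title} [{source}]: {snippet}")
    return lines


if __name__ == "__main__":
    # Hand-written sample shaped like the JSON fragment above.
    sample = {"news_results": [{"position": 1, "title": "Example headline", "source": "WLS-TV"}]}
    for line in format_news_results(sample):
        print(line)
```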

Disclaimer: I work for SerpApi.

User

#3 · edited 2024-05-18 23:27:40

This does the trick:

soup.text

which contains the text of the results.

To extract the URLs, do the following:

for link in soup.find_all('a', href=True):
    print(link['href'])

Full code:

from bs4 import BeautifulSoup
import requests

url_search = 'https://www.google.com/search?biw=2513&bih=817&tbm=nws&sxsrf=ALeKk03-PpUbGxYQpIcp6OcJULFASqa_tA%3A1612525818528&ei=-jAdYKb1H9yf1fAPxrac8AU&q=test&oq=test&gs_l=psy-ab.3..0l10.1628056.1628435.0.1628556.4.4.0.0.0.0.112.340.3j1.4.0....0...1c.1.64.psy-ab..0.4.338....0.H4wnL6N3kBo'
code = requests.get(url_search)
soup = BeautifulSoup(code.text, "html.parser")
print(soup.text)

for link in soup.find_all('a', href=True):
    print(link['href'])
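The loop above prints every link on the page, including Google's own navigation links. In the HTML served to a plain requests fetch, result links often come wrapped as `/url?q=<target>&sa=...`; that is an assumption about the current markup rather than something guaranteed. Under that assumption, a standard-library helper (the name `unwrap_google_link` is mine) can pull out just the article URLs:

```python
from urllib.parse import urlparse, parse_qs


def unwrap_google_link(href):
    """Return the target URL of a '/url?q=...' redirect link, or None for other links."""
    parsed = urlparse(href)
    if parsed.path != "/url":
        return None
    # parse_qs also percent-decodes the value of the q parameter.
    targets = parse_qs(parsed.query).get("q")
    if not targets:
        return None
    target = targets[0]
    return target if target.startswith(("http://", "https://")) else None


if __name__ == "__main__":
    hrefs = [
        "/url?q=https://example.com/story&sa=U&ved=abc",
        "/search?q=test&tbm=nws",
    ]
    for href in hrefs:
        target = unwrap_google_link(href)
        if target:
            print(target)
```

In the full script, each `link['href']` would be passed through this helper and the None results skipped.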
