Parsing Google News with BeautifulSoup

Published 2024-05-18 23:27:40


I am trying to parse the title and text of each news item in the Google News search results for "test".

The search URL is: https://www.google.com/search?biw=2513&tbm=nws&sxsrf=ALeKk02tev7vVkPiKz3E20Lih1-7Ol8SBw%3A1612526096099&ei=EDIdYNXbBdmc1fAPid678A0&q=test&oq=test&gs_l=psy-ab.3..0l10.25658.26016.0.26105.4.4.0.0.0.0.74.204.3.3.0....0...1c.1.64.psy-ab..1.3.202....0.y_53L-Gyyyw

Each element contains the g-card tag:

When I try to parse it with the following:

from bs4 import BeautifulSoup
import requests

url="https://www.google.com/search?q=bitcoin&sxsrf=ALeKk00r2AqKlBSgzF1T_zG1uQBaBKSN1g:1612525788197&source=lnms&tbm=nws&sa=X&ved=2ahUKEwji6q7W1tLuAhW0ShUIHSGmBpoQ_AUoAXoECBcQAw&biw=2513&bih=1315"
code=requests.get(url)
soup=BeautifulSoup(code.text,"html.parser")
soup.find_all("g-card")

The result is an empty list:

[]

How should I modify find_all so that it returns the news results and lets me pick out the title and text of each one?


3 Answers

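The site you are trying to parse is dynamic, meaning JavaScript has to run in a browser to render the HTML you see.

So fetching the HTML with requests only returns the page source as it is before any JavaScript runs.

To parse a dynamic site, you therefore need something like Selenium to run the JavaScript in a browser; you can then take the rendered HTML from it and parse that with BeautifulSoup.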
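
A minimal sketch of that approach, assuming Selenium 4 with Chrome and a matching driver installed. The helper names fetch_rendered_html and count_tags are made up for illustration, and the sample markup is hand-written rather than real Google output. The parsing half is checked offline so it runs without a browser:

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url: str) -> str:
    """Load `url` in headless Chrome and return the HTML after scripts have run."""
    # Imported lazily so the parsing helper below works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()


def count_tags(html: str, tag_name: str) -> int:
    """Count how many `tag_name` elements appear in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all(tag_name))


# Offline check with a hand-written snippet (not real Google markup).
sample_html = "<div><g-card>first</g-card><g-card>second</g-card></div>"
print(count_tags(sample_html, "g-card"))

# With a browser and driver available, the same check on the live page would be:
# html = fetch_rendered_html("https://www.google.com/search?q=test&tbm=nws")
# print(count_tags(html, "g-card"))
```

Comparing count_tags on requests.get(url).text and on the Selenium output is a quick way to see whether the tag you are looking for only exists after rendering.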

I answered a similar question here.

Code (I added two extra lines here to extract the article summary):

from bs4 import BeautifulSoup
import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get(
    'https://www.google.com/search?hl=en-US&q=best+coockie&tbm=nws&sxsrf=ALeKk009n7GZbzUhUpsMTt89rigSAluBsQ%3A1616683043826&ei=I6BcYP_OMeGlrgTAwLpA&oq=best+coockie&gs_l=psy-ab.3...325216.326993.0.327292.12.12.0.0.0.0.163.1250.2j9.11.0....0...1c.1.64.psy-ab..1.0.0....0.305S8ngx0uo',
    headers=headers)

html = response.text
soup = BeautifulSoup(html, 'lxml')

for headings in soup.find_all('div', class_='dbsr'):
    title = headings.find('div', class_='JheGif nDgy9d').text
    summary = headings.find('div', class_='Y3v8qd').text
    link = headings.a['href']
    print(title)
    print(summary)
    print(link)
    print()
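
The loop above raises AttributeError as soon as one card lacks a title or summary div, because find returns None in that case. A slightly more defensive sketch is shown below. The class names are the ones from the code above and may well have changed since; the helpers text_or_none and extract_results and the sample markup are made up for illustration:

```python
from bs4 import BeautifulSoup


def text_or_none(parent, css_class):
    """Return the stripped text of the first matching div, or None if absent."""
    node = parent.find("div", class_=css_class)
    return node.get_text(strip=True) if node is not None else None


def extract_results(html):
    """Collect title, summary and link from every result card in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for card in soup.find_all("div", class_="dbsr"):
        anchor = card.find("a", href=True)
        results.append({
            "title": text_or_none(card, "JheGif nDgy9d"),
            "summary": text_or_none(card, "Y3v8qd"),
            "link": anchor["href"] if anchor is not None else None,
        })
    return results


# Hand-written sample; the second card deliberately has no summary div.
sample_html = """
<div class="dbsr">
  <a href="https://example.com/a">
    <div class="JheGif nDgy9d">First headline</div>
    <div class="Y3v8qd">First summary</div>
  </a>
</div>
<div class="dbsr">
  <a href="https://example.com/b">
    <div class="JheGif nDgy9d">Second headline</div>
  </a>
</div>
"""

for result in extract_results(sample_html):
    print(result)
```

An empty list from extract_results on a live response is a hint that the class names no longer match the current markup.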

Alternatively, you can use the Google News Result API from SerpApi.

Part of the JSON:

"news_results": [
  {
    "position": 1,
    "link": "https://abc7chicago.com/eisenhower-expressway-crash-wrong-way-chicago-traffic/10456033/",
    "title": "Eisenhower Expressway crashes: 5 killed in separate I-290 wrong-way crashes in Chicago, Forest Park",
    "source": "WLS-TV",
    "date": "16 hours ago",
    "thumbnail": "https://serpapi.com/searches/606340870574f50571da7bfd/images/2f5ade266f837059c67526895fb3916f7518aefbb5215951bb79d83871345dedc741519fefe9c85a8abb834360552c65898af6461c5709de.jpeg"
  }
]

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "chicago",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for news_result in results["news_results"]:
    print(f"Article summary: {news_result['snippet']}\n")

Output:

Article summary: A Chicago-based marijuana cultivator and dispenser that has rapidly grown 
into one of the nation's biggest pot firms is under federal ...

Article summary: With 2021 being a pivotal season for the Chicago Cubs and the direction of 
the franchise, here are three bold predictions you may see play out ...

Article summary: The Chicago Blackhawks have lacked puck management. With a team with high 
offensive upside in the Carolina Hurricanes, this cannot ...

Article summary: Chicago, IL - Lírica, Chicagos New Latin-American Inspired Restaurant and 
Bar,

Article summary: A father of three is lucky to be alive after what he describes as a failed 
carjacking that left him running for his life, and his car riddled with 
bullets, ...

Article summary: Robservations on the media beat: VSiN, the Las Vegas-based sports 
information network founded by a group of Chicago entrepreneurs in ...

Article summary: In the day's first reported shooting a man was shot about 2 a.m. in the 
2700 block of South Karlov Avenue.

Article summary: Cameo, the Chicago-based startup that lets users buy video shout-outs from 
celebrities, has banked $100 million in Series C funding   which ...

Article summary: CHICAGO (CBS) — Although much of the contemporary discussion of COVID-19 
center around rolling out the vaccines, there are still people ...
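
Note that the JSON excerpt shown earlier has no "snippet" key, so indexing news_result['snippet'] directly would raise KeyError for an entry like that. A small sketch of a more tolerant loop follows. It works on a plain dict, so no API key is needed to try it; the summarize helper and the fallback to the title are illustrative choices, not part of SerpApi:

```python
def summarize(results):
    """Return one printable line per news result, tolerating missing keys."""
    lines = []
    for item in results.get("news_results", []):
        # Prefer the snippet; fall back to the title when no snippet is present.
        text = item.get("snippet") or item.get("title") or "(no text)"
        lines.append(f"{item.get('position', '?')}. {text}")
    return lines


# Trimmed copy of the JSON excerpt shown earlier; it has no "snippet" key.
sample = {
    "news_results": [
        {
            "position": 1,
            "title": "Eisenhower Expressway crashes: 5 killed in separate I-290 wrong-way crashes in Chicago, Forest Park",
            "source": "WLS-TV",
            "date": "16 hours ago",
        }
    ]
}

for line in summarize(sample):
    print(line)
```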

Disclaimer: I work for SerpApi.

This does the trick:

soup.text

It contains the text of the results.

To get the URLs, do the following:

for link in soup.find_all('a', href=True):
    print(link['href'])

Full code:

from bs4 import BeautifulSoup
import requests

url_search = 'https://www.google.com/search?biw=2513&bih=817&tbm=nws&sxsrf=ALeKk03-PpUbGxYQpIcp6OcJULFASqa_tA%3A1612525818528&ei=-jAdYKb1H9yf1fAPxrac8AU&q=test&oq=test&gs_l=psy-ab.3..0l10.1628056.1628435.0.1628556.4.4.0.0.0.0.112.340.3j1.4.0....0...1c.1.64.psy-ab..0.4.338....0.H4wnL6N3kBo'
code=requests.get(url_search)
soup=BeautifulSoup(code.text,"html.parser")
print(soup.text)

for link in soup.find_all('a', href=True):
    print(link['href'])
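
The loop above prints every link on the page, including navigation and pagination links. Assuming the result links in the plain requests response are wrapped as /url?q=<target>&... redirects (an assumption about Google's markup, which changes over time), a sketch like the following could filter and unwrap them using only the standard library. The helper name unwrap_google_link and the sample hrefs are made up for illustration:

```python
from urllib.parse import parse_qs, urlparse


def unwrap_google_link(href):
    """Return the target of a '/url?q=...' redirect link, or None for other links."""
    parsed = urlparse(href)
    if parsed.path != "/url":
        return None
    targets = parse_qs(parsed.query).get("q")
    return targets[0] if targets else None


hrefs = [
    "/url?q=https://example.com/story&sa=U&ved=abc",  # made-up redirect link
    "/search?q=test&tbm=nws&start=10",                 # pagination link, skipped
    "https://support.google.com/websearch",            # absolute link, skipped
]
for href in hrefs:
    target = unwrap_google_link(href)
    if target:
        print(target)
```

In the full code, the same function could be applied to link['href'] inside the find_all loop.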
