Scraping URLs from search results with Python and BeautifulSoup

2 Answers

User

#1 · edited 2024-06-16 11:03:45

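Google News can be scraped fairly easily with requests and BeautifulSoup. Sending a user-agent header with the request is usually enough to get data back.

Check out the SelectorGadget Chrome extension, which lets you grab CSS selectors visually by clicking on the element you want to extract.

If you only want to extract the URLs from Google News, it is as simple as: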

for result in soup.select('.dbsr'):
    link = result.a['href']
# 10 links here..
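
The loop above assumes every `.dbsr` container has an `<a>` child with an `href`. Below is a minimal offline sketch of the same selector, run against a hand-written HTML fragment. The markup, the URLs, and the helper name `extract_links` are made up for illustration, and Google's real class names change over time. It skips containers that have no link instead of raising an error:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure the answer targets.
SAMPLE_HTML = """
<div class="dbsr"><a href="https://example.com/story-1">Story 1</a></div>
<div class="dbsr"><span>No link in this container</span></div>
<div class="dbsr"><a href="https://example.com/story-2">Story 2</a></div>
"""

def extract_links(html):
    """Return the href of the first <a href=...> inside each .dbsr container."""
    soup = BeautifulSoup(html, "html.parser")  # stdlib parser, no lxml needed
    links = []
    for result in soup.select(".dbsr"):
        anchor = result.find("a", href=True)  # None when no <a href=...> exists
        if anchor is not None:
            links.append(anchor["href"])
    return links

links = extract_links(SAMPLE_HTML)
print(links)
```

The same function can be pointed at `html.text` from a real response, provided the `.dbsr` class still matches the page.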

Code and an example that scrapes more in the online IDE:

from bs4 import BeautifulSoup
import requests  # the 'lxml' parser must be installed, but it does not need to be imported

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "yahoo finance BTC",
    "hl": "en",
    "gl": "us",
    "tbm": "nws",
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.dbsr'):
    link = result.a['href']
    print(link)

'''
https://finance.yahoo.com/news/riot-blockchain-reports-record-second-203000136.html
https://finance.yahoo.com/news/el-salvador-not-require-bitcoin-175818038.html
https://finance.yahoo.com/video/bitcoin-hovers-around-50k-paypal-155437774.html
... other links
'''
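
For reference, the `params` dictionary is simply encoded into the query string of the request. Here is a small standard-library sketch of what the resulting URL looks like. It uses `urllib.parse.urlencode` rather than `requests`, so it runs without any network access:

```python
from urllib.parse import urlencode

params = {
    "q": "yahoo finance BTC",  # search query
    "hl": "en",                # interface language
    "gl": "us",                # country to search from
    "tbm": "nws",              # restrict results to the News tab
}

# Spaces in the query are encoded as '+', and pairs are joined with '&'.
url = "https://www.google.com/search?" + urlencode(params)
print(url)
```

Pasting the printed URL into a browser is a quick way to check that the parameters select the page you expect before writing selectors for it.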

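Alternatively, you can use the Google News Results API from SerpApi to achieve the same result. It's a paid API with a free plan.

The difference is that you don't have to figure out how to extract elements, maintain the parser over time, or get around blocks from Google.

Code to integrate: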

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "coca cola",
  "tbm": "nws",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for news_result in results["news_results"]:
  print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")

'''
Title: Coca-Cola Co. stock falls Monday, underperforms market
Link: https://www.marketwatch.com/story/coca-cola-co-stock-falls-monday-underperforms-market-01629752653-994caec748bb
... more results
'''
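
The loop above indexes `results["news_results"]` directly, which raises `KeyError` if that key is absent for any reason. Below is a minimal sketch of a more defensive version, run against a hand-written dictionary standing in for the API response. The sample data and the helper name `iter_news_links` are made up for illustration; only the `news_results`, `title`, and `link` keys are taken from the code above:

```python
def iter_news_links(results):
    """Yield (title, link) pairs, tolerating missing keys."""
    for item in results.get("news_results", []):
        link = item.get("link")
        if link:  # skip entries without a link
            yield item.get("title", ""), link

# Hypothetical stand-in for search.get_dict(); not real API output.
sample_results = {
    "news_results": [
        {"title": "Example headline", "link": "https://example.com/a"},
        {"title": "Entry without a link"},
    ]
}

pairs = list(iter_news_links(sample_results))
print(pairs)
```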

By the way, I wrote a blog post about how to scrape Google News in more detail (including pagination), with visual illustrations.

Disclaimer, I work for SerpApi.

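User

#2 · edited 2024-06-16 11:03:45

You may find that requests and bs4 are not the best tools for what you are trying to achieve. As balderman said in another comment, using a Google search API will be easier.

This code: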

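# Assumes the `google` package (pip install google), whose search() accepts `stop`
from googlesearch import search

tickers = ['GME', 'TSLA', 'BTC']
links_list = []
for ticker in tickers:
    # search() returns a generator, so materialize it into a list
    ticker_links = list(search(ticker, stop=25))
    links_list.append(ticker_links)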

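will build a list of the top 25 Google links for each ticker and append that list to another list. Yahoo Finance will very likely be among those links, and a simple keyword-based parser can pick out the Yahoo Finance URL for a specific ticker. You can also adjust the query passed to search() as you wish, for example ticker + " yahoo finance".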
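
The keyword-based parser mentioned above might look roughly like this. It is only a sketch: the helper name `yahoo_finance_links` and the sample URLs are made up, and it matches on the host name `finance.yahoo.com` rather than on a free-text keyword, to avoid false positives from URLs that merely mention Yahoo:

```python
from urllib.parse import urlparse

def yahoo_finance_links(links):
    """Keep only URLs whose host is finance.yahoo.com."""
    return [
        url for url in links
        if urlparse(url).netloc.lower() == "finance.yahoo.com"
    ]

# Hypothetical search results for one ticker.
sample_links = [
    "https://www.example.com/quote/GME",
    "https://finance.yahoo.com/quote/GME/",
    "https://en.wikipedia.org/wiki/GameStop",
]

matches = yahoo_finance_links(sample_links)
print(matches)
```

Applied to the code above, this would be called once per entry of `links_list`, provided each entry is a list of URL strings.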
