Alternative to scrolling a page with Selenium or Beautiful Soup?

Posted 2024-05-14 02:39:32


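I am trying to scrape a news page with infinite scrolling (thenextweb.com).

I have written a function that scrolls the page, but scrolling takes far too long. I have to use time.sleep() because my internet connection is weak and each new page needs time to load.

This is my scroll-down function. It is based on the solution from this question: https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python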

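import time
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()  # assumption: the original does not show how browser was created

def scrolldown(urltoscroll):
    browser.get(urltoscroll)
    last_height = browser.execute_script("return document.body.scrollHeight")
    next_button = browser.find_element(By.XPATH, '//*[@id="channelPaginate"]')
    while True:
        # Scroll to the bottom, then click the pagination button to load more
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(6)
        next_button.click()
        time.sleep(8)
        new_height = browser.execute_script("return document.body.scrollHeight")
        time.sleep(6)
        # Stop once the page height no longer grows
        if new_height == last_height:
            break
        last_height = new_height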

Is there an easier way to handle this type of page?

Thanks, everyone.

Edit: the link I want to scrape is https://thenextweb.com/plugged/. I want to get the article hrefs.


2 answers

Here is a sample Selenium snippet you can use for this kind of task. It opens the YouTube search results for "enumerate python" and scrolls down to the video titled "Enumerate python tutorial(2020)."

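from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()  # assumption: any WebDriver works here
driver.get('https://www.youtube.com/results?search_query=enumerate+python')
target = driver.find_element(By.LINK_TEXT, 'Enumerate python tutorial(2020).')
target.location_once_scrolled_into_view  # reading this property scrolls the element into view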

You can apply the same approach to your news-scraping code.
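If the fixed time.sleep() calls from the question are the main cost, one option is to poll for a change and continue as soon as it appears. The following is a minimal sketch, not the asker's or the answerer's code: wait_for_change and its parameters are hypothetical names, and a plain iterator stands in for the browser so the example runs without Selenium.

```python
import time


def wait_for_change(read_value, old_value, timeout=10.0, poll=0.25):
    """Poll read_value() until it differs from old_value or timeout elapses.

    Returns the new value, or None if nothing changed in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = read_value()
        if value != old_value:
            return value
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


# Stand-in for the browser (assumption for illustration only):
# the "page height" grows on the third read.
heights = iter([1000, 1000, 1800])
new_height = wait_for_change(lambda: next(heights), old_value=1000, poll=0.01)
print(new_height)  # 1800
```

With a real driver, read_value could be something like lambda: browser.execute_script("return document.body.scrollHeight"). Selenium's own WebDriverWait(driver, timeout).until(...) offers the same kind of polling, so a hand-written helper is only one way to do it.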

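Well, the scroll-down action seems to trigger an API call, which you can simulate with the requests module to load each page.

Here is an example for the latest news section: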

  import requests
  from bs4 import BeautifulSoup

  ## Fetch one page of news from the AJAX endpoint
  def getNews(page):
      headers = {
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
          'Accept': 'text/html, */*; q=0.01',
          'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
          'X-Requested-With': 'XMLHttpRequest',
          'Connection': 'keep-alive',
          'Pragma': 'no-cache',
          'Cache-Control': 'no-cache',
      }

      params = (
          ('page', page),
          ('slug', ''),
          ('taxo', ''),
      )

      response = requests.get('https://thenextweb.com/wp-content/themes/cyberdelia/ajax/partials/grid-pager.php', headers=headers, params=params)
      return response.content
  
  ## Loop through page
  for page in range(2):
      print("Page", page)
      soup = BeautifulSoup(getNews(page), 'html.parser')

      ## Some simple data processing
      for news in soup.find_all('li'):
          news_div  = news.find('div',{'class':'story-text'})
          #Check if the li contains the desired info
          if news_div is None: continue
          print("News headline:", news_div.find('a').text.strip())
          print("News link:", news_div.find('a').get('href'))
          print("News extract:", news_div.find('p', {'class':'story-chunk'}).text.strip())
          print("#"*10)
      print()

Output

Page 0
##########
News headline: Can AI convincingly answer existential questions?
News link: https://thenextweb.com/neural/2020/07/06/study-tests-whether-ai-can-convincingly-answer-existential-questions/
News extract: A new study has explored whether AI can provide more attractive answers to existential questions than history's most influential ...
##########
News headline: Here are the Xbox Series X games we think Microsoft will show off on July 23
News link: https://thenextweb.com/gaming/2020/07/06/xbox-series-x-games-microsoft-show-off-july-23/
News extract: Microsoft will be showing off its first-party Xbox Series X games at the end of the month. We can guess what we might be ...
##########
News headline: Uber buys Postmates for $2.65 billion — and traders are into it
News link: https://thenextweb.com/hardfork/2020/07/06/uber-stock-postmates-buyout-acquisition-billion/
News extract: Uber's $2.65 billion Postmates all-stock acquisition comes less than a month after talks to buy rival GrubHub fell through. ...
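Since the question asks for the article hrefs, the page loop above can be extended to keep requesting pages until one yields nothing new. This is a rough sketch under assumptions: collect_links and fetch_links are hypothetical names, and it is not confirmed that the endpoint returns an empty or repeated page once the articles run out, so max_pages acts as a safety limit. A small dictionary stands in for the site so the example runs offline.

```python
def collect_links(fetch_links, max_pages=50):
    """Call fetch_links(page) for page 0, 1, 2, ... and gather unique links.

    Stops at the first page that adds nothing new, or after max_pages.
    """
    seen = set()
    ordered = []
    for page in range(max_pages):
        added = 0
        for link in fetch_links(page):
            if link not in seen:
                seen.add(link)
                ordered.append(link)
                added += 1
        if added == 0:
            # No new links on this page: assume the listing is exhausted
            break
    return ordered


# Fake site used only for illustration: page number -> links on that page
fake_site = {0: ["/a", "/b"], 1: ["/b", "/c"], 2: []}
links = collect_links(lambda page: fake_site.get(page, []))
print(links)  # ['/a', '/b', '/c']
```

In practice, fetch_links could wrap getNews(page) and return the href of each story-text link found by BeautifulSoup, which replaces the scroll-until-height-stops-changing loop with a request-until-empty loop.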
