Why does Selenium scrape only the first page?

import scrapy
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from scrapy import Spider, Request
from scrapy import signals
from scrapy.http import HtmlResponse
import time
import os


class WebnewsSpider(scrapy.Spider):
    name = 'webnews'
    allowed_domains = ['www.hamariweb.com']
    start_urls = ['https://hamariweb.com/news/newscategory.aspx?cat=3']

    def __init__ (self):
        options = webdriver.ChromeOptions()
        options.add_argument("--start-maximized")
        self.driver=webdriver.Chrome("C://Users//hammad//Downloads// chromedriver",chrome_options=options)

    def parse(self, response):
        self.driver.get(response.url)
        pause_time = 1
        last_height = self.driver.execute_script("return document.body.scrollHeight")
        #start = datetime.datetime.now()
        for i in range(10):
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight + 400);")
            time.sleep(pause_time)
            url2=response.xpath('.//*[@class="news_img"]/a/@href').extract()
            print("\n\n\n",url2,"\n\n\n")
            new_height = self.driver.execute_script("return document.body.scrollHeight")
        self.driver.close()
        #print("\n\n",len(l))

1 answer

User

#1 · Posted on 2024-04-26 12:13:18

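Steps:

Find the latest article/post in the current view.
Scroll down to that latest post to trigger "load more data".

More information:

You can simply run document.querySelectorAll('#CatNewsList > div').length; the result is the number of posts. Iterate over each post and extract its URL:

CSS selector: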

#CatNewsList > div .news_img > a

Now you can read the 'href' attribute of each matched tag and extract the link.
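The counting and link extraction described above could be sketched as follows. This is a minimal sketch, assuming `driver` is a Selenium WebDriver that has already loaded the category page; the helper names `count_posts` and `extract_links` are hypothetical and not part of the original answer. Both helpers read from the live DOM through `execute_script`, so they see posts added after scrolling.

```python
# Selectors taken from the answer above.
POST_SELECTOR = "#CatNewsList > div"
LINK_SELECTOR = ".news_img > a"


def count_posts(driver):
    """Return the number of posts currently rendered in the browser."""
    return driver.execute_script(
        "return document.querySelectorAll(arguments[0]).length;", POST_SELECTOR
    )


def extract_links(driver, start=0):
    """Return the hrefs of the posts from index `start` onward.

    Posts without a matching anchor are skipped.
    """
    script = """
        const [postSelector, linkSelector, start] = arguments;
        const posts = Array.from(document.querySelectorAll(postSelector));
        return posts.slice(start)
            .map(post => post.querySelector(linkSelector))
            .filter(anchor => anchor !== null)
            .map(anchor => anchor.href);
    """
    return driver.execute_script(script, POST_SELECTOR, LINK_SELECTOR, start)
```

Calling `extract_links(driver, start=previous_count)` returns only the links of posts that appeared since the last pass, where `previous_count` is whatever `count_posts` returned earlier.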

When you reach the last article, scroll to the bottom and wait until the element matching the XPath //p[text()='loading more news... '] is no longer visible.
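One way to sketch this scroll-and-wait step is a small polling loop. It assumes `driver` is a Selenium WebDriver on the category page; `loader_visible` and `scroll_and_wait` are hypothetical helper names, and treating `offsetParent !== null` as "visible" is a simplifying assumption that may not hold for every page layout.

```python
import time

# XPath of the loading indicator, as given in the answer above.
LOADER_XPATH = "//p[text()='loading more news... ']"


def loader_visible(driver):
    """Return True if the loading indicator exists and is rendered."""
    script = """
        const node = document.evaluate(
            arguments[0], document, null,
            XPathResult.FIRST_ORDERED_NODE_TYPE, null
        ).singleNodeValue;
        // Assumption: a hidden element has no offsetParent.
        return node !== null && node.offsetParent !== null;
    """
    return bool(driver.execute_script(script, LOADER_XPATH))


def scroll_and_wait(driver, timeout=10.0, poll=0.5):
    """Scroll to the bottom, then poll until the loader is hidden.

    Returns True once the loader is no longer visible, False on timeout.
    """
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    deadline = time.monotonic() + timeout
    time.sleep(poll)  # give the page a moment to show the loader
    while time.monotonic() < deadline:
        if not loader_visible(driver):
            return True
        time.sleep(poll)
    return False
```

Selenium's own `WebDriverWait` combined with `expected_conditions.invisibility_of_element_located` is the more conventional way to express the same wait; the manual loop is used here only to keep the sketch free of extra imports.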

That way you can be sure that no new page is still loading. Remember the previous post count and resume parsing from that index up to the new count.

Repeat.
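Put together, the repeat loop might look like the sketch below. Note that in the question's spider, `response.xpath(...)` queries the HTML that Scrapy downloaded before any scrolling happened, which is why it keeps returning the first page; the sketch reads from the Selenium-controlled browser instead. The function name `scrape_all_links`, the `max_rounds` and `pause` parameters, and the fixed sleep in place of a loader check are all assumptions made for illustration.

```python
import time

POSTS_JS = "return document.querySelectorAll('#CatNewsList > div').length;"
LINKS_JS = """
    const posts = Array.from(document.querySelectorAll('#CatNewsList > div'));
    return posts.slice(arguments[0])
        .map(post => post.querySelector('.news_img > a'))
        .filter(anchor => anchor !== null)
        .map(anchor => anchor.href);
"""
SCROLL_JS = "window.scrollTo(0, document.body.scrollHeight);"


def scrape_all_links(driver, max_rounds=10, pause=1.0):
    """Collect post links, scrolling until no new posts appear."""
    links = []
    seen = 0  # number of posts already parsed
    for _ in range(max_rounds):
        total = driver.execute_script(POSTS_JS)
        if total == seen:
            break  # the last scroll loaded nothing new
        # Parse only the posts added since the previous round.
        links.extend(driver.execute_script(LINKS_JS, seen))
        seen = total
        driver.execute_script(SCROLL_JS)
        time.sleep(pause)  # crude wait; a loader check is more robust
    return links
```

Inside the spider's `parse` method, the returned URLs could then be turned into `Request` objects for Scrapy to follow.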
