I'm scraping this website: https://www.dccomics.com/comics
If you keep scrolling down, you will find a "browse comics"
section with pagination.
I want to scrape all 25 comics from each of pages 1 through 5.
This is the code I have so far:
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time


class Scraper():
    comics_url = "https://www.dccomics.com/comics"
    driver = webdriver.Chrome("C:\\laragon\\www\\Proftaak\\chromedriver.exe")
    # driver = webdriver.Chrome("C:\\laragon\\www\\proftaak-2020\\Proftaak-scraper\\chromedriver.exe")
    driver.get(comics_url)
    driver.implicitly_wait(500)
    current_page = 2

    def GoToComic(self):
        # Open each of the 25 comics on the current results page in turn.
        for i in range(1, 26):
            time.sleep(2)
            goToComic = self.driver.find_element_by_xpath(
                f'//*[@id="dcbrowseapp"]/div/div/div/div[3]/div[3]/div[2]/div[{i}]/a/img')
            self.driver.execute_script("arguments[0].click();", goToComic)
            self.ScrapeComic()
            self.driver.back()
            self.ClearFilter()
            # After the last comic on a page, move to the next page (stop after page 5).
            if self.current_page != 6:
                if i == 25:
                    self.current_page += 1
                    self.ToNextPage()

    def ScrapeComic(self):
        # Collect the fields from the comic's detail page.
        self.driver.implicitly_wait(250)
        title = [my_elem.text for my_elem in WebDriverWait(self.driver, 5).until(
            EC.visibility_of_all_elements_located(
                (By.XPATH, "//div[contains(@class, 'page-title')]")))]
        price = [my_elem.text for my_elem in WebDriverWait(self.driver, 5).until(
            EC.visibility_of_all_elements_located(
                (By.XPATH, "//div[contains(@class, 'buy-container-price')]/span[contains(@class, 'price')]")))]
        available = [my_elem.text for my_elem in WebDriverWait(self.driver, 5).until(
            EC.visibility_of_all_elements_located(
                (By.XPATH, "//div[contains(@class, 'sale-status-container')]/span[contains(@class, 'sale-status')]")))]
        try:
            description = [my_elem.text for my_elem in WebDriverWait(self.driver, 5).until(
                EC.visibility_of_all_elements_located((By.CLASS_NAME, "field-items")))]
        except:
            # Some comics have no description; skip them.
            return

    def ToNextPage(self):
        # Click the pagination link for the next page (pages 1-5 only).
        if self.current_page != 6:
            nextPage = self.driver.find_element_by_xpath(
                f'//*[@id="dcbrowseapp"]/div/div/div/div[3]/div[3]/div[3]/div[1]/ul/li[{self.current_page}]/a')
            self.driver.execute_script("arguments[0].click();", nextPage)
            self.GoToComic()

    def AcceptCookies(self):
        # Dismiss the cookie banner so it can't intercept clicks.
        self.driver.implicitly_wait(250)
        cookies = self.driver.find_element_by_xpath(
            '/html/body/div[1]/div[2]/div[4]/div[2]/div/button')
        self.driver.execute_script("arguments[0].click();", cookies)
        self.driver.implicitly_wait(100)

    def ClearFilter(self):
        # Reset any active browse filters.
        self.driver.implicitly_wait(500)
        clear_filter = self.driver.find_element_by_class_name('clear-all-action')
        self.driver.execute_script("arguments[0].click();", clear_filter)

    def QuitDriver(self):
        self.driver.quit()


scraper = Scraper()
scraper.AcceptCookies()
scraper.ClearFilter()
scraper.GoToComic()
scraper.QuitDriver()
Right now it scrapes the first 25 comics of the first page just fine, but the problem appears when I move to the second page: it scrapes the first comic of page 2 fine, but when I go back from that comic to the list, the filter resets and it starts from page 1 again.
How can I make it resume from the correct page, or keep the filter off before returning to the comics page? I have tried things like sessions/cookies, but it seems the filter is not persisted in any way.
The browse comics section of the page https://www.dccomics.com/comics doesn't have 5 pages of pagination, only 3. To scrape the name of each comic using Selenium and Python, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use an xpath-based locator strategy, e.g. as in the sketch below.
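A minimal sketch of that approach, assuming the comic titles can be located through a result-title class inside the dcbrowseapp container (that locator is an assumption about the page markup, not confirmed here):

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("C:\\laragon\\www\\Proftaak\\chromedriver.exe")
driver.get("https://www.dccomics.com/comics")

# Wait until every title element on the current results page is visible,
# then collect the text of each one. The class name below is an assumed
# locator for the browse results, not confirmed against the live page.
titles = [elem.text for elem in WebDriverWait(driver, 20).until(
    EC.visibility_of_all_elements_located(
        (By.XPATH, "//*[@id='dcbrowseapp']//p[contains(@class, 'result-title')]")))]
print(titles)

driver.quit()

Running it prints the list of title strings for the currently loaded results page.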
The browser's back function takes you to the previously visited URL. On the site you mention, all result pages live at a single URL (the comics appear to be loaded into the same page by JS, so a new page of results doesn't need a new URL). That is why, when you go back from the first comic of the second page, you simply reload https://www.dccomics.com/comics, which loads the first page by default. I can also see that there is no dedicated control for going back from a comic's detail page to the list.
So the only way is to store the number of the current page somewhere in your code and, after returning from the comic detail page, switch back to that specific page, e.g. as sketched below.
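A minimal sketch of that fix against the class from the question: it reuses the pagination xpath from ToNextPage and assumes that, while page N is being scraped, current_page holds that page's index in the pagination list (which matches how the question's code increments it); the method name GoBackToList is hypothetical:

    def GoBackToList(self):
        # Hypothetical helper: go back from a comic detail page, then
        # restore the stored results page. driver.back() always reloads
        # https://www.dccomics.com/comics with page 1 as the default.
        self.driver.back()
        self.ClearFilter()
        # Page 1 is the default, so only later pages need re-selecting.
        if self.current_page > 2:
            page_link = self.driver.find_element_by_xpath(
                f'//*[@id="dcbrowseapp"]/div/div/div/div[3]/div[3]/div[3]'
                f'/div[1]/ul/li[{self.current_page}]/a')
            self.driver.execute_script("arguments[0].click();", page_link)

GoToComic would then call self.GoBackToList() in place of the self.driver.back() / self.ClearFilter() pair, so each iteration resumes on the page it left.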