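How can I use Selenium in Python to iterate over a list of items and extract images?
I am trying to scrape data from a website with Selenium. The site shows a list of items; each item has several attributes and carries specific data-aut-id attributes. I iterate over the items in a loop and extract the data, but from the 11th iteration onward the 'src' attribute is no longer retrieved.
Here is my code. I hope someone can help me solve this.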
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
links = []
image_sources = []

url = "https://www.olx.co.id/mobil_c86"
driver.get(url)
ul_element = driver.find_element(By.CSS_SELECTOR, "ul[data-aut-id='itemsList']")
time.sleep(1)
li_elements = ul_element.find_elements(By.CSS_SELECTOR, "li[data-aut-id='itemBox']")
for li_element in li_elements:
    time.sleep(3)
    try:
        # Extract link
        link_element = li_element.find_element(By.CSS_SELECTOR, "a")
        link = link_element.get_attribute("href")
        links.append(link)
    except NoSuchElementException:
        links.append(None)
    try:
        # Extract image source
        image_element = li_element.find_element(By.CSS_SELECTOR, "img")
        image_source = image_element.get_attribute("src")
        image_sources.append(image_source)
    except NoSuchElementException:
        image_sources.append(None)
    # Extract price, year, title, location
2 Answers
0
For each link, you can use this:
link_elements = driver.find_elements(By.CSS_SELECTOR, "._1DNjI a")
For the images, you can use this:
img_elements = driver.find_elements(By.CSS_SELECTOR, "._1DNjI a ._3UrC5 img")
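Here is a minimal sketch of how those two selectors could be turned into lists of attribute values. It assumes the class names _1DNjI and _3UrC5 still match the page (they look auto-generated and may change). collect_attributes is a hypothetical helper name, not part of Selenium; it only relies on find_elements(by, value) and get_attribute(name), and "css selector" is the string value of By.CSS_SELECTOR.

```python
def collect_attributes(root, css_selector, attribute):
    """Return `attribute` for every element under `root` matching `css_selector`.

    `root` can be a WebDriver or a WebElement: both expose
    find_elements(by, value). Elements that lack the attribute yield None.
    """
    # "css selector" is the value of selenium's By.CSS_SELECTOR constant.
    elements = root.find_elements("css selector", css_selector)
    return [element.get_attribute(attribute) for element in elements]
```

With a live driver this would be called as collect_attributes(driver, "._1DNjI a", "href") for the links and collect_attributes(driver, "._1DNjI a ._3UrC5 img", "src") for the images. Note that this alone does not make images load that the page has not rendered yet.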
1
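The elements at the bottom of your list are loaded asynchronously, so they are not all rendered at once. You need some user action, such as scrolling the page, to trigger their loading.
For example, you can use ActionChains to scroll to the "load more" button before reading the image src attributes. That at least triggers the rendering logic for the list items at the bottom.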
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
links = []
image_sources = []
wait = WebDriverWait(driver, 20)
url = "https://www.olx.co.id/mobil_c86"
driver.get(url)
load_more = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '[data-aut-id=btnLoadMore]')))
ActionChains(driver).scroll_to_element(load_more).perform()
ul_element = driver.find_element(By.CSS_SELECTOR, "ul[data-aut-id='itemsList']")
li_elements = ul_element.find_elements(By.CSS_SELECTOR, "li[data-aut-id='itemBox']")
for li_element in li_elements:
    link_element = li_element.find_element(By.CSS_SELECTOR, "a")
    link = link_element.get_attribute("href")
    links.append(link)
    image_element = li_element.find_element(By.CSS_SELECTOR, "img")
    image_source = image_element.get_attribute("src")
    image_sources.append(image_source)
    print(image_source)
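If scrolling to the button once is not enough, a variation on the same idea is to scroll each item into view and wait briefly for its image before reading src. The sketch below assumes the page fills in src only for items near the viewport, which I have not verified for this site. read_image_src is a hypothetical helper name; it uses only execute_script, find_elements and get_attribute from Selenium, plus a plain polling loop in place of WebDriverWait.

```python
import time


def read_image_src(driver, item, timeout=10.0, poll_interval=0.25):
    """Scroll `item` into view, then poll its first <img> until `src` is set.

    Returns the src string, or None if no non-empty src shows up within
    `timeout` seconds (or the item has no <img> at all).
    """
    # scrollIntoView is a standard DOM method; execute_script hands `item`
    # to the page as arguments[0].
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", item)

    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the value of selenium's By.CSS_SELECTOR constant.
        images = item.find_elements("css selector", "img")
        src = images[0].get_attribute("src") if images else None
        if src:
            return src
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
```

In the loop above, image_source = read_image_src(driver, li_element) would replace the two lines that look up the img element and read its src. Items whose image never loads end up as None instead of raising NoSuchElementException.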