Unexpected behavior when scraping images with Selenium

Posted 2024-05-23 19:38:13


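Can someone help me understand why my function below does not return a result for every URL in the list I pass as an argument, and why I get the output shown further down? I am just trying to return each item's URL together with a list of all the images that belong to that item.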

beta_test_items = ['https://www.facebook.com/marketplace/item/2009940172578816',
 'https://www.facebook.com/marketplace/item/1591865710899243']

from selenium import webdriver
from time import sleep
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

def scrape_item_details(beta_test_items):
    #finish this function
    for url in beta_test_items:
        images = []
        driver.get(url)
        sleep(3)
        image_element = driver.find_element_by_xpath('//img[contains(@class, "_5m")]')
        images = [image_element.get_attribute('src')]

        try:
            previous_and_next_buttons = driver.find_elements_by_xpath("//i[contains(@class, '_3ffr')]")
            next_image_button = previous_and_next_buttons[1]
            print(next_image_button.text)
            if next_image_button.is_displayed():
                next_image_button.click()

                image_element = driver.find_element_by_xpath('//img[contains(@class, "_5m")]')
                print(image_element.get_attribute('src'))
                sleep(2)   

                if image_element.get_attribute('src') in images:
                    pass
                else:
                    images.append(image_element.get_attribute('src'))

            else:
                pass
        except:
            pass

        yield url, images

if __name__ == '__main__':

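When I run it as it currently stands, I get the following output, and I don't understand why it stops at the first URL once the second photo has been appended to the image list: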

In [46]: scrape_item_details(beta_items_list)
['https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27750896_2002108023449096_2229019388723795634_n.jpg?oh=26d3fe06595affdcbd142754766fe934&oe=5B0933C9']
Next
https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27655331_2002108026782429_4575620607831413757_n.jpg?oh=a7c94bc2b8ef8b39bc65291b641f7953&oe=5B0A11DD
Out[46]: 
('https://www.facebook.com/marketplace/item/2009940172578816',
 ['https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27750896_2002108023449096_2229019388723795634_n.jpg?oh=26d3fe06595affdcbd142754766fe934&oe=5B0933C9',
  'https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27655331_2002108026782429_4575620607831413757_n.jpg?oh=a7c94bc2b8ef8b39bc65291b641f7953&oe=5B0A11DD'])
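The single tuple in `Out[46]` is what a `return` placed inside the `for` loop would produce: the function exits during the first iteration, so the second URL is never visited. A minimal sketch of the difference, with no browser involved; the function names `first_only` and `every_item` and the `example.com` URLs are made up for illustration:

```python
def first_only(urls):
    """Loop with `return`: the function ends on the first iteration."""
    for url in urls:
        images = [url + "/image-1.jpg"]  # placeholder data, not scraped
        return (url, images)


def every_item(urls):
    """Loop with `yield`: one (url, images) pair is produced per URL."""
    for url in urls:
        images = [url + "/image-1.jpg"]  # placeholder data, not scraped
        yield (url, images)


urls = ["https://example.com/item/1", "https://example.com/item/2"]

returned = first_only(urls)        # a single tuple for the first URL only
yielded = list(every_item(urls))   # a list with one tuple per URL

print(returned)
print(len(yielded))
```

Note that calling a generator function only creates a generator object; nothing runs until it is iterated, for example with `list(...)` or a `for` loop.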

----UPDATE---- I changed return to yield, and running list(scrape_item_details(beta_test_items)) now gives the following output:

[('https://www.facebook.com/marketplace/item/2009940172578816',
  ['https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27750896_2002108023449096_2229019388723795634_n.jpg?oh=26d3fe06595affdcbd142754766fe934&oe=5B0933C9',
   'https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27655331_2002108026782429_4575620607831413757_n.jpg?oh=a7c94bc2b8ef8b39bc65291b641f7953&oe=5B0A11DD',
   'https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27973017_1685674758138175_781683034741350935_n.jpg?oh=e2aa32aa73f3bb9061e861bd1ea306cb&oe=5B0741FF']),
 ('https://www.facebook.com/marketplace/item/1591865710899243',
  ['https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27750896_2002108023449096_2229019388723795634_n.jpg?oh=26d3fe06595affdcbd142754766fe934&oe=5B0933C9',
   'https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27655331_2002108026782429_4575620607831413757_n.jpg?oh=a7c94bc2b8ef8b39bc65291b641f7953&oe=5B0A11DD',
   'https://scontent-atl3-1.xx.fbcdn.net/v/t1.0-9/27973017_1685674758138175_781683034741350935_n.jpg?oh=e2aa32aa73f3bb9061e861bd1ea306cb&oe=5B0741FF'])]

I'm not sure why the images from the first URL are repeated as the result for the second URL.
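The cause of the repeated images cannot be determined from the output alone. One possibility, offered only as a guess, is that the browser was still showing the first listing when the XPath was evaluated for the second URL, since the code relies on fixed `sleep` calls. Separately, the function clicks the "next" button only once, so it cannot collect more than two images per pass. The sketch below illustrates one way to handle both: poll until the image `src` differs from the previous one, and keep clicking until a `src` repeats. It is plain Python written against two callables rather than Selenium itself; `wait_for_change`, `collect_image_srcs`, and `FakeCarousel` are hypothetical names introduced here.

```python
import time


def wait_for_change(read_value, previous, timeout=10.0, poll=0.25):
    """Poll read_value() until it differs from `previous`.

    Returns the new value, or None if nothing changed before the timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = read_value()
        if value is not None and value != previous:
            return value
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


def collect_image_srcs(read_src, click_next, previous_first=None,
                       max_images=50, timeout=10.0, poll=0.25):
    """Return each distinct image src of one listing, in display order.

    read_src:       callable returning the current main image src (or None)
    click_next:     callable that clicks "next"; returns False if it cannot
    previous_first: first src of the previously scraped listing, used to
                    detect a page that has not changed yet (assumption:
                    two listings do not share their first image)
    """
    first = wait_for_change(read_src, previous_first, timeout, poll)
    if first is None:
        return []  # the page never showed a new image
    images = [first]
    while len(images) < max_images:
        if not click_next():
            break
        src = wait_for_change(read_src, images[-1], timeout, poll)
        if src is None or src in images:
            break  # no change, or the carousel wrapped around
        images.append(src)
    return images


class FakeCarousel:
    """Stand-in for a browser page so the sketch runs without Selenium."""

    def __init__(self, srcs):
        self.srcs = srcs
        self.index = 0

    def read_src(self):
        return self.srcs[self.index]

    def click_next(self):
        self.index = (self.index + 1) % len(self.srcs)
        return True


carousel = FakeCarousel(["a.jpg", "b.jpg", "c.jpg"])
images = collect_image_srcs(carousel.read_src, carousel.click_next, timeout=0)
print(images)
```

With a real driver, `read_src` could be a small function wrapping `driver.find_element(By.XPATH, '//img[contains(@class, "_5m")]').get_attribute('src')`, and `click_next` could wrap the existing button lookup and click. Recent Selenium 4 releases no longer provide `find_element_by_xpath`, so the `By`-based form is used in that description. Replacing the bare `except: pass` with a narrower handler would also make it easier to see whether an exception is being swallowed during the second URL.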

