How do I load all elements of a web page with Selenium?

Asked 2025-04-12 09:24

My goal is to get the outerHTML of every element with class="Item--content--12o-RdR".
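To make the goal concrete, a minimal sketch of that extraction could look like the following. The helper names `class_selector` and `collect_outer_html` are hypothetical, and the sketch assumes that `driver` is a Selenium WebDriver and that the target elements sit in the main document (not inside an iframe or shadow root) and have already been rendered.

```python
from typing import Any, List

# JavaScript executed in the page: gather the outerHTML of every match.
_OUTER_HTML_JS = (
    "return Array.from(document.querySelectorAll(arguments[0]))"
    ".map(function (el) { return el.outerHTML; });"
)


def class_selector(class_name: str) -> str:
    """Build a CSS attribute selector that matches one whole class token."""
    return '[class~="{}"]'.format(class_name)


def collect_outer_html(driver: Any, class_name: str) -> List[str]:
    """Return the outerHTML of every element carrying class_name.

    Assumption: driver is a Selenium WebDriver, i.e. it exposes
    execute_script(script, *args).
    """
    result = driver.execute_script(_OUTER_HTML_JS, class_selector(class_name))
    # Normalise a missing result to an empty list
    return list(result or [])
```

Called as `collect_outer_html(driver, "Item--content--12o-RdR")` after `driver.get(url)`, this would return one string per matching element, or an empty list when nothing matches.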

The page I want to scrape is https://item.taobao.com/item.htm?id=767876514653

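However, some elements never load, even when I try to select them directly. For example:

 颜色分类: ("Color category:")

These elements only show up when I inspect the page with the browser's Inspect Element tool.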

Any help would be greatly appreciated. I am still fairly new to web scraping.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import TimeoutException

def load_and_capture_content(url):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")

    driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)
    element_content = None

    try:
        driver.get(url)
        xpath = '//*[@id="root"]/div/div[2]/div[2]/div[2]'

        try:
            # until() returns the located element, so no second lookup is needed
            target_element = WebDriverWait(driver, 5).until(
                EC.presence_of_element_located((By.XPATH, xpath))
            )

            # Use JavaScript to get the outerHTML to ensure dynamic attributes are included
            element_content = driver.execute_script("return arguments[0].outerHTML;", target_element)
        except TimeoutException:
            print("Timed out waiting for the element to load. Capturing only the full page content.")

        full_page_html = driver.page_source
    finally:
        # Always close the browser, even if an unexpected error occurs
        driver.quit()

    return element_content, full_page_html

url = "https://item.taobao.com/item.htm?id=767876514653"
if __name__ == "__main__":
    element_content, full_page_html = load_and_capture_content(url)
    
    if element_content:
        print("Element content captured.")
        with open('zxpath.html', 'w', encoding='utf-8') as file:
            file.write(element_content)

    print("Full page content captured.")
    with open('xfull_page_content.html', 'w', encoding='utf-8') as file:
        file.write(full_page_html)

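I have tried:

  1. Scrolling the page to trigger lazy loading
  2. Waiting longer
  3. Selecting the specific elements directly (with both XPath and CSS selectors)

None of these approaches worked.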
