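How can I load all elements of a web page with Selenium?
My goal is to get the outerHTML of every element with class="Item--content--12o-RdR".
The page I want to scrape is https://item.taobao.com/item.htm?id=767876514653
However, some elements never load, even when I try to select them directly. For example: 颜色分类 (color category):
These elements are only visible when I inspect the page in the browser.
Any help would be appreciated; I am still new to web scraping.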
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import TimeoutException


def load_and_capture_content(url):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)
    element_content = None
    try:
        driver.get(url)
        try:
            xpath = '//*[@id="root"]/div/div[2]/div[2]/div[2]'
            # Wait for the target element; until() returns the element it located
            target_element = WebDriverWait(driver, 5).until(
                EC.presence_of_element_located((By.XPATH, xpath))
            )
            # Use JavaScript to get the outerHTML to ensure dynamic attributes are included
            element_content = driver.execute_script("return arguments[0].outerHTML;", target_element)
        except TimeoutException:
            print("Timed out waiting for the element to load. Capturing only the full page content.")
        full_page_html = driver.page_source
    finally:
        # Always close the browser, even if loading or capturing fails
        driver.quit()
    return element_content, full_page_html


url = "https://item.taobao.com/item.htm?id=767876514653"

if __name__ == "__main__":
    element_content, full_page_html = load_and_capture_content(url)
    if element_content:
        print("Element content captured.")
        with open('zxpath.html', 'w', encoding='utf-8') as file:
            file.write(element_content)
    print("Full page content captured.")
    with open('xfull_page_content.html', 'w', encoding='utf-8') as file:
        file.write(full_page_html)
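I have tried:
- Scrolling to trigger lazy loading
- Waiting longer
- Selecting the specific elements directly (with both XPath and CSS selectors)

None of these worked.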
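
For reference, the scroll-and-wait approach described above could be combined into one polling helper, sketched below. This is a minimal sketch, not a confirmed fix for this page. It assumes the target elements are rendered by JavaScript after the initial load, and that the suffix in `Item--content--12o-RdR` is a build hash that may change, so it matches on the prefix `Item--content--` instead. The helper name `collect_outer_html` and its parameters are hypothetical. The only Selenium API it relies on is `driver.execute_script`.

```python
import time


def collect_outer_html(driver, class_prefix, timeout=20.0, poll=0.5):
    """Scroll and poll until elements whose class contains `class_prefix` settle.

    `driver` is any object exposing Selenium's `execute_script(script, *args)`.
    Returns a list of outerHTML strings (empty if nothing appeared in time).
    """
    # Substring match on the class attribute, so a changed hash suffix
    # (an assumption about this site) does not break the selector.
    selector = f'[class*="{class_prefix}"]'
    script = (
        # Scroll one viewport down to nudge lazily loaded sections.
        "window.scrollBy(0, window.innerHeight);"
        "return Array.from(document.querySelectorAll(arguments[0]))"
        ".map(function (el) { return el.outerHTML; });"
    )
    deadline = time.monotonic() + timeout
    previous = []
    while True:
        found = driver.execute_script(script, selector) or []
        # Done once at least one match exists and the count stopped growing.
        if found and len(found) == len(previous):
            return found
        # Give up at the deadline and return whatever was seen last.
        if time.monotonic() >= deadline:
            return found
        previous = found
        time.sleep(poll)
```

It would be called after `driver.get(url)`, for example `collect_outer_html(driver, "Item--content--")`. If it still returns an empty list, one possibility (not verified here) is that the site serves different content to a headless or logged-out session, in which case comparing `driver.page_source` against what the browser inspector shows would be the next step.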