Can't get image URLs from a lazy-loaded page source

0 votes
1 answer
28 views
Asked 2025-04-14 17:20

I want to scrape the news images and headlines from a news page and use them on a digital sign (Xibo). Put simply, I only want the first three rows of main content from this URL, with no header or footer information and no extra code or scripts, just the medium-sized images and the headlines below them. I'd like to scrape the images and headlines once a day and generate a simple HTML page with Flask that the content management system (CMS) can read.

I understand that in this case I need selenium to get the rendered page, correct? In the code below I'm struggling to find the image links correctly. The code reads the page and scrolls, but it doesn't find any images. I also tried some nested divs, without success. Could someone point me in the right direction for getting the image links (and eventually the headlines too)?

#News feed test for Xibo Signage
#from flask import Flask, render_template
from markupsafe import Markup
#app=Flask(__name__) 
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup
import requests

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

#installed chrome driver in scripts so don't need next lines?    
#chromedriver_path = '...'
driver = webdriver.Chrome()

url = "https://news.clemson.edu/tag/extension/"

driver.get(url)

# wait (up to 20 seconds) until the images are visible on page
images = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "site-main")))
# scroll to the last image, so that all images get rendered correctly
driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', images[-1])
time.sleep(2)

# PRINT URLS USING SELENIUM -for test (will pass to Flask)

print('Selenium')
for img in images:
    print(img.get_attribute('src'))



#@app.route('/') 
#def home():
#   return render_template('home.html',thumbnailmk=thumbnailmk)

#if __name__ == '__main__':
#   app.run(host='0.0.0.0')
#   app.run(debug=True)

1 Answer

0

The problem here is that you aren't selecting any images. Try changing your approach and focus on what you actually want to find:

for e in driver.find_elements(By.CSS_SELECTOR,'article img'):
    print(e.get_attribute('data-srcset').split()[0])
Example

This example targets the data-srcset attribute and takes the first URL from it. Because the page lazy-loads its images, the real URLs live in the data-srcset / data-src attributes rather than in src:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()

url = "https://news.clemson.edu/tag/extension/"

driver.get(url)

for e in driver.find_elements(By.CSS_SELECTOR,'article img'):
    print(e.get_attribute('data-srcset').split()[0])
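
If you want to keep an explicit wait (as in your original script) and pick up the headlines in the same pass, a sketch along these lines should work. The headline selector (a link inside an h2/h3 within each article) is an assumption about the markup, so adjust it to whatever the page actually uses:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://news.clemson.edu/tag/extension/")

# wait until the lazy-loaded article images are present in the DOM
WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "article img"))
)

for article in driver.find_elements(By.CSS_SELECTOR, "article"):
    imgs = article.find_elements(By.CSS_SELECTOR, "img")
    links = article.find_elements(By.CSS_SELECTOR, "h2 a, h3 a")
    if not imgs or not links:
        continue  # skip posts without a thumbnail or a headline link
    # fall back to src in case an image is not lazy-loaded
    src = imgs[0].get_attribute("data-srcset") or imgs[0].get_attribute("src")
    print(src.split()[0], "-", links[0].text)

driver.quit()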

However, you don't necessarily need selenium; you can also use requests:

import requests
from bs4 import BeautifulSoup

url = "https://news.clemson.edu/tag/extension/"

soup = BeautifulSoup(requests.get(url, headers={'user-agent':'some-agent'}).text, 'html.parser')

for e in soup.select('article img.lazyload'):
    print(e.get('data-src'))

https://news.clemson.edu/wp-content/uploads/2023/04/ag-and-art-scaled.jpg
https://news.clemson.edu/wp-content/uploads/2024/03/AgTech_Forum_FeatureImage.jpg
...
https://news.clemson.edu/wp-content/uploads/2023/09/Cooperative-Extension-RGB-color_featured.jpg
https://news.clemson.edu/wp-content/uploads/2023/09/20141107-simpson-5911-X5.jpg
https://news.clemson.edu/wp-content/uploads/2023/09/TailgateFoodSafety.jpg
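
Since the end goal is a small HTML page that Xibo can read, here is a minimal sketch of how the scraped results could be handed to Flask, based on the routes commented out in your script. The template name home.html, the items variable, and the headline selector are placeholders/assumptions you would adapt:

from flask import Flask, render_template
import requests
from bs4 import BeautifulSoup

app = Flask(__name__)
URL = "https://news.clemson.edu/tag/extension/"

def scrape():
    soup = BeautifulSoup(
        requests.get(URL, headers={'user-agent': 'some-agent'}).text,
        'html.parser')
    items = []
    for article in soup.select('article'):
        img = article.select_one('img.lazyload')
        title = article.select_one('h2 a, h3 a')   # assumed headline markup
        if img and title:
            items.append({'image': img.get('data-src'),
                          'title': title.get_text(strip=True)})
    return items[:3]   # keep only the first few posts for the sign

@app.route('/')
def home():
    # home.html would loop over items and emit the <img> tags plus the titles
    return render_template('home.html', items=scrape())

if __name__ == '__main__':
    app.run(host='0.0.0.0')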
