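Scraping a website with infinite scroll using Python

15 votes

6 answers

33370 views

Asked 2025-04-18 00:15

I have been doing some research and settled on Scrapy as the Python library to use. Now I would like to know a good way to scrape websites that use infinite scrolling with Scrapy. After digging further I found a library called Selenium, which also has a Python module. I suspect someone has already used Scrapy together with Selenium to scrape infinite-scroll sites. It would be great if someone could point me to an example.

Tags: automated testing, web crawling, web parsing, data scraping, selenium, scrapy, infinite scroll, crawler framework

6 Answers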

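Great question!

The challenge

When dealing with infinite-scroll pages (or dynamically loaded sites in general), we cannot know exactly how long new content will take to load. That makes it hard to decide how long to wait for new content before pressing page-down again.

Even with that solved, we still need to scroll far enough to actually reach the bottom of the page, which means pressing page-down enough times.

In short: if the site is slow, or the data takes a while to load for some other reason, we do not want to quit too early.

My solution

First, define a scroll_down function that takes an element to send key presses to (here the page body) and a positive integer n.
The function contains a for-loop that presses page-down n times, waiting 0.01 seconds between presses (this value can be tuned).
Store the current scroll height of the page in a variable called prev_height.
In a for-loop, call the function defined above to scroll down.
In each iteration, pause for a while so that more content can load (I waited 10 seconds).
After the pause, compare prev_height with the current height. If they are the same, exit; otherwise continue.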

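Code

Scroll function:

import time
from selenium.webdriver.common.keys import Keys

def scroll_down(elem, num):
    # Press PAGE_DOWN `num` times, pausing briefly between key presses
    for _ in range(num):
        time.sleep(.01)
        elem.send_keys(Keys.PAGE_DOWN)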

Main code:

    driver = <load driver etc.> 
    SCROLL_PAUSE_TIME = 10
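    # find_element_by_tag_name was removed in Selenium 4.3; use a By locator
    # (requires: from selenium.webdriver.common.by import By)
    elem = driver.find_element(By.TAG_NAME, "body")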
    prev_height = elem.get_attribute("scrollHeight")
    
    
    for i in range(0, 500):
        # note that the pause between page downs is only .01 seconds
        # in this case that would be a sum of 1 second waiting time
        scroll_down(elem,100)
        # Wait to allow new items to load
        time.sleep(SCROLL_PAUSE_TIME)

        #check to see if scrollable space got larger
        #also we're waiting until the second iteration to give time for the initial loading
        if elem.get_attribute("scrollHeight") == prev_height and i > 0:
            break
        prev_height = elem.get_attribute("scrollHeight")

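Note: the actual numbers I used in my program may not suit your case, but I believe the approach itself is sound. It has been very reliable for me, although it does take a while to run.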
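
One possible way to shorten the fixed 10-second pause is to poll the page height and continue as soon as it grows. The sketch below is not part of the answer above; the name wait_for_growth and its parameters are hypothetical, and get_height is assumed to be any callable that returns the current scroll height as a number.

```python
import time


def wait_for_growth(get_height, prev_height, timeout=10.0, poll=0.5,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll get_height() until it exceeds prev_height or the timeout elapses.

    Returns the new height, or None if the page did not grow in time.
    `clock` and `sleep` are injectable so the helper can be tested
    without a browser or real waiting.
    """
    deadline = clock() + timeout
    while True:
        height = get_height()
        if height > prev_height:
            return height          # new content arrived
        if clock() >= deadline:
            return None            # gave up: probably the end of the page
        sleep(poll)
```

With Selenium, get_height could be something like lambda: driver.execute_script("return document.body.scrollHeight"), which returns a number. The timeout then plays the role of SCROLL_PAUSE_TIME, but it is only spent in full when nothing new loads.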

Answered 2025-04-18 by Python大师

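With infinite scrolling, the data is fetched through Ajax requests. Open the page in a web browser, go to the Network tab of the developer tools, and click the stop-like icon to clear the existing request log. Then scroll down the page and you will see new requests appear. Open the headers of such a request to find its request URL. Paste that URL into a new tab and you will see the result of the Ajax request. From there, keep building request URLs of the same form and fetching data until you reach the end of the page.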
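
As a rough sketch of that idea: the code below assumes a hypothetical endpoint that takes a `page` query parameter and returns a JSON list, with an empty list once the data runs out. The real URL, parameter names and response shape must be read off the Network tab, so fetch_page and iter_items are illustrative names only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_page(base_url, page):
    """Fetch one page of JSON from a (hypothetical) Ajax endpoint."""
    url = f"{base_url}?{urlencode({'page': page})}"
    with urlopen(url, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def iter_items(fetch, max_pages=1000):
    """Yield items page by page until `fetch` returns an empty list.

    `fetch` is any callable mapping a page number to a list of items;
    `max_pages` is a safety cap for feeds that never end.
    """
    for page in range(1, max_pages + 1):
        items = fetch(page)
        if not items:
            break
        yield from items
```

A call might look like iter_items(lambda p: fetch_page(endpoint_url, p)), where endpoint_url is whatever request URL the browser showed. Many sites paginate with a cursor or offset rather than a page number, in which case the loop needs adapting.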

Answered 2025-04-18 by Python大师

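from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import selenium.webdriver

driver = selenium.webdriver.Firefox()
driver.get("http://www.something.com")
# find_elements_by_id was removed in Selenium 4.3; use a By locator instead
lastElement = driver.find_elements(By.ID, "someId")[-1]
# Sending a key to the element brings it into view
lastElement.send_keys(Keys.NULL)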

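This code opens a page, finds the bottom-most element with the given id, and scrolls that element into view. Because the page keeps loading more content, you have to keep querying the driver for the latest last element. I found this gets rather slow as the page grows. Most of the time is spent in the calls to driver.find_element_*, because I do not know of a way to query only the last element on the page directly.

Through experimenting you may find that there is an upper limit to the number of elements the page loads dynamically. In that case it is best to write something that loads that many elements first and only then calls driver.find_element_*.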
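
One option that might avoid fetching every match on each pass is to let the browser pick the last element in JavaScript and return only that one. This is a sketch, not the answer's method: last_match and scroll_to_last are hypothetical helpers, the CSS selector is a placeholder, and whether it is actually faster on a given page is untested. It relies only on the driver's execute_script method.

```python
LAST_MATCH_JS = """
var nodes = document.querySelectorAll(arguments[0]);
return nodes.length ? nodes[nodes.length - 1] : null;
"""


def last_match(driver, css_selector):
    """Return the last element matching css_selector, or None if none match."""
    return driver.execute_script(LAST_MATCH_JS, css_selector)


def scroll_to_last(driver, css_selector):
    """Scroll the last matching element into view and return it (or None)."""
    elem = last_match(driver, css_selector)
    if elem is not None:
        driver.execute_script("arguments[0].scrollIntoView();", elem)
    return elem
```

For the example above, the selector could be something like '[id="someId"]', since several elements share the same id there.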

Answered 2025-04-18 by Python大师

Here is a short, simple piece of code that works well for me:

import time
from selenium.webdriver.common.by import By

# Assumes `driver` is an already-initialised Selenium WebDriver
SCROLL_PAUSE_TIME = 20

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# find_elements_by_class_name was removed in Selenium 4.3; use a By locator
posts = driver.find_elements(By.CLASS_NAME, "post-text")

for block in posts:
    print(block.text)

Answered 2025-04-18 by Python大师

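You can use Selenium to scrape infinite-scroll websites such as Twitter or Facebook.

Step 1: Install Selenium with pip.

pip install selenium

Step 2: Use the code below to automate the infinite scrolling and extract the page source.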

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import sys

import unittest, time, re

class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"
        self.verificationErrors = []
        self.accept_next_alert = True
    def test_sel(self):
        driver = self.driver
        delay = 3
        driver.get(self.base_url + "/search?q=stackoverflow&src=typd")
        # find_element_by_link_text was removed in Selenium 4.3; use a By locator
        driver.find_element(By.LINK_TEXT, "All").click()
        for i in range(1,100):
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)
        html_source = driver.page_source
        data = html_source.encode('utf-8')


if __name__ == "__main__":
    unittest.main()

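The for loop scrolls through the infinite-scroll content so that more of it loads; afterwards you can extract the loaded data from the page source.

Step 3: Print the data if needed.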
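
To illustrate the extraction step, here is a minimal sketch that pulls text out of a page-source string using only the standard library. It is not part of the answer above: ClassTextExtractor and extract_texts are hypothetical names, the class name passed in is a placeholder for whatever the target site uses, and the parser assumes reasonably well-nested HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # > 0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1             # nested tag inside a match
        elif self.class_name in classes:
            self.depth = 1              # entering a new match
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def extract_texts(html_source, class_name):
    """Return the stripped text of each element with the given class."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html_source)
    parser.close()
    return [text.strip() for text in parser.texts]
```

Inside test_sel this could be called as extract_texts(html_source, "some-class") and the result printed. In a Scrapy project, the same string could instead be wrapped in a scrapy Selector (Selector(text=html_source)) and queried with CSS or XPath.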

Answered 2025-04-18 by Python大师
