Scraping all results from a page with BeautifulSoup

Posted 2024-04-20 09:14:47


**Update**

OK folks, so far so good. I now have code that downloads the images, but it stores them in a strange way: it downloads 40-odd pictures, then creates another "kittens" folder inside the "kittens" folder it already created and starts over, downloading the same images into it again. How can I fix that? Code below:

[updated code not preserved in the original post]
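The nested-folder symptom usually comes from combining `os.chdir('kittens')` with a later `os.makedirs('kittens')`: once the script has changed into the folder, the next `makedirs` call creates `kittens/kittens`. A minimal sketch of the usual fix (the helper name `download_path` is my own, not from the original code): create the folder once with `exist_ok=True` and build file paths with `os.path.join` instead of changing directory.

```python
import os

def download_path(filename, folder='kittens'):
    # Create the folder once; exist_ok avoids an error on re-runs and
    # removes the need for a separate os.path.exists check.
    os.makedirs(folder, exist_ok=True)
    # Join the filename onto the folder instead of os.chdir-ing into it,
    # so the path is always relative to the same starting directory and
    # a second pass can never create kittens/kittens.
    return os.path.join(folder, filename)

# usage: open(download_path('kitten-0.jpg'), 'wb')
```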

====================================================================================

I want to write a spider that scrapes pictures of kittens from a certain page. I've hit a small problem: my spider only gets the first 15 images. I suspect this is because the page loads more images as you scroll down. How can I solve this? Code below:

import requests
from bs4 import BeautifulSoup as bs
import os


url = 'https://www.pexels.com/search/cute%20kittens/'

page = requests.get(url)
soup = bs(page.text, 'html.parser')

image_tags = soup.find_all('img')  # find_all: findAll is the legacy alias

# exist_ok avoids an error (and a nested folder) on repeated runs
os.makedirs('kittens', exist_ok=True)

for x, image in enumerate(image_tags):
    try:
        source = requests.get(image['src'])
    except (KeyError, requests.RequestException):
        continue  # tag without a src attribute, or network failure
    if source.status_code == 200:
        # build the path explicitly instead of os.chdir('kittens')
        with open(os.path.join('kittens', 'kitten-{}.jpg'.format(x)), 'wb') as f:
            # reuse the response already fetched; the original called
            # requests.get(url) a second time, downloading each image twice
            f.write(source.content)
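One more thing worth checking before reaching for a browser tool: lazy-loading pages often serve the first batch of images with a real `src` and put the remaining URLs in an attribute such as `data-src` (the attribute name is site-specific and an assumption here; inspect the actual page markup to confirm). A sketch of collecting whichever attribute is present:

```python
from bs4 import BeautifulSoup

# toy markup standing in for a lazy-loading page: one eager image,
# one whose real URL sits in data-src behind a placeholder src
html = """
<img src="https://example.com/a.jpg">
<img data-src="https://example.com/b.jpg" src="placeholder.gif">
"""

soup = BeautifulSoup(html, 'html.parser')

urls = []
for img in soup.find_all('img'):
    # prefer the lazy-load attribute when present; fall back to src
    src = img.get('data-src') or img.get('src')
    if src and src.startswith('http'):
        urls.append(src)

print(urls)
```

This only helps if the URLs are already in the HTML; if the page fetches them via JavaScript after scrolling, you need browser automation as in the answer below.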

1 Answer

Answered 2024-04-20 09:14:47

Since the site is dynamic, you need a browser-automation tool such as selenium:

from selenium import webdriver
from bs4 import BeautifulSoup as soup
import time
import os

driver = webdriver.Chrome()
driver.get('https://www.pexels.com/search/cute%20kittens/')

# scroll to the bottom until the page height stops growing,
# so all lazy-loaded images end up in the DOM
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(0.5)  # give the newly loaded images time to arrive
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

image_urls = [i['src'] for i in soup(driver.page_source, 'html.parser').find_all('img') if i.get('src')]

os.makedirs('kittens', exist_ok=True)
# open in write mode -- the original open('kittens.txt') with no mode
# would try to read a file that does not exist yet
with open(os.path.join('kittens', 'kittens.txt'), 'w') as f:
    for url in image_urls:
        f.write('{}\n'.format(url))
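The answer above only writes the collected URLs to a text file. To actually save the images, you can feed those URLs through a download loop like the one in the question; a sketch (the function name `download_images` and the error handling are my own):

```python
import os
import requests

def download_images(image_urls, folder='kittens'):
    os.makedirs(folder, exist_ok=True)
    saved = []
    for i, url in enumerate(image_urls):
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip URLs that fail instead of aborting the run
        if resp.status_code == 200:
            path = os.path.join(folder, 'kitten-{}.jpg'.format(i))
            with open(path, 'wb') as f:
                f.write(resp.content)  # write the response already fetched
            saved.append(path)
    return saved

# usage: download_images(image_urls) after the selenium scrape completes
```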
