How does the list comprehension work when using find_all in Beautiful Soup?

Posted 2024-05-23 17:09:47


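I'm new to Python and am currently working through the textbook "Automate the Boring Stuff with Python". The program I'm working on should:

  • Search imgur for a category of photos
  • Create a uniquely named folder in the current working directory
  • Download all the resulting images to that folder on the drive

The code I wrote runs, but I have a few questions.

imgur's HTML: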

<div class="cards">
                            <div id="SZPjHwz" class="post" data-tag1="" data-tag2="">
    <a class="image-list-link" href="/gallery/SZPjHwz" data-page="0">
        <img alt="" src="//i.imgur.com/SZPjHwzb.jpg">

Code:

#!/usr/bin/env python3

"""
This program:
    - Searches for a category of photos in imgur
    - Creates a unique name and folder in the Current Working Directory
    - Downloads all the resulting images to a folder on the drive (highest scoring of all time)

"""

# requests is used to retrieve the HTML of the page to download the pictures
# os is used to create the folders and save files to the drive
# sys is used to pass command line arguments to search
# bs4 BeautifulSoup is used to scrape and parse the HTML
# datetime is used to retrieve the date to create the folder

import requests, os, sys, bs4
from datetime import date

# Retrieve User Input
print("Search imgur: ")
search = input()

# Create folder destination on drive to save photos

today = date.today()
date = today.strftime("%b-%d-%Y")
folderName = date + " - imgur download - " + search
os.chdir("D:\Python\Projects - Automate the Boring Stuff with Python\Chapter 11\imgurDownloader")
if not os.path.exists(".\\" + folderName):
    os.makedirs(folderName)
    print("Directory " , folderName ,  " created ")
else:    
    print("Directory " , folderName ,  " already exists")

# Retrieve images from the link and save to the drive

res = requests.get("https://imgur.com/search?q=" + search)
res.raise_for_status()
imgurSoup = bs4.BeautifulSoup(res.text, "html.parser")
imgurPics = [i['href'] for i in imgurSoup.find_all('a', class_='image-list-link')]

if imgurPics == []:
    print("No results found.")
else:
    print("Downloading pictures...")
    for i in range(0,len(imgurPics)):
        pictureURL = "https://imgur.com" + imgurPics[i]
        imageFile = open(os.path.join(folderName, os.path.basename(pictureURL)), "wb")
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()

    print("Download successful")

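Questions:

  1. I'm able to get the link for every picture in the gallery, but I don't understand why it works. I found the code in another Stack Overflow post, and I'm confused about how it uses a list comprehension to find the elements and build the list. Why doesn't the following work? Is there a way to use select?

  2. What is the difference between select and find_all?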

    imgurPics = imgurSoup.select('a', class_='image-list-link')
    
  3. When I download the pictures with the following code, the files saved to my folder can't be opened. What is the problem here?

    for i in range(0,len(imgurPics)):
            pictureURL = "https://imgur.com" + imgurPics[i]
            imageFile = open(os.path.join(folderName, os.path.basename(pictureURL)), "wb")
            for chunk in res.iter_content(100000):
                imageFile.write(chunk)
            imageFile.close()
    

Thanks in advance for your help.


1 answer
Answer 1 · posted 2024-05-23 17:09:47
  1. A list comprehension is just a for loop written on a single line. So

    imgurPics = []
    for i in imgurSoup.find_all('a', class_='image-list-link'):
        imgurPics.append(i['href'])
    

is the same as the list comprehension

imgurPics = [i['href'] for i in imgurSoup.find_all('a', class_='image-list-link')]
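As a minimal sketch of that equivalence, the snippet below builds the same list both ways. Plain dicts stand in for the Tag objects that find_all returns (both support `tag['href']` lookups), and the second gallery id is made up for illustration.

```python
# Plain dicts stand in for Beautiful Soup Tag objects here (an assumption
# made so the example runs without any HTML); the data is illustrative.
links = [
    {"href": "/gallery/SZPjHwz", "class": "image-list-link"},
    {"href": "/gallery/abc1234", "class": "image-list-link"},
]

# Explicit loop: start with an empty list and append one item per pass.
hrefs_loop = []
for link in links:
    hrefs_loop.append(link["href"])

# List comprehension: [expression for item in iterable]
hrefs_comp = [link["href"] for link in links]

# Both approaches produce the same list.
print(hrefs_loop == hrefs_comp)
```

The comprehension reads as "the href of each link in links": the expression on the left is what gets appended, and the `for` clause on the right is the loop header.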

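  2. .find_all and select are basically the same idea. The difference is that select takes a CSS selector, so the class belongs inside the selector string rather than in a class_ keyword argument. I'm personally more comfortable with .find_all() because it's what I learned first. If you want to use select, you can use imgurPics = [i['href'] for i in imgurSoup.select('a.image-list-link')]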
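The following sketch compares the two calls on the HTML fragment from the question. It assumes the bs4 package is installed; the closing tags are added here only so the fragment is complete.

```python
import bs4

# HTML fragment from the question, with closing tags added for completeness.
html = """
<div class="cards">
  <div id="SZPjHwz" class="post" data-tag1="" data-tag2="">
    <a class="image-list-link" href="/gallery/SZPjHwz" data-page="0">
      <img alt="" src="//i.imgur.com/SZPjHwzb.jpg">
    </a>
  </div>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# find_all filters by tag name plus keyword arguments such as class_.
via_find_all = [a["href"] for a in soup.find_all("a", class_="image-list-link")]

# select takes a single CSS selector: tag name and class in one string.
via_select = [a["href"] for a in soup.select("a.image-list-link")]

# A CSS selector can also reach the nested img tag in one step.
img_srcs = [img["src"] for img in soup.select("a.image-list-link img")]

print(via_find_all, via_select, img_srcs)
```

Either call returns a list of tags, so the surrounding list comprehension does not change; only the way the filter is expressed differs.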

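  3. Your links point to the gallery page, not to the individual image, so you need to fetch and save the image itself. What you really want is the src of <img alt="" src="//i.imgur.com/SZPjHwzb.jpg">, not the href. Second, you never read the image response. You have res.iter_content(100000), but res refers to res = requests.get("https://imgur.com/search?q=" + search), so every file you write contains the search page's HTML. That is why the saved files won't open. You need to request the image from the link you are iterating over.

  4. Be careful with variables that have the same name as a function or module. You use from datetime import date, but then assign date = today.strftime("%b-%d-%Y"). I'd try to avoid that. I changed the variable to dateStr.

  5. You don't need to build an index range to loop over a list (unless you actually need the element's index or a counter, and even then I would just use enumerate()). There is no need to loop with for i in range(0,len(imgurPics)). Simply write for i in imgurPics, which loops over the whole list.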
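The last two points can be sketched as follows. The names `date_str`, `pics`, `names_only`, and `numbered` are hypothetical, and the file names are made up for illustration.

```python
from datetime import date

# Keeping the formatted string under its own name leaves `date` usable.
today = date.today()
date_str = today.strftime("%b-%d-%Y")

# Had we written `date = today.strftime(...)`, this call would raise
# TypeError, because `date` would then be a str instead of the class.
release_day = date(2024, 5, 23)

pics = ["a.jpg", "b.jpg", "c.jpg"]  # made-up file names

# Loop over the items directly when the index is not needed.
names_only = []
for pic in pics:
    names_only.append(pic)

# Use enumerate() when a counter is needed alongside each item.
numbered = []
for n, pic in enumerate(pics, start=1):
    numbered.append(f"{n}: {pic}")

print(date_str, release_day, names_only, numbered)
```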

So have a look at the code below. I marked each place where I made a change so you can see it.

import requests, os, sys, bs4
from datetime import date

# Retrieve User Input
search = input("Search imgur:\n\n") #< - made change here. You can do what you had in 1 line

# Create folder destination on drive to save photos
today = date.today()
dateStr = today.strftime("%b-%d-%Y")  #< - changed the variable name
folderName = dateStr + " - imgur download - " + search
os.chdir(r"D:\Python\Projects - Automate the Boring Stuff with Python\Chapter 11\imgurDownloader")  # raw string so the backslashes are not treated as escape sequences
if not os.path.exists(".\\" + folderName):
    os.makedirs(folderName)
    print("Directory " , folderName ,  " created ")
else:    
    print("Directory " , folderName ,  " already exists")

# Retrieve images from the link and save to the drive

res = requests.get("https://imgur.com/search?q=" + search)
res.raise_for_status()
imgurSoup = bs4.BeautifulSoup(res.text, "html.parser")
imgurPics = [i['src'] for i in imgurSoup.find_all('img') if 'loaders' not in i['src']] #< - changed to read the src attribute of each img tag, which is the link to the image itself. This may pick up an extra loader image we don't want, so I added the if filter as well

if imgurPics == []:
    print("No results found.")
else:
    print("Downloading pictures...")
    for pic in imgurPics:  #< - change the loop syntax slightly
        pictureURL = "https:" + pic  #< - had to change this slightly
        resPic = requests.get(pictureURL)  #< - added this line to request the image itself
        resPic.raise_for_status()
        # the with block closes the file even if writing fails
        with open(os.path.join(folderName, os.path.basename(pictureURL)), "wb") as imageFile:
            for chunk in resPic.iter_content(100000): #< - iterate over the image response `resPic`, not the search page response `res`
                imageFile.write(chunk)

    print("Download successful")
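The URL handling in the loop above can also be checked without touching the network. This is a minimal sketch rather than part of the original answer: `image_url_and_name` and `SEARCH_PAGE` are hypothetical names, and it swaps the `"https:" + pic` concatenation for `urllib.parse.urljoin`, which resolves protocol-relative values such as `//i.imgur.com/SZPjHwzb.jpg` against the page they were scraped from.

```python
import os
from urllib.parse import urljoin, urlparse

# Page the src values were scraped from (taken from the question's code).
SEARCH_PAGE = "https://imgur.com/search"


def image_url_and_name(src, page_url=SEARCH_PAGE):
    """Return the absolute image URL and a local file name for an img src."""
    # urljoin fills in the scheme for "//host/path" values and the
    # scheme plus host for "/path" values.
    full_url = urljoin(page_url, src)
    # Take the file name from the URL path only, ignoring any query string.
    file_name = os.path.basename(urlparse(full_url).path)
    return full_url, file_name


url, name = image_url_and_name("//i.imgur.com/SZPjHwzb.jpg")
print(url, name)
```

The returned URL is what would be passed to requests.get, and the file name is what would be joined with folderName before opening the file for writing.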
