How does a list comprehension work when using find_all in Beautiful Soup?

<div class="cards">
  <div id="SZPjHwz" class="post" data-tag1="" data-tag2="">
    <a class="image-list-link" href="/gallery/SZPjHwz" data-page="0">
      <img alt="" src="//i.imgur.com/SZPjHwzb.jpg">

#!/usr/bin/env python3
"""
This program:
- Searches for a category of photos in imgur
- Creates a unique name and folder in the Current Working Directory
- Downloads all the resulting images to a folder on the drive (highest scoring of all time)
"""

# requests is used to retrieve the HTML of the page to download the pictures
# os is used to create the folders and save files to the drive
# sys is used to pass command line arguments to search
# bs4 BeautifulSoup is used to scrape and parse the HTML
# datetime is used to retrieve the date to create the folder
import requests, os, sys, bs4
from datetime import date

# Retrieve User Input
print("Search imgur: ")
search = input()

# Create folder destination on drive to save photos
today = date.today()
date = today.strftime("%b-%d-%Y")
folderName = date + " - imgur download - " + search
os.chdir("D:\Python\Projects - Automate the Boring Stuff with Python\Chapter 11\imgurDownloader")
if not os.path.exists(".\\" + folderName):
    os.makedirs(folderName)
    print("Directory " , folderName , " created ")
else:
    print("Directory " , folderName , " already exists")

# Retrieve images from the link and save to the drive
res = requests.get("https://imgur.com/search?q=" + search)
res.raise_for_status()
imgurSoup = bs4.BeautifulSoup(res.text, "html.parser")
imgurPics = [i['href'] for i in imgurSoup.find_all('a', class_='image-list-link')]

if imgurPics == []:
    print("No results found.")
else:
    print("Downloading pictures...")
    for i in range(0,len(imgurPics)):
        pictureURL = "https://imgur.com" + imgurPics[i]
        imageFile = open(os.path.join(folderName, os.path.basename(pictureURL)), "wb")
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()

    print("Download successful")

for i in range(0,len(imgurPics)):
    pictureURL = "https://imgur.com" + imgurPics[i]
    imageFile = open(os.path.join(folderName, os.path.basename(pictureURL)), "wb")
    for chunk in res.iter_content(100000):
        imageFile.write(chunk)
    imageFile.close()

1 answer

User

Answer #1 · posted 2024-05-23 17:09:47

1. A list comprehension is just a for loop that builds a list, written as a single expression. So

imgurPics = []
for i in imgurSoup.find_all('a', class_='image-list-link'):
    imgurPics.append(i['href'])

is equivalent to the list comprehension

imgurPics = [i['href'] for i in imgurSoup.find_all('a', class_='image-list-link')]
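To check that the two forms produce the same list, here is a minimal sketch that runs without network access. The `tags` list of dictionaries is a hypothetical stand-in for what `find_all` returns (the second entry is made up); it works here only because Beautiful Soup `Tag` objects also support lookups such as `tag['href']`.

```python
# Hypothetical stand-in for imgurSoup.find_all('a', class_='image-list-link').
# Real results are bs4 Tag objects, which also support tag["href"].
tags = [
    {"href": "/gallery/SZPjHwz", "class": "image-list-link"},
    {"href": "/gallery/AbCdEfG", "class": "image-list-link"},  # made-up entry
]

# Explicit loop: start with an empty list and append one item per iteration.
hrefs_loop = []
for tag in tags:
    hrefs_loop.append(tag["href"])

# List comprehension: the same loop written as a single expression.
hrefs_comp = [tag["href"] for tag in tags]

print(hrefs_loop == hrefs_comp)
```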

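2. .find_all() and .select() are basically the same idea; the difference is that .select() takes a CSS selector. I am personally more comfortable with .find_all() because it is what I learned first. If you want to use select, you can write imgurPics = [i['href'] for i in imgurSoup.select('a.image-list-link')]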
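As a small offline sketch of that equivalence, the snippet below parses markup modeled on the HTML in the question (closing tags are added here so the fragment is complete, which is an assumption about the real page) and collects the href values both ways.

```python
import bs4

# Markup modeled on the snippet in the question; the closing tags are assumed.
html = """
<div class="cards">
  <div id="SZPjHwz" class="post">
    <a class="image-list-link" href="/gallery/SZPjHwz" data-page="0">
      <img alt="" src="//i.imgur.com/SZPjHwzb.jpg">
    </a>
  </div>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# find_all filters by tag name plus keyword arguments such as class_.
via_find_all = [a["href"] for a in soup.find_all("a", class_="image-list-link")]

# select takes a CSS selector: tag name "a" with class "image-list-link".
via_select = [a["href"] for a in soup.select("a.image-list-link")]

print(via_find_all)
print(via_select)
```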

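3. Your links point to gallery pages, not to individual images, so you need to pull and save each image separately. What you really want is the src of <img alt="" src="//i.imgur.com/SZPjHwzb.jpg">, not the href. Second, you are not reading the image response. You call res.iter_content(100000), but res refers to res = requests.get("https://imgur.com/search?q=" + search), which is the search page. You need to request each image link you are iterating over and read the content from that response.
4. Be careful with variable names that match the name of a function or module. You use from datetime import date, but then assign date = today.strftime("%b-%d-%Y"). Try to avoid that. I changed the variable to dateStr.
5. You don't need to create an index range to loop (unless you use or need the index of each element, or want a counter; even then I would just use enumerate()). There is no need to loop with for i in range(0,len(imgurPics)). Simply write for i in imgurPics, which loops over every item in the list.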
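The last two points, on name shadowing and on looping, can be illustrated with a short sketch. The names `date_str`, `pics`, `urls`, and `numbered`, and the sample values, are made up for illustration and are not part of the original program.

```python
from datetime import date

# Store the formatted string under a different name so that `date`
# keeps referring to the class imported from datetime.
today = date.today()
date_str = today.strftime("%b-%d-%Y")

# Because `date` was not overwritten, it can still be called later.
same_day = date(today.year, today.month, today.day)

pics = ["//i.imgur.com/a.jpg", "//i.imgur.com/b.jpg"]  # made-up sample values

# Iterate over the items directly; no range(len(...)) is needed.
urls = []
for pic in pics:
    urls.append("https:" + pic)

# When a counter is useful, enumerate() yields (index, item) pairs.
numbered = []
for index, pic in enumerate(pics, start=1):
    numbered.append(f"{index}: {pic}")

print(urls)
print(numbered)
```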

So here is the code. I marked where I made changes so you can see them.

import requests, os, sys, bs4
from datetime import date

# Retrieve User Input
search = input("Search imgur:\n\n") #< - made change here. You can do what you had in 1 line

# Create folder destination on drive to save photos
today = date.today()
dateStr = today.strftime("%b-%d-%Y")  #< - changed the variable name
folderName = dateStr + " - imgur download - " + search
os.chdir(r"D:\Python\Projects - Automate the Boring Stuff with Python\Chapter 11\imgurDownloader")
if not os.path.exists(".\\" + folderName):
    os.makedirs(folderName)
    print("Directory " , folderName ,  " created ")
else:    
    print("Directory " , folderName ,  " already exists")

# Retrieve images from the link and save to the drive

res = requests.get("https://imgur.com/search?q=" + search)
res.raise_for_status()
imgurSoup = bs4.BeautifulSoup(res.text, "html.parser")
imgurPics = [i['src'] for i in imgurSoup.find_all('img') if 'loaders' not in i['src']] #<  change here to get the img tag with the src attribute, which is the link to the image. This will possibly include an extra link that we don't want, so I added the if part as well

if imgurPics == []:
    print("No results found.")
else:
    print("Downloading pictures...")
    for pic in imgurPics:  #< - change the loop syntax slightly
        pictureURL = "https:" + pic  #< - had to change this slightly
        resPic = requests.get(pictureURL)  #< - added this line to request the image itself
        imageFile = open(os.path.join(folderName, os.path.basename(pictureURL)), "wb")
        for chunk in resPic.iter_content(100000): #< - need to read from the image response, so used the response from 2 lines up, `resPic`
            imageFile.write(chunk)
        imageFile.close()

    print("Download successful")
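The loop above builds each image URL by prepending "https:" to the src value and takes the file name with os.path.basename. As an alternative sketch, not what the code above does, the standard library can resolve a src that starts with // against the page URL and extract the file name from the URL path. `PAGE_URL`, `absolute_url`, and `file_name` are hypothetical names introduced only for this example.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

PAGE_URL = "https://imgur.com/search?q=cats"  # hypothetical search page URL


def absolute_url(src, page_url=PAGE_URL):
    """Resolve an img src (possibly starting with //) against the page URL."""
    return urljoin(page_url, src)


def file_name(url):
    """Return the last path segment of the URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


picture_url = absolute_url("//i.imgur.com/SZPjHwzb.jpg")
print(picture_url)
print(file_name(picture_url))
```

Using urljoin means the scheme is taken from the page that was actually requested, and splitting the URL first keeps a query string out of the saved file name.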
