如何使用Python获取并下载所有图像链接
这是我的代码
from bs4 import BeautifulSoup
import urllib.request
import re
print("Enter the link \n")
link = input()
url = urllib.request.urlopen(link)
content = url.read()
soup = BeautifulSoup(content)
links = [a['href'] for a in soup.find_all('a',href=re.compile('http.*\.jpg'))]
print (len(links))
#print (links)
print("\n".join(links))
当我输入以下内容时
http://keralapals.com/emmanuel-malayalam-movie-stills
我得到的输出是
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-0.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-1.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-2.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-3.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-4.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-5.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-6.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-7.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-8.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-9.jpg
http://keralapals.com/wp-content/uploads/2013/01/Emmanuel-malayalam-movie-mammootty-photos-pics-wallpapers-10.jpg
但是,当我输入以下内容时
http://www.raagalahari.com/actress/13192/regina-cassandra-at-big-green-ganesha-2014.aspx
or
http://www.ragalahari.com/actress/13192/regina-cassandra-at-big-green-ganesha-2014.aspx
它没有任何输出 :(
所以,我需要获取原始图片的链接。这个页面只包含缩略图。当我们点击这些缩略图时,就能得到原始图片的链接。我需要获取这些图片链接并下载 :(
任何帮助都非常欢迎 .. :)
谢谢你
Muneeb K
1 个回答
3
问题在于,在第二种情况下,实际的图片网址是以 .jpg
结尾的,这些网址放在了 img
标签的 src
属性里面:
<a href="/actress/13192/regina-cassandra-at-big-green-ganesha-2014/image61.aspx">
<img src="http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha61t.jpg" alt="Regina Cassandra" title="Regina Cassandra at BIG Green Ganesha 2014">
</a>
作为一种选择,你也可以支持这种类型的链接:
links = [a['href'] for a in soup.find_all('a', href=re.compile('http.*\.jpg'))]
imgs = [img['src'] for img in soup.find_all('img', src=lambda x: x.endswith('.jpg'))]
links += imgs
print (len(links))
print("\n".join(links))
对于 这个网址,它会打印:
http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha61t.jpg
http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha105t.jpg
http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha106t.jpg
http://imgcdn.raagalahari.com/aug2014/starzone/regina-big-green-ganesha/regina-big-green-ganesha107t.jpg
...
注意,我没有使用普通的正则表达式,而是传递了 一个函数,在这个函数里我检查 src
属性是否以 .jpg
结尾。
希望这对你有帮助,也希望你今天学到了一些新东西。