Scraping Google Images with Python

import requests
import re
import urllib2
import os
import cookielib
import json
from bs4 import BeautifulSoup  # required by get_soup; missing from the original post


def get_soup(url, header):
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url, headers=header)), 'html.parser')


query = raw_input("query image")  # you can change the query for the image here
image_type = "ActiOn"
query = query.split()
query = '+'.join(query)
url = "https://www.google.com/search?q=" + query + "&source=lnms&tbm=isch"
print url

# add the directory for your image here
DIR = r"C:\Users\mynam\Desktop\WB"
header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
soup = get_soup(url, header)

ActualImages = []  # contains the link for Large original images, type of image
for a in soup.find_all("div", {"class": "rg_meta"}):
    link, Type = json.loads(a.text)["ou"], json.loads(a.text)["ity"]
    ActualImages.append((link, Type))

print "there are total", len(ActualImages), "images"

if not os.path.exists(DIR):
    os.mkdir(DIR)
DIR = os.path.join(DIR, query.split()[0])
if not os.path.exists(DIR):
    os.mkdir(DIR)

### print images
for i, (img, Type) in enumerate(ActualImages[0:5]):
    try:
        # header is already a dict, so pass it directly
        req = urllib2.Request(img, headers=header)
        raw_img = urllib2.urlopen(req).read()
        cntr = len([i for i in os.listdir(DIR) if image_type in i]) + 1
        print cntr
        if len(Type) == 0:
            f = open(os.path.join(DIR, image_type + "_" + str(cntr) + ".jpg"), 'wb')
        else:
            f = open(os.path.join(DIR, image_type + "_" + str(cntr) + "." + Type), 'wb')
        f.write(raw_img)
        f.close()
    except Exception as e:
        print "could not load : " + img
        print e

3 Answers

User

#1 · Edited 2024-05-29 05:12:35

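I'm facing the same problem: rg_meta cannot be found. I did find some code that downloads roughly the first 80 images, but it does not handle scrolling.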

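# Fragment: `browser` (presumably a Selenium WebDriver), `header`, `counter`,
# `succounter` and `totalcount` are defined elsewhere in the original script.
extensions = {"jpg", "jpeg", "png", "gif"}

# Image URLs appear in the page source right after a '["' sequence.
html = browser.page_source.split('["')
print(html)
imges = []
for i in html:
    # Keep only chunks that start with a URL ending in a known image extension.
    if i.startswith('http') and i.split('"')[0].split('.')[-1] in extensions:
        x = i.split('"')[0]

        if succounter >= totalcount:
            break
        counter = counter + 1
        print "Total Count:", counter
        print "Successful Count:", succounter
        print "URL:", x

        try:
            # `header` is used as the User-Agent string here.
            req = urllib2.Request(x, headers={'User-Agent': header})
            succounter = succounter + 1
            if succounter > 1000:
                break
        except:
            print "can't get img"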
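
The URL-extraction step above can also be sketched as a small standalone function. This is a minimal Python 3 sketch, not the code from the answer: the name `extract_image_urls`, the `limit` parameter, and the sample string are made up for illustration, and unlike the fragment above it checks the extension on the URL path only, so a trailing query string does not hide it.

```python
from urllib.parse import urlparse

IMAGE_EXTENSIONS = {"jpg", "jpeg", "png", "gif"}


def extract_image_urls(page_source, limit=None):
    """Return image URLs that follow a '["' sequence in page_source."""
    urls = []
    for chunk in page_source.split('["'):
        # Everything up to the next double quote is the candidate URL.
        candidate = chunk.split('"')[0]
        if not candidate.startswith("http"):
            continue
        # Look at the extension of the path only, ignoring any query string.
        path = urlparse(candidate).path
        extension = path.rsplit(".", 1)[-1].lower()
        if extension in IMAGE_EXTENSIONS and candidate not in urls:
            urls.append(candidate)
            if limit is not None and len(urls) >= limit:
                break
    return urls


# Hypothetical page-source fragment, made up for illustration.
sample = (
    '[["https://example.com/a.jpg",1,2],'
    '["https://example.com/page.html",3],'
    '["https://example.com/b.PNG?x=1",4]]'
)
print(extract_image_urls(sample))
```

Whether real result pages still embed image URLs in this `["http...` form is an assumption carried over from the fragment above; the layout of the page source may change at any time.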

User

#2 · Edited 2024-05-29 05:12:35

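Google appears to have recently removed the metadata from its image search results, i.e. rg_meta can no longer be found in the HTML. As a result, soup.find_all("div",{"class":"rg_meta"}) returns nothing.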
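
To make the failure mode concrete, here is a minimal Python 3 sketch that uses only the standard library instead of BeautifulSoup. The class `RgMetaParser`, the function `extract_rg_meta`, and both HTML snippets are made up for illustration; the "ou" and "ity" keys are the ones read by the question's code. When no rg_meta div is present, the result is simply an empty list, which is why the original script reports zero images rather than raising an error.

```python
import json
from html.parser import HTMLParser


class RgMetaParser(HTMLParser):
    """Collect the text content of every <div class="rg_meta"> element."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "rg_meta" in classes:
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._inside:
            self.blocks.append("".join(self._buffer))
            self._inside = False


def extract_rg_meta(html):
    """Return (url, type) pairs; an empty list means the markup is absent."""
    parser = RgMetaParser()
    parser.feed(html)
    parser.close()
    results = []
    for block in parser.blocks:
        meta = json.loads(block)
        results.append((meta["ou"], meta.get("ity", "")))
    return results


# Hypothetical snippets, made up for illustration.
old_style = '<div class="rg_meta">{"ou": "https://example.com/a.jpg", "ity": "jpg"}</div>'
new_style = '<div class="other"><img src="https://example.com/a.jpg"></div>'

print(extract_rg_meta(old_style))  # one (url, type) pair
print(extract_rg_meta(new_style))  # empty list: nothing to download
```

A script that depends on this markup can check for the empty list and stop with a clear message instead of silently creating an empty download folder.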

I have not found a workaround yet. I believe Google made this change precisely to prevent scraping.

User
#3 · Edited 2024-05-29 05:12:35

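I haven't seen anyone mention this. It is not an ideal solution, but if you want something simple that works without much hassle, you can use selenium. As Densus mentioned, Google seems to be deliberately blocking image scraping, so this may be an improper use of selenium; I'm not sure.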

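There are many public, working selenium-based Google image scrapers on GitHub that you can look at and use. In fact, if you search GitHub for any recent Python Google image scraper, I think most of them (if not all) will be selenium implementations.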

For example: https://github.com/Msalmannasir/Google_image_scraper

For this one, just download the Chromium driver and update its file path in the code.
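
For readers who want a rough idea of what such a selenium approach involves, here is a minimal Python 3 sketch. It is not taken from the linked repository. It assumes Selenium 4 and a ChromeDriver binary that matches the installed browser; the function names, the `driver_path` argument, and the scroll count and pause are all hypothetical choices for illustration.

```python
from urllib.parse import quote_plus


def build_search_url(query):
    """Build the same image-search URL that the question uses."""
    return "https://www.google.com/search?q=" + quote_plus(query) + "&source=lnms&tbm=isch"


def fetch_page_source(query, driver_path, scrolls=3, pause=2.0):
    """Load the results page in Chrome, scroll a few times, return the HTML."""
    # Imported here so the URL helper above works without selenium installed.
    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver = webdriver.Chrome(service=Service(driver_path))
    try:
        driver.get(build_search_url(query))
        for _ in range(scrolls):
            # Scrolling to the bottom prompts the page to load more results.
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(pause)
        return driver.page_source
    finally:
        # Always close the browser, even if loading or scrolling fails.
        driver.quit()


print(build_search_url("red panda"))
```

A call such as `fetch_page_source("red panda", "/path/to/chromedriver")` (the path is a placeholder) would return the rendered HTML, which can then be searched for image URLs. How many results load per scroll, and whether Google blocks the automated browser, is not something this sketch can guarantee.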
