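Hi, I'm using Selenium to scrape Google Images, but it isn't working well. How can I get this code to work?
I used to use Google Images Download, but it suddenly stopped working, so I'm looking for a new approach. I hope someone can help me, thank you. My code is below: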
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import json
import os
import urllib.request as urllib2
import argparse

searchterm = 'spider'  # will also be the name of the folder
url = "https://www.google.co.in/search?q=" + searchterm + "&source=lnms&tbm=isch"

# NEED TO DOWNLOAD CHROMEDRIVER, insert path to chromedriver inside parentheses in following line
# (raw string, so the backslashes in the Windows path are not read as escapes)
browser = webdriver.Chrome(r'C:\Python27\Scripts\chromedriver')
browser.get(url)
header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
counter = 0
succounter = 0

if not os.path.exists(searchterm):
    os.mkdir(searchterm)

# Scroll down repeatedly so that more results are loaded
for _ in range(500):
    browser.execute_script("window.scrollBy(0,10000)")

# find_elements_by_xpath is the Selenium 3 API;
# Selenium 4 uses find_elements(By.XPATH, ...) instead
for x in browser.find_elements_by_xpath('//div[contains(@class,"rg_meta")]'):
    counter = counter + 1
    print("Total Count:", counter)
    print("Successful Count:", succounter)
    # Parse the JSON metadata once instead of three times
    meta = json.loads(x.get_attribute('innerHTML'))
    img = meta["ou"]
    imgtype = meta["ity"]
    print("URL:", img)
    try:
        # header is already a dict, so pass it directly as the headers
        req = urllib2.Request(img, headers=header)
        raw_img = urllib2.urlopen(req).read()
        with open(os.path.join(searchterm, searchterm + "_" + str(counter) + "." + imgtype), "wb") as f:
            f.write(raw_img)
        succounter = succounter + 1
    except Exception as e:
        # Report why the download failed instead of hiding the error
        print("can't get img:", e)

print(succounter, "pictures successfully downloaded")
browser.close()
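
I ran into the same problem scraping images from Google with an approach like yours, which relies on rg_meta.
The page source of Google Images search results has changed. Since early 2020 it no longer provides rg_meta, and the rg_meta marker has been replaced by random strings. I think Google has started blocking crawler bots and is pushing people toward the Google Custom Search API.
I decided to scrape images from sites other than Google Images.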
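
If you would rather try the Custom Search API route than scrape the results page, the request side could look roughly like the sketch below. This is only an illustration, not something taken from the posts above: the helper names build_image_search_url and extract_image_links are made up, the key and engine ID are placeholders you would have to create yourself, and it assumes the Custom Search JSON API endpoint with searchType=image and a response whose "items" entries carry a "link" field. It only builds the URL and reads an already decoded response, so it makes no network calls.

```python
from urllib.parse import urlencode

# Endpoint of the Custom Search JSON API
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_image_search_url(api_key, engine_id, query, start=1, num=10):
    """Return the request URL for one page of image results.

    api_key and engine_id are placeholders (hypothetical values here);
    you need your own API key and search engine ID.
    """
    params = {
        "key": api_key,
        "cx": engine_id,
        "q": query,
        "searchType": "image",  # restrict the results to images
        "start": start,         # 1-based index of the first result
        "num": num,             # results per request (at most 10)
    }
    return API_ENDPOINT + "?" + urlencode(params)


def extract_image_links(response):
    """Collect the image URLs from a decoded JSON response (a dict)."""
    return [item["link"] for item in response.get("items", []) if "link" in item]


if __name__ == "__main__":
    # Placeholder credentials: replace them before sending the request
    print(build_image_search_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "spider"))
```

The links returned by extract_image_links could then go through the same urllib download loop as in the question, and you would raise start by num to ask for the next page.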