I'm trying to build my first web scraper with BeautifulSoup in Python. The goal is for the scraper to take user input, run a normal Google Images search, and download the requested number of images. Earlier the results used the rg_meta tag, which changed to the rg_i Q4LuWd class, and I updated the code accordingly, but it still fails to scrape the images. What else needs to change to find and download the images? No errors or exceptions are raised; the program runs but cannot find the image URLs.
import os
import json
import requests          # to send GET requests
from bs4 import BeautifulSoup # to parse HTML
# user can input a topic and a number
# download first n images from google image search
GOOGLE_IMAGE = \
'https://www.google.com/search?site=&tbm=isch&source=hp&biw=1873&bih=990&'
# The User-Agent request header contains a characteristic string
# that allows the network protocol peers to identify the application type,
# operating system, and software version of the requesting software user agent.
# needed for google search
usr_agent = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}
SAVE_FOLDER = 'images'
def main():
    if not os.path.exists(SAVE_FOLDER):
        os.mkdir(SAVE_FOLDER)
    download_images()

def download_images():
    # ask for user input
    data = input('What are you looking for? ')
    n_images = int(input('How many images do you want? '))

    print('Start searching...')

    # get url query string
    searchurl = GOOGLE_IMAGE + 'q=' + data
    print(searchurl)

    # request url; without usr_agent the permission gets denied
    response = requests.get(searchurl, headers=usr_agent)
    html = response.text

    # find all divs where class='rg_i Q4LuWd'
    # NOTE: earlier the results used the 'rg_meta' class, which changed to 'rg_i Q4LuWd'
    soup = BeautifulSoup(html, 'html.parser')
    results = soup.findAll('div', {'class': 'rg_i Q4LuWd'}, limit=n_images)
    print(results)

    # extract the link from the div tag
    imagelinks = []
    for result in results:
        text = result.text              # this is a valid json string
        text_dict = json.loads(text)    # deserialize json to a Python dict
        link = text_dict['ou']
        # image_type = text_dict['ity']
        imagelinks.append(link)

    print(f'found {len(imagelinks)} images')
    print('Start downloading...')

    for i, imagelink in enumerate(imagelinks):
        # open image link and save as file
        response = requests.get(imagelink)
        imagename = SAVE_FOLDER + '/' + data + str(i + 1) + '.jpg'
        with open(imagename, 'wb') as file:
            file.write(response.content)

    print('Done')

if __name__ == '__main__':
    main()
This is because the image URLs are located inside <script> tags; to get them, you need a regex to match, extract, and decode them. (Also, I'm not sure how you want to control the number of output images taken from input().)

The steps are:

1. Find all <script> tags.
2. Match the image data via regex.
3. Match the desired (full-resolution) images via regex.
4. Extract and decode them using bytes() and decode().
5. To save the images, you can use ^{} (more in-depth):
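The steps above can be sketched as follows. Note that the regex pattern and the extract_image_urls helper name are my own assumptions for illustration; the exact shape of the inlined image metadata changes as Google updates the page, so the pattern may need adjusting:

```python
import re

def extract_image_urls(html: str, n_images: int) -> list:
    """Pull full-resolution image URLs out of the page's inlined script data.

    Assumes the metadata appears as ["<escaped url>", height, width] triples,
    which is how it looked at the time of writing.
    """
    # Match entries shaped like ["https://...", 1333, 2000]
    matched = re.findall(r'\["(https?://[^"]+)",\d+,\d+\]', html)

    links = []
    for url in matched:
        # URLs are stored with unicode escapes (e.g. \u003d for '='),
        # so round-trip through bytes() and decode() to unescape them
        links.append(bytes(url, 'ascii').decode('unicode-escape'))
        if len(links) == n_images:   # honor the user-requested count
            break
    return links
```

Slicing with n_images inside the loop is also how you can control the image count taken from input() in the original code.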
Code and full example in the online IDE that scrapes more (try to read it slowly):
Alternatively, you can achieve the same thing by using the Google Images API from SerpApi. It's a paid API with a free plan.

The difference in this case is that you don't have to deal with regex to match and extract the needed data from the page's source code; instead, you only iterate over structured JSON to get the data you want. Code to integrate:
By the way, I wrote a more in-depth blog post about how to scrape Google Images.