How to handle 302 redirects in Scrapy
While crawling a website, the server gave me a 302 response:
2014-04-01 21:31:51+0200 [ahrefs-h] DEBUG: Redirecting (302) to <GET http://www.domain.com/Site_Abuse/DeadEnd.htm> from <GET http://domain.com/wps/showmodel.asp?Type=15&make=damc&a=664&b=51&c=0>
I want to request the URL directly and get it without being redirected. I found this middleware:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/downloadermiddleware/redirect.py#L31
I added this redirect middleware code to my middleware.py file and set it up accordingly in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'street.middlewares.RandomUserAgentMiddleware': 400,
    'street.middlewares.RedirectMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
But I still get redirected. Is that all it takes to make this middleware work, or am I missing something?
6 Answers
I found a way to bypass the redirect, as follows:
1- Check in parse() whether you are being redirected.
2- If you are redirected, arrange a way to simulate escaping this redirection and get back to the URL you need to scrape. You may need to inspect the network behavior in Google Chrome and simulate a POST request to get back to your page.
3- Move to another procedure via a callback and do all the scraping work there, recursively calling itself in a loop, with a condition at the end to break out of the loop.
Below is the example I used to bypass a disclaimer page and return to my main URL to start scraping.
import scrapy
from scrapy.http import FormRequest
import requests

class ScrapeClass(scrapy.Spider):
    name = 'terrascan'
    page_number = 0

    start_urls = [
        # Your main URL, or a list of your URLs, or URLs read from a file into a list
    ]

    def parse(self, response):
        ''' Here I killed the disclaimer page and continued in the proc below with follow !!! '''
        # Get the currently requested URL
        current_url = response.request.url
        # Get all followed redirect URLs
        redirect_url_list = response.request.meta.get('redirect_urls')
        # Get the first URL followed by the spider
        first_redirect_url = response.request.meta.get('redirect_urls')[0]

        # Handle redirection as below (the redirection check comes from redirect.py
        # in the downloadermiddlewares folder)
        allowed_status = (301, 302, 303, 307, 308)
        if 'Location' in response.headers or response.status in allowed_status:  # <== condition for redirection
            print(current_url, '<========= am not redirected @@@@@@@@@@')
        else:
            print(current_url, '<====== kill that please %%%%%%%%%%%%%')

        session_requests = requests.session()
        # Got all the data below from monitoring network behavior in Google Chrome
        # while simulating a click on 'I Agree'
        headers_ = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
                    'ctl00$cphContent$btnAgree': 'I Agree'}
        Post_ = session_requests.post(current_url, headers=headers_)
        print(response.url, '<========= check this please')

        # Hand the post-agreement page back to Scrapy
        return FormRequest.from_response(Post_, callback=self.parse_After_disclaimer)

    def parse_After_disclaimer(self, response):
        print(response.status)
        print(response.url)
        # Put your condition here to make sure the current URL is what you need;
        # otherwise escape again until you kill the redirection
        if response.url not in [your list of URLs]:
            print('I am here brother')
            yield scrapy.Request(Your URL, callback=self.parse_After_disclaimer)
        else:
            # Here you are good to go with the scraping work
            items = TerrascanItem()
            all_td_tags = response.css('td')
            print(len(all_td_tags), 'all_td_results', response.url)

            parcel_No = all_td_tags.css('#ctl00_cphContent_ParcelOwnerInfo1_lbParcelNumber::text').extract()
            Owner_Name = all_td_tags.css('#ctl00_cphContent_ParcelOwnerInfo1_lbOwnerName::text').extract()

            items['parcel_No'] = parcel_No if parcel_No else ''
            yield items

            # Here you put the condition for the recursive call of this process
            ScrapeClass.page_number += 1
            # next_page = 'http://terrascan.whitmancounty.net/Taxsifter/Search/results.aspx?q=[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]&page=' + str(ScrapeClass.page_number) + '&1=1#rslts'
            next_page = Your URLS[ScrapeClass.page_number]
            print('am in page #', ScrapeClass.page_number, '===', next_page)
            if ScrapeClass.page_number < len(ScrapeClass.start_urls) - 1:
                yield response.follow(next_page, callback=self.parse_After_disclaimer)
You can disable the RedirectMiddleware entirely by setting REDIRECT_ENABLED to False in your settings.py file.
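A minimal settings sketch of that (note that HttpErrorMiddleware normally filters out non-2xx responses before they reach your spider, so you will likely also want HTTPERROR_ALLOWED_CODES to actually receive the 302s in your callback):

# settings.py
REDIRECT_ENABLED = False              # turn off the built-in RedirectMiddleware
HTTPERROR_ALLOWED_CODES = [301, 302]  # let redirect responses reach your callbacks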
I had a problem with infinite redirect loops when using HTTPCACHE_ENABLED = True. To avoid the problem, I set HTTPCACHE_IGNORE_HTTP_CODES = [301, 302].
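In settings.py that looks like this (both are standard Scrapy HTTP cache settings; with the second one, redirect responses are never stored in the cache):

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_HTTP_CODES = [301, 302]  # don't cache redirect responses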
If a page that loads fine in a regular browser suddenly returns an incomprehensible 302 response, for instance redirecting you to the home page or some other fixed page, it usually means the server is taking measures against unwanted activity.
You need to lower your crawl rate, or use a smart proxy (e.g. Crawlera) or a proxy-rotation service, and retry your requests when you get such a response.
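For the crawl-rate part, a minimal settings sketch using Scrapy's built-in AutoThrottle extension (the delay values here are illustrative, not a recommendation):

# settings.py
DOWNLOAD_DELAY = 1.0             # minimum delay between requests
AUTOTHROTTLE_ENABLED = True      # adapt the delay to server response times
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0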
To retry such a response, add 'handle_httpstatus_list': [302] to the meta of the source request, check for response.status == 302 in the callback, and if it is 302, retry the request by yielding response.request.replace(dont_filter=True).
When retrying, your code should also limit the maximum number of retries for any given URL. You can use a dictionary to keep track of the retry counts:
from scrapy import Request, Spider

class MySpider(Spider):
    name = 'my_spider'
    max_retries = 2

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.retries = {}

    def start_requests(self):
        yield Request(
            'https://example.com',
            callback=self.parse,
            meta={
                'handle_httpstatus_list': [302],
            },
        )

    def parse(self, response):
        if response.status == 302:
            retries = self.retries.setdefault(response.url, 0)
            if retries < self.max_retries:
                self.retries[response.url] += 1
                yield response.request.replace(dont_filter=True)
            else:
                self.logger.error('%s still returns 302 responses after %s retries',
                                  response.url, retries)
            return
Depending on your use case, you may want to move this code into a downloader middleware.
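A minimal sketch of what that middleware could look like (Retry302Middleware and the myproject.middlewares path are made-up names for illustration; for its process_response to see the 302 before the built-in RedirectMiddleware consumes it, give it a priority number greater than RedirectMiddleware's default of 600):

# middlewares.py -- a sketch, not a drop-in implementation
class Retry302Middleware:
    max_retries = 2

    def __init__(self):
        self.retries = {}

    def process_response(self, request, response, spider):
        if response.status == 302:
            retries = self.retries.setdefault(request.url, 0)
            if retries < self.max_retries:
                self.retries[request.url] += 1
                # returning a Request makes Scrapy re-schedule it
                return request.replace(dont_filter=True)
            spider.logger.error('%s still returns 302 responses after %s retries',
                                request.url, retries)
        return response

# settings.py -- 650 > 600, so this middleware processes responses first
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.Retry302Middleware': 650,
}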
In this case, don't forget about the middleware; the following will do:
meta = {'dont_redirect': True, 'handle_httpstatus_list': [302]}
That is, when you build the request, you need to add a meta argument:
yield Request(item['link'], meta={
    'dont_redirect': True,
    'handle_httpstatus_list': [302]
}, callback=self.your_callback)
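With dont_redirect set, the RedirectMiddleware leaves the 302 response alone, and handle_httpstatus_list lets it through to your callback, where you can read the redirect target yourself if you need it. A minimal sketch of such a callback (your_callback is the name from the request above):

def your_callback(self, response):
    if response.status == 302:
        # the raw Location header holds the redirect target (as bytes)
        location = response.headers.get('Location')
        self.logger.info('Got a 302 pointing to %s', location)
    else:
        # normal parsing goes here
        pass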