How to handle 302 redirects in Scrapy

19 votes
6 answers
37448 views
Asked 2025-04-18 00:54

While crawling a website, the server gave me a 302 response:

2014-04-01 21:31:51+0200 [ahrefs-h] DEBUG: Redirecting (302) to <GET http://www.domain.com/Site_Abuse/DeadEnd.htm> from <GET http://domain.com/wps/showmodel.asp?Type=15&make=damc&a=664&b=51&c=0>

I want to request the original URL directly instead of being redirected. I found this middleware:

https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/downloadermiddleware/redirect.py#L31

I added this redirect code to my middleware.py file and added the corresponding entries to settings.py:

DOWNLOADER_MIDDLEWARES = {
 'street.middlewares.RandomUserAgentMiddleware': 400,
 'street.middlewares.RedirectMiddleware': 100,
 'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}

But I still get redirected. Is that all it takes to make this middleware work? Am I missing something?

6 Answers

2 votes

I found a way to bypass the redirect; here are the steps:

1- In parse(), check whether you were redirected.

2- If you were redirected, arrange a way to simulate skipping that redirect and get back to the URL you need to crawl. You may need to monitor the network behavior in Google Chrome and simulate a POST request to get back to your page.

3- Move to another process with a callback, do all the scraping work inside it, and loop recursively by calling itself, with a condition at the end to break the loop.

Below is the example I used to bypass a disclaimer page and get back to my main URL to start scraping.

from scrapy.http import FormRequest
import scrapy
import requests

from ..items import TerrascanItem  # assumes TerrascanItem is defined in your project's items.py


class ScrapeClass(scrapy.Spider):

    name = 'terrascan'

    page_number = 0

    # your main URL, a list of your URLs, or URLs read from a file into a list
    start_urls = ['YOUR_MAIN_URL']

    # the list of URLs to walk through once the disclaimer has been killed
    start_urls_AfterDisclaimer = ['YOUR_URLS']

    def parse(self, response):

        ''' Here I killed the disclaimer page and continued in the proc below with follow '''

        # the URL that was actually requested
        current_url = response.request.url

        # all redirect URLs followed for this request (None if no redirect happened)
        redirect_url_list = response.request.meta.get('redirect_urls')
        # the first URL the spider requested, if a redirect happened
        first_redirect_url = redirect_url_list[0] if redirect_url_list else None

        # handle redirection as below (the redirect check comes from redirect.py
        # in the \downloadermiddlewares folder)
        allowed_status = (301, 302, 303, 307, 308)
        if 'Location' in response.headers or response.status in allowed_status:
            # the response itself is still a raw redirect
            print(current_url, '<========= raw redirect response @@@@@@@@@@')
        else:
            # the redirect was already followed, so this is the disclaimer page
            # itself; kill it by posting 'I Agree'
            print(current_url, '<====== kill that please %%%%%%%%%%%%%')

            session_requests = requests.session()

            # got all the data below from monitoring network behavior in Google
            # Chrome while simulating a click on 'I Agree'
            headers_ = {
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
                'ctl00$cphContent$btnAgree': 'I Agree',
            }

            Post_ = session_requests.post(current_url, headers=headers_)

            print(response.url, '<========= check this please')

            # Post_ is a requests.Response, not a Scrapy response; it is passed
            # to FormRequest.from_response as-is
            return FormRequest.from_response(Post_, callback=self.parse_After_disclaimer)

    def parse_After_disclaimer(self, response):

        print(response.status)
        print(response.url)

        # put your condition to make sure the current URL is what you need;
        # otherwise escape again until you kill the redirection
        if response.url not in self.start_urls_AfterDisclaimer:
            print('I am here brother')
            yield scrapy.Request('YOUR_URL', callback=self.parse_After_disclaimer)
        else:
            # here you are good to go for the scraping work
            items = TerrascanItem()

            all_td_tags = response.css('td')
            print(len(all_td_tags), 'all_td_results', response.url)

            parcel_No = all_td_tags.css('#ctl00_cphContent_ParcelOwnerInfo1_lbParcelNumber::text').extract()
            Owner_Name = all_td_tags.css('#ctl00_cphContent_ParcelOwnerInfo1_lbOwnerName::text').extract()

            items['parcel_No'] = parcel_No if parcel_No else ''

            yield items

        # here is the condition for the recursive call of this proc
        ScrapeClass.page_number += 1
        if ScrapeClass.page_number < len(ScrapeClass.start_urls_AfterDisclaimer) - 1:
            next_page = ScrapeClass.start_urls_AfterDisclaimer[ScrapeClass.page_number]
            print('am in page #', ScrapeClass.page_number, '===', next_page)
            yield response.follow(next_page, callback=self.parse_After_disclaimer)
2 votes

You can disable the RedirectMiddleware by setting REDIRECT_ENABLED to False in settings.py.
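
For reference, this is a single setting (REDIRECT_ENABLED is a standard Scrapy setting; note that it disables redirect handling for the whole project, so every 3xx response is handed to your callbacks instead of being followed):

# settings.py
REDIRECT_ENABLED = False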

2 votes

I ran into an infinite-loop redirection problem when using HTTPCACHE_ENABLED = True. To avoid it, I set HTTPCACHE_IGNORE_HTTP_CODES = [301, 302].
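
Both are standard Scrapy settings; together they keep the HTTP cache enabled while preventing it from storing redirect responses, so a stale cached 301/302 can no longer bounce the spider in a loop:

# settings.py
HTTPCACHE_ENABLED = True
# don't cache redirect responses
HTTPCACHE_IGNORE_HTTP_CODES = [301, 302]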

6 votes

If a page that loads fine in a regular browser suddenly returns an inexplicable 302 response, such as a redirect to the home page or some other fixed page, it usually means the server is taking measures against unwanted activity.

You need to lower your crawl rate, or use a smart proxy (e.g. Crawlera) or a proxy-rotation service, and retry your requests when you receive such a response.
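
For the proxy part, Scrapy's built-in HttpProxyMiddleware honors a per-request proxy set in request.meta; a minimal sketch, with a placeholder proxy URL (a rotation service would supply a different value per request):

yield Request(
    'https://example.com',
    callback=self.parse,
    # placeholder proxy URL; substitute the one your rotation service provides
    meta={'proxy': 'http://user:pass@proxyhost:8080'},
)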

To retry such a response, add 'handle_httpstatus_list': [302] to the meta of the source request, and check whether response.status == 302 in the callback. If it is, retry your request with response.request.replace(dont_filter=True).

When retrying, your code should also limit the maximum number of retries for any given URL. You can use a dictionary to keep track of the retry counts:

from scrapy import Request, Spider


class MySpider(Spider):
    name = 'my_spider'

    max_retries = 2

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.retries = {}

    def start_requests(self):
        yield Request(
            'https://example.com',
            callback=self.parse,
            meta={
                'handle_httpstatus_list': [302],
            },
        )

    def parse(self, response):
        if response.status == 302:
            retries = self.retries.setdefault(response.url, 0)
            if retries < self.max_retries:
                self.retries[response.url] += 1
                yield response.request.replace(dont_filter=True)
            else:
                self.logger.error('%s still returns 302 responses after %s retries',
                                  response.url, retries)
            return

Depending on your use case, you may want to move this code to a downloader middleware.
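
As a rough illustration of that suggestion, here is a minimal sketch of the same retry logic as a downloader middleware. The class name RetryOn302Middleware and the module path are my own choices, not from the answer:

# middlewares.py -- hypothetical sketch
class RetryOn302Middleware:
    """Retry requests whose responses come back as 302, up to a limit."""

    max_retries = 2

    def __init__(self):
        self.retries = {}  # url -> number of retries so far

    def process_response(self, request, response, spider):
        if response.status == 302:
            retries = self.retries.setdefault(request.url, 0)
            if retries < self.max_retries:
                self.retries[request.url] += 1
                # returning a Request makes Scrapy re-schedule it
                return request.replace(dont_filter=True)
            spider.logger.error('%s still returns 302 responses after %s retries',
                                request.url, retries)
        return response

To let the middleware see raw 302 responses, register it with a priority above 600, which places it closer to the downloader than the built-in RedirectMiddleware (priority 600), so its process_response runs first:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RetryOn302Middleware': 650,
}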

17 votes

Forget about middlewares in this scenario; this will do the trick:

meta = {'dont_redirect': True,'handle_httpstatus_list': [302]}

That is, you need to pass a meta argument when you yield the request:

yield Request(item['link'], meta={
                  'dont_redirect': True,
                  'handle_httpstatus_list': [302]
              }, callback=self.your_callback)
