Spider ignores robots.txt but does not parse the page

Posted 2024-04-23 13:10:13


I am trying to scrape {a1} in a way similar to what I found in one of the answers (see the updated answer). I modified the code slightly to remove deprecated usage.

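At first I ran into a problem with robots.txt blocking me, so I found that I could set ROBOTSTXT_OBEY=False in settings.py. Scrapy does appear to ignore robots.txt now, but for some reason the spider no longer reaches the parse method.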
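For reference, the setting mentioned above is a single line. A minimal sketch, assuming a standard Scrapy project layout with a settings.py module:

```python
# settings.py of the Scrapy project (assumed standard project layout)

# Do not download or honor robots.txt before crawling.
ROBOTSTXT_OBEY = False
```

This only turns off the robots.txt check; it does not change how redirects or non-2xx responses are handled.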

Here is my spider:

from scrapy.http import Request, FormRequest
from scrapy.item import Item, Field
from scrapy.spiders import Spider
import logging

class AcrisItem(Item):
    borough = Field()
    block = Field()


class AcrisSpider(Spider):
    name = "acris"
    allowed_domains = ["a836-acris.nyc.gov"]
    start_urls = ['https://a836-acris.nyc.gov/DS/DocumentSearch/PartyName']

    def start_requests(self):
        return [(Request(url, meta={'dont_redirect': True}, callback=self.parse)) for url in self.start_urls]

    def parse(self, response):
        form_token = response.selector.xpath('//input[@name="__RequestVerificationToken"]/@value').extract_first()

        logging.debug('THE FORM TOKEN IS: %s\n\n' % form_token)

        formdata = {
            "__RequestVerificationToken": form_token,
            "hid_last": "SMITH",
            "hid_first": "JOHN",
            "hid_ml": "",
            "hid_suffix": "",
            "hid_business": "",
            "hid_selectdate": "To Current Date",
            "hid_datefromm": "",
            "hid_datefromd": "",
            "hid_datefromy": "",
            "hid_datetom": "",
            "hid_datetod": "",
            "hid_datetoy": "",
            "hid_partype": "",
            "hid_borough": "All Boroughs/Counties",
            "hid_doctype": "All Document Classes",
            "hid_max_rows": "10",
            "hid_page": "1",
            "hid_partype_name": "All Parties",
            "hid_doctype_name": "All Document Classes",
            "hid_borough_name": "All Boroughs/Counties",
            "hid_ReqID": "",
            "hid_SearchType": "PARTYNAME",
            "hid_ISIntranet": "N",
            "hid_sort": ""
        }

        if form_token:
            yield FormRequest(url="https://a836-acris.nyc.gov/DS/DocumentSearch/PartyNameResult",
                              method="POST",
                              formdata=formdata,
                              meta={'dont_redirect': True},
                              callback=self.parse_page)

    def parse_page(self, response):
        rows = response.selector.xpath('//form[@name="DATA"]/table/tbody/tr[2]/td/table/tbody/tr')

        for row in rows:
            item = AcrisItem()

            borough = row.xpath('.//td[3]/div/font/text()').extract_first()
            block = row.xpath('.//td[4]/div/font/text()').extract_first()

            if borough and block:
                item['borough'] = borough
                item['block'] = block

                yield item

Here is the output (minus the init messages):

[log output not preserved in the source]

1 answer
User
Answer 1 · posted 2024-04-23 13:10:13

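Redirects are disabled directly in start_requests (dont_redirect is set to True), so the 301 response should be delivered to the parse method. Unfortunately, you have not allowed that response status code through: by default, Scrapy's HttpErrorMiddleware drops non-2xx responses before they reach the callback.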
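The filtering decision can be pictured with a small pure-Python model. This is only a simplified sketch, not Scrapy's actual code: reaches_callback is a hypothetical name, and the real middleware also consults settings such as HTTPERROR_ALLOWED_CODES and HTTPERROR_ALLOW_ALL.

```python
def reaches_callback(status, handle_httpstatus_list=(), handle_httpstatus_all=False):
    """Simplified model: True if a response with this status would reach the callback."""
    if 200 <= status < 300:
        return True  # successful responses always pass
    if handle_httpstatus_all:
        return True  # every status code is explicitly allowed
    # Otherwise only explicitly listed status codes pass.
    return status in handle_httpstatus_list


print(reaches_callback(301))                                      # False
print(reaches_callback(301, handle_httpstatus_list=[301, 302]))   # True
print(reaches_callback(301, handle_httpstatus_all=True))          # True
```

With dont_redirect set and nothing allowing 301, the response is neither followed nor handed to parse, which matches the behaviour described in the question.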

Allow it by adding the following attribute to the spider:

class AcrisSpider(Spider):
    ...
    handle_httpstatus_list = [301, 302]
    ...

Alternatively, pass handle_httpstatus_all=True in the request's meta, for example:

[code sample not preserved in the source]
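A minimal sketch of what such a request might look like, assuming the question's start_requests is otherwise kept as is. START_META and start_request_kwargs are hypothetical names introduced here for illustration; only the two meta keys come from Scrapy.

```python
# Meta keys read by Scrapy's redirect and HTTP-error middlewares.
START_META = {
    "dont_redirect": True,          # as in the question: do not follow 3xx responses
    "handle_httpstatus_all": True,  # let every status code reach the callback
}


def start_request_kwargs(url, callback):
    """Hypothetical helper: keyword arguments to pass to scrapy.http.Request."""
    return {"url": url, "callback": callback, "meta": dict(START_META)}


# Inside the spider, start_requests could then read:
#
#     def start_requests(self):
#         return [Request(**start_request_kwargs(url, self.parse))
#                 for url in self.start_urls]
```

Since parse would then also receive the 301 itself, it may be worth checking response.status there before looking for the form token.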
