Callback function not being called in crawler, Scrapy

Posted 2024-05-16 14:23:22


I need to request the links crawled from the site, using my function parsePage as the callback. However, the request is sent only to the first link, only once, and I get no response. Any idea why?

Here is my code:

import scrapy
from scrapy.spiders import CrawlSpider

from diploma.items import DiplomaItem


class diploma(CrawlSpider):
    name = "diploma"
    allowed_domains = "pikabu.ru"
    start_urls = [
        "https://pikabu.ru/hot"
    ]

    def parse(self, response):
        for sel in response.xpath("//div[@class='stories-feed__container']/article[@class='story']"):
            item = DiplomaItem()
            item['MainPageUrl'] = "https://pikabu.ru" + sel.xpath('div[2]/header[@class="story__header"]/h2/a/@href').extract()[0]

            request = scrapy.Request(item['MainPageUrl'], callback=self.parsePage)
            request.meta['item'] = item
            yield request

    def parsePage(self, response):
        print("hHAHAHAHAH")
        item = response.meta['item']
        return item

Here is the log:

2018-03-15 18:11:26 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: diploma)
2018-03-15 18:11:26 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 2.7.10 (default, Feb  7 2017, 00:08:15) - [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g  2 Nov 2017), cryptography 2.1.4, Platform Darwin-16.7.0-x86_64-i386-64bit
2018-03-15 18:11:26 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'diploma.spiders', 'SPIDER_MODULES': ['diploma.spiders'], 'CONCURRENT_REQUESTS': 250, 'DOWNLOAD_DELAY': 5, 'BOT_NAME': 'diploma'}
2018-03-15 18:11:26 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-03-15 18:11:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-15 18:11:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-15 18:11:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-03-15 18:11:26 [scrapy.core.engine] INFO: Spider opened
2018-03-15 18:11:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-15 18:11:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-15 18:11:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pikabu.ru/hot> (referer: None)
~~~
/story/kak_pogoda_50_na_50_5777191
2018-03-15 18:11:27 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'pikabu.ru': <GET https://pikabu.ru/story/kak_pogoda_50_na_50_5777191>
~~~
/story/chto_mozhet_poyti_ne_tak_5773824
~~~
/story/strannyiy_chelovek_5777133
~~~
/story/kak_ya_zabiral_ayfon_s_pochtyi_rossii_ili_khitryie_kitaytsyi_5776835
~~~
/story/kopirayterskie_slozhnosti_ch14_5776220
~~~
/story/novyiy_televizor_samsung_mozhet_slivatsya_s_poverkhnostyu_5775567
~~~
/story/neobyichnyiy_vkhod_v_podezd_5767500
~~~
/story/muzhchina_khotel_brosit_rabotu_chtobyi_ukhazhivat_za_bolnyim_rakom_syinom_no_kollegi_otrabotali_za_nego_3300_chasov_5770070
~~~
/story/kak_ya_uchilsya_khodit_5776376
~~~
/story/zabavnoe_dialogi_s_zakazchikami_5_5777655
~~~
/story/pro_metallurga_iz_magnitogorska_snyali_yepichnuyu_korotkometrazhku_5774307
~~~
/story/lovkost_ruk_i_nikakogo_moshennichestva_5777007
~~~
/story/kogda_nashelsya_novyiy_sponsor_5769282
~~~
/story/nikto_ne_chitaet_kharakteristiki_5771821
~~~
/story/posmotrite_na_yeti_shedevryi_5777462
2018-03-15 18:11:27 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-15 18:11:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 211,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 39452,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 15, 12, 11, 27, 860712),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'memusage/max': 46387200,
 'memusage/startup': 46387200,
 'offsite/domains': 1,
 'offsite/filtered': 21,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 3, 15, 12, 11, 26, 740826)}
2018-03-15 18:11:27 [scrapy.core.engine] INFO: Spider closed (finished)

As you can see, the callback parsePage is never called after the request is yielded. The log also shows about 20 links (the print statements producing them are not shown in the code), but the request is sent only to the first link, and only once. Why?


1 Answer

Posted 2024-05-16 14:23:22

Add this to your code:

allowed_domains = ["pikabu.ru"]

For more details, see the Scrapy documentation on allowed_domains.
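A likely reason the bare-string form misbehaves (a minimal sketch of the underlying Python behavior, not Scrapy's actual middleware code): iterating over a string yields individual characters, so code that loops over allowed_domains never sees the domain as a whole:

```python
# Iterating a bare string yields characters, not one domain --
# which is why allowed_domains must be a list (or tuple) of strings.
as_string = "pikabu.ru"
as_list = ["pikabu.ru"]

print(list(as_string))  # ['p', 'i', 'k', 'a', 'b', 'u', '.', 'r', 'u']
print(list(as_list))    # ['pikabu.ru']
```

This matches the log above: every request to pikabu.ru is dropped by the offsite middleware (offsite/filtered: 21), because no whole-domain entry ever matched.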

For your links, try this instead; it is more robust than concatenating strings by hand:

link = urljoin('https://pikabu.ru/', link)

For more information, see the urllib.parse.urljoin documentation.
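A quick check of urljoin behavior (note that the base URL needs a scheme; with a scheme-less base like 'pikabu.ru' the host is silently dropped):

```python
from urllib.parse import urljoin

# With a full base URL, the join produces an absolute link.
print(urljoin('https://pikabu.ru/', '/story/kak_pogoda_50_na_50_5777191'))
# -> https://pikabu.ru/story/kak_pogoda_50_na_50_5777191

# A scheme-less base is treated as a relative path, so the host is lost.
print(urljoin('pikabu.ru', '/story/kak_pogoda_50_na_50_5777191'))
# -> /story/kak_pogoda_50_na_50_5777191
```

Inside a Scrapy callback you can also use response.urljoin(link), which resolves a relative link against the URL of the page being parsed.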

Add this to your request:

dont_filter=True

dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
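To illustrate what the duplicates filter quoted above does, here is a minimal sketch of scheduler-style deduplication (illustrative only, not Scrapy's actual RFPDupeFilter implementation): each URL is admitted once, and a dont_filter flag bypasses the check entirely:

```python
# Minimal sketch of a scheduler-side duplicates filter (illustrative only,
# not Scrapy's actual RFPDupeFilter implementation).
seen = set()

def should_schedule(url, dont_filter=False):
    """Return True if the request should be scheduled."""
    if dont_filter:
        return True          # bypass deduplication entirely
    if url in seen:
        return False         # duplicate -> dropped silently
    seen.add(url)
    return True

print(should_schedule("https://pikabu.ru/hot"))                    # True (first time)
print(should_schedule("https://pikabu.ru/hot"))                    # False (duplicate)
print(should_schedule("https://pikabu.ru/hot", dont_filter=True))  # True (bypassed)
```

Note, however, that the log above shows requests being dropped by the offsite middleware, not the duplicates filter, so fixing allowed_domains is the essential change here; dont_filter only helps if you deliberately need to re-request the same URL.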
