Scrapy not working (noob level): Crawled 0 pages, scraped 0 items


I've been trying to follow this shoddy tutorial, but I'm stuck and can't figure out where I went wrong.
The spider runs, but it crawls 0 pages and scrapes 0 items.

I get the following output:

C:\Users\xxx\allegro>scrapy crawl AllegroPrices
2017-12-10 22:25:14 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: AllegroPrices)
2017-12-10 22:25:14 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'allegro.spiders', 'SPIDER_MODULES': ['allegro.spiders'], 'ROBOTSTXT_OBEY': True, 'LOG_LEVEL': 'INFO', 'BOT_NAME': 'AllegroPrices'}
2017-12-10 22:25:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-12-10 22:25:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-10 22:25:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'allegro.middlewares.AllegroSpiderMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-10 22:25:15 [scrapy.middleware] INFO: Enabled item pipelines:
['allegro.pipelines.AllegroPipeline']
2017-12-10 22:25:15 [scrapy.core.engine] INFO: Spider opened
2017-12-10 22:25:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-10 22:25:15 [AllegroPrices] INFO: Spider opened: AllegroPrices
2017-12-10 22:25:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-12-10 22:25:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 12, 10, 21, 25, 15, 527000),
 'log_count/INFO': 8,
 'start_time': datetime.datetime(2017, 12, 10, 21, 25, 15, 517000)}
2017-12-10 22:25:15 [scrapy.core.engine] INFO: Spider closed (finished)

My spider file:

[spider code not preserved in the archived post]

Settings:

# -*- coding: utf-8 -*-

# Scrapy settings for allegro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'AllegroPrices'

SPIDER_MODULES = ['allegro.spiders']
NEWSPIDER_MODULE = 'allegro.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'allegro (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'allegro.middlewares.AllegroSpiderMiddleware': 543,
}

LOG_LEVEL = 'INFO'

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'allegro.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'allegro.pipelines.AllegroPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Pipeline:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class AllegroPipeline(object):
    def process_item(self, item, spider):
        return item

Items:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class AllegroItem(scrapy.Item):
    # define the fields for your item here like:
    product_name = scrapy.Field()
    product_sale_price = scrapy.Field()
    product_seller = scrapy.Field()

1 Answer

I wrote a standalone script (without a project) and it crawls and saves the items without any problem.

I didn't need to change the USER_AGENT.

Maybe there is a problem in one of your settings. You didn't include the URL of the tutorial, so I can't check it.

Or you may simply have wrong indentation, with start_urls and parse() not inside the class. Indentation is very important in Python.
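
To illustrate (a minimal sketch of my own, not your code; the class names here are made up):

import scrapy

# Wrong: start_urls and parse() sit at module level, so the class body is
# effectively empty. Scrapy opens the spider with no start requests and
# closes it immediately, which matches the "Crawled 0 pages" log above.
class BrokenSpider(scrapy.Spider):
    name = "broken"

start_urls = ["http://allegro.pl/"]

def parse(self, response):
    pass

# Right: both are indented into the class body.
class FixedSpider(scrapy.Spider):
    name = "fixed"
    start_urls = ["http://allegro.pl/"]

    def parse(self, response):
        yield {"url": response.url}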

BTW: you forgot the /a/ in the XPath for the seller.
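
Side by side (the div class comes from your page markup; presumably the span with the seller's name is nested inside an a element, which is why the extra step is needed):

# misses the seller: no /a/ step
seller = response.xpath('//div[@class="btn btn-default btn-user"]/span/text()').extract()
# matches it
seller = response.xpath('//div[@class="btn btn-default btn-user"]/a/span/text()').extract()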

import scrapy

#class AllegroItem(scrapy.Item):
#    product_name = scrapy.Field()
#    product_sale_price = scrapy.Field()
#    product_seller = scrapy.Field()

class AllegroPrices(scrapy.Spider):

    name = "AllegroPrices"
    allowed_domains = ["allegro.pl"]

    start_urls = [
        "http://allegro.pl/diablo-ii-lord-of-destruction-2-pc-big-box-eng-i6896736152.html",
        "http://allegro.pl/diablo-ii-2-pc-dvd-box-eng-i6961686788.html",
        "http://allegro.pl/star-wars-empire-at-war-2006-dvd-box-i6995651106.html",
        "http://allegro.pl/heavy-gear-ii-2-pc-eng-cdkingpl-i7059163114.html"
    ]

    def parse(self, response):
        title = response.xpath('//h1[@class="title"]//text()').extract()
        sale_price = response.xpath('//div[@class="price"]//text()').extract()
        seller = response.xpath('//div[@class="btn btn-default btn-user"]/a/span/text()').extract()

        title = title[0].strip()

        print(title, sale_price, seller)

        yield {'title': title, 'price': sale_price, 'seller': seller}

        #items = AllegroItem()
        #items['product_name'] = ''.join(title).strip()
        #items['product_sale_price'] = ''.join(sale_price).strip()
        #items['product_seller'] = ''.join(seller).strip()
        #yield items

#  - run it as a standalone script (without a project) and save to CSV  -

from scrapy.crawler import CrawlerProcess

#c = CrawlerProcess()

c = CrawlerProcess({
#    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'output.csv'
})

c.crawl(AllegroPrices)
c.start()
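
Save it all as one file (e.g. allegro_prices.py, the name is mine) and run it with plain "python allegro_prices.py". CrawlerProcess starts the Twisted reactor itself, so you don't need "scrapy crawl" or a project scaffold, and the scraped rows end up in output.csv.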

CSV result:

[CSV output not preserved in the archived post]
