Scrapy image download problem in production

I have a Scrapy script that downloads images from a website. It works perfectly on my local machine and it also seems to run on the production server, but although I don't get any errors, the images are not being saved.

This is the output on the production server:

2013-07-10 05:12:33+0200 [scrapy] INFO: Scrapy 0.16.5 started (bot: mybot)
2013-07-10 05:12:33+0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-10 05:12:33+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-10 05:12:33+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-10 05:12:33+0200 [scrapy] DEBUG: Enabled item pipelines: CustomImagesPipeline
2013-07-10 05:12:33+0200 [bh] INFO: Spider opened
2013-07-10 05:12:33+0200 [bh] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-10 05:12:33+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-10 05:12:33+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-10 05:12:34+0200 [bh] DEBUG: Crawled (200) <GET http://www.mysite.com/find/brands.jsp> (referer: None)
2013-07-10 05:12:37+0200 [bh] DEBUG: Crawled (200) <GET http://www.mysite.com/c/browse/BrandName/ci/5732/N/4232860366> (referer: http://www.mysite.com/find/brands.jsp)
2013-07-10 05:12:41+0200 [bh] DEBUG: Crawled (200) <GET http://www.mysite.com/c/browse/Accessories-for-Camcorders/ci/5766/N/4232860347> (referer: http://www.mysite.com/c/browse/BrandName/ci/5732/N/4232860366)
2013-07-10 05:12:44+0200 [bh] DEBUG: Crawled (200) <GET http://www.mysite.com/c/buy/CategoryName/ci/5786/N/4232860316> (referer: http://www.mysite.com/c/browse/BrandName/ci/5732/N/4232860366)
2013-07-10 05:12:46+0200 [bh] DEBUG: Crawled (200) <GET http://www.mysite.com/images/images500x500/927001.jpg> (referer: None)
2013-07-10 05:12:46+0200 [bh] DEBUG: Image (downloaded): Downloaded image from <GET http://www.mysite.com/images/images500x500/927001.jpg> referred in <None>
2013-07-10 05:12:46+0200 [bh] DEBUG: Scraped from <200 http://www.mysite.com/c/buy/CategoryName/ci/5786/N/4232860316>
    {'code': u'RFE234',
     'image_urls': u'http://www.mysite.com/images/images500x500/927001.jpg',
     'images': []}
2013-07-10 05:12:50+0200 [bh] DEBUG: Crawled (200) <GET http://www.mysite.com/images/images500x500/896290.jpg> (referer: None)
2013-07-10 05:12:50+0200 [bh] DEBUG: Image (downloaded): Downloaded image from <GET http://www.mysite.com/images/images500x500/896290.jpg> referred in <None>
2013-07-10 05:12:50+0200 [bh] DEBUG: Scraped from <200 http://www.mysite.com/c/buy/CategoryName/ci/5786/N/4232860316>
    {'code': u'ABCD123',
     'image_urls': u'http://www.mysite.com/images/images500x500/896290.jpg',
     'images': []}
2013-07-10 05:13:18+0200 [bh] INFO: Closing spider (finished)
2013-07-10 05:13:18+0200 [bh] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 11107,
     'downloader/request_count': 14,
     'downloader/request_method_count/GET': 14,
     'downloader/response_bytes': 527125,
     'downloader/response_count': 14,
     'downloader/response_status_count/200': 14,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 7, 10, 3, 13, 18, 673536),
     'image_count': 10,
     'image_status_count/downloaded': 10,
     'item_scraped_count': 10,
     'log_count/DEBUG': 40,
     'log_count/INFO': 4,
     'request_depth_max': 2,
     'response_received_count': 14,
     'scheduler/dequeued': 4,
     'scheduler/dequeued/memory': 4,
     'scheduler/enqueued': 4,
     'scheduler/enqueued/memory': 4,
     'start_time': datetime.datetime(2013, 7, 10, 3, 12, 33, 367609)}
2013-07-10 05:13:18+0200 [bh] INFO: Spider closed (finished)

The difference I notice is that the 'images' field of my item is an empty list [], whereas locally it normally looks like this:

(local output snippet not preserved in this copy of the post)
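For reference, with an images pipeline like the one shown in Update 2 below, a successfully processed item normally carries one dict per stored image in its 'images' field, roughly like this (the checksum value is purely illustrative; the path follows the image_key override shown later):

{'code': u'RFE234',
 'image_urls': u'http://www.mysite.com/images/images500x500/927001.jpg',
 'images': [{'checksum': 'aabbccddeeff00112233445566778899',
             'path': 'bh/RFE234.jpg',
             'url': u'http://www.mysite.com/images/images500x500/927001.jpg'}]}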

The main problem is that the output shows no errors, so I don't know where to start debugging.

I have updated PIL; both environments run the same Scrapy 0.16.5 and Python 2.7.

Update 1 — the DEBUG message added in CustomImagesPipeline.get_media_requests does show up in the production log:

...
2013-07-10 06:48:50+0200 [scrapy] DEBUG: This is a DEBUG on CustomImagesPipeline !!
...

Update 2 — I created CustomImagesPipeline so the images are saved with the product code as the filename. I copied the code from ImagesPipeline and only made a few changes:

from scrapy import log
from twisted.internet import defer, threads
from scrapy.http import Request
from cStringIO import StringIO
from PIL import Image
import time
from scrapy.contrib.pipeline.images import ImagesPipeline, ImageException  # ImageException is raised below, so it must be imported too


class CustomImagesPipeline(ImagesPipeline):

    def image_key(self, url, image_name):
        # Changed from the stock pipeline: store the file as bh/<product code>.jpg
        # instead of the default SHA1-hash-based path.
        path = 'bh/%s.jpg' % image_name
        return path

    def get_media_requests(self, item, info):
        log.msg("This is a DEBUG on CustomImagesPipeline !! ", level=log.DEBUG)
        # 'image_urls' holds a single URL string here (see the scraped items above),
        # so one request is yielded with the product code passed along in meta.
        yield Request(item['image_urls'], meta=dict(image_name=item['code']))

    def get_images(self, response, request, info):
        key = self.image_key(request.url, request.meta.get('image_name'))
        orig_image = Image.open(StringIO(response.body))

        width, height = orig_image.size
        if width < self.MIN_WIDTH or height < self.MIN_HEIGHT:
            raise ImageException("Image too small (%dx%d < %dx%d)" % (width, height, self.MIN_WIDTH, self.MIN_HEIGHT))

        image, buf = self.convert_image(orig_image)
        yield key, image, buf

        for thumb_id, size in self.THUMBS.iteritems():
            thumb_key = self.thumb_key(request.url, thumb_id)
            thumb_image, thumb_buf = self.convert_image(image, size)
            yield thumb_key, thumb_image, thumb_buf

    def media_downloaded(self, response, request, info):
        referer = request.headers.get('Referer')

        if response.status != 200:
            log.msg(format='Image (code: %(status)s): Error downloading image from %(request)s referred in <%(referer)s>',
                    level=log.WARNING, spider=info.spider,
                    status=response.status, request=request, referer=referer)
            raise ImageException('download-error')

        if not response.body:
            log.msg(format='Image (empty-content): Empty image from %(request)s referred in <%(referer)s>: no-content',
                    level=log.WARNING, spider=info.spider,
                    request=request, referer=referer)
            raise ImageException('empty-content')

        status = 'cached' if 'cached' in response.flags else 'downloaded'
        log.msg(format='Image (%(status)s): Downloaded image from %(request)s referred in <%(referer)s>',
                level=log.DEBUG, spider=info.spider,
                status=status, request=request, referer=referer)
        self.inc_stats(info.spider, status)

        try:
            key = self.image_key(request.url, request.meta.get('image_name'))
            checksum = self.image_downloaded(response, request, info)
        except ImageException as exc:
            whyfmt = 'Image (error): Error processing image from %(request)s referred in <%(referer)s>: %(errormsg)s'
            log.msg(format=whyfmt, level=log.WARNING, spider=info.spider,
                    request=request, referer=referer, errormsg=str(exc))
            raise
        except Exception as exc:
            whyfmt = 'Image (unknown-error): Error processing image from %(request)s referred in <%(referer)s>'
            log.err(None, whyfmt % {'request': request, 'referer': referer}, spider=info.spider)
            raise ImageException(str(exc))

        return {'url': request.url, 'path': key, 'checksum': checksum}

    def media_to_download(self, request, info):
        def _onsuccess(result):
            if not result:
                return  # returning None forces a fresh download

            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None forces a fresh download

            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.EXPIRES:
                return  # returning None forces a fresh download

            referer = request.headers.get('Referer')
            log.msg(format='Image (uptodate): Downloaded %(medianame)s from %(request)s referred in <%(referer)s>',
                    level=log.DEBUG, spider=info.spider,
                    medianame=self.MEDIA_NAME, request=request, referer=referer)
            self.inc_stats(info.spider, 'uptodate')

            checksum = result.get('checksum', None)
            return {'url': request.url, 'path': key, 'checksum': checksum}

        key = self.image_key(request.url, request.meta.get('image_name'))
        dfd = defer.maybeDeferred(self.store.stat_image, key, info)
        dfd.addCallbacks(_onsuccess, lambda _: None)
        dfd.addErrback(log.err, self.__class__.__name__ + '.store.stat_image')
        return dfd
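
For completeness, a pipeline like this has to be enabled in settings.py and pointed at a writable storage directory; a minimal sketch for Scrapy 0.16 (the module path and the storage directory below are assumptions):

# settings.py (Scrapy 0.16: ITEM_PIPELINES is a list, not a dict)
ITEM_PIPELINES = ['myproject.pipelines.CustomImagesPipeline']  # assumed module path

# Directory where images are written; it must exist and be writable by the
# user that runs the spider on the production server.
IMAGES_STORE = '/srv/myproject/images'  # assumed path

# Optional: how long (in days) a stored image is considered fresh.
IMAGES_EXPIRES = 90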

Local system: Mac OS X. Production server: Debian GNU/Linux 7 (wheezy).


2 Answers

From the docs:

The images in the list of the images field will retain the same order of the original image_urls field. If some image failed downloading, an error will be logged and the image won’t be present in the images field.

It seems that logging must be explicitly enabled, for example:

from scrapy import log
log.msg("This is a warning", level=log.WARNING)

So enable logging and edit your question to include the errors you are getting.
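
If you also want the failure to surface on the item itself, a minimal sketch (assuming Scrapy 0.16's MediaPipeline result format of (success, value) tuples) is to override item_completed in the pipeline above:

    def item_completed(self, results, item, info):
        # 'results' holds one (success, value) tuple per image request:
        # on success 'value' is a dict with 'url', 'path' and 'checksum',
        # on failure it is a Twisted Failure describing the error.
        for ok, value in results:
            if not ok:
                log.msg('Image failed for %s: %s' % (item.get('code'), value),
                        level=log.WARNING, spider=info.spider)
        item['images'] = [value for ok, value in results if ok]
        return item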

We solved the problem by reinstalling these packages:

apt-get install python
apt-get install python-scrapy
apt-get install python-openssl
apt-get install python-pip
# Remove scrapy because it was probably corrupted
apt-get remove python-scrapy
apt-get install python-dev
pip install Scrapy
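
After reinstalling, a quick sanity check that the image stack imports cleanly under the Python the spider uses (Python 2 syntax, matching the versions in the question) could be:

# run with the production server's Python
import scrapy
from PIL import Image
from scrapy.contrib.pipeline.images import ImagesPipeline  # fails if PIL or scrapy is broken

print "Scrapy:", scrapy.__version__        # should print 0.16.5 here
print "PIL Image module:", Image.__file__  # confirms which PIL install is picked up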

Thanks!
