I don't understand why my pipeline isn't saving files. Here is the code:
import os

from scrapy.http import Request
from scrapy.contrib.pipeline.media import MediaPipeline

VIDEOS_DIR = '/home/dmitry/videos'

class VideoDownloadPipeline(MediaPipeline):
    def get_media_requests(self, item, info):
        # Schedule the video URL for download, keeping the item in meta.
        return Request(item['file'], meta={'item': item})

    def media_downloaded(self, response, request, info):
        # Write the downloaded body to disk under its original basename.
        item = response.meta.get('item')
        video_basename = item['file'].split('/')[-1]
        new_filename = os.path.join(VIDEOS_DIR, video_basename)
        with open(new_filename, 'wb') as f:
            f.write(response.body)

    def item_completed(self, results, item, info):
        # Replace the URL with the saved file's basename.
        item['file'] = item['file'].split('/')[-1]
        return item
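As a side note (this helper is not part of my project, just an illustration): `MediaPipeline` passes `item_completed` a list of `(success, value)` 2-tuples, one per request returned by `get_media_requests`, so a failed download can be detected there instead of silently returning the item. A minimal sketch with a hypothetical helper name:

```python
# Hypothetical helper: `results` is the list of (success, value) 2-tuples
# that MediaPipeline passes to item_completed; `success` is False when
# the download failed (e.g. on ConnectionLost or a timeout).
def all_downloads_succeeded(results):
    return all(success for success, value in results)

print(all_downloads_succeeded([(True, {'url': 'http://somesite/video.mp4'})]))  # True
print(all_downloads_succeeded([(False, 'twisted Failure: ConnectionLost')]))    # False
```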
Before this I had some other code, but it was not concurrent, so I had to wait for each video to finish downloading before continuing to parse.
Here is my settings.py:
import os

PROJECT_ROOT = os.path.abspath(os.path.dirname(__file__))

BOT_NAME = 'videos_parser'

SPIDER_MODULES = ['videos_parser.spiders']
NEWSPIDER_MODULE = 'videos_parser.spiders'

ITEM_PIPELINES = {
    'videos_parser.pipelines.VideoFileSizePipeline': 300,
    'videos_parser.pipelines.VideoExistingInDBPipeline': 350,
    'videos_parser.pipelines.VideoModeratePipeline': 400,
    'videos_parser.pipelines.VideoDownloadPipeline': 500,
    'videos_parser.pipelines.JsonWriterPipeline': 800,
}

EXTENSIONS = {
    'scrapy.contrib.closespider.CloseSpider': 100,
}

CLOSESPIDER_ITEMCOUNT = 50
DOWNLOAD_TIMEOUT = 60
UPDATE

I added some log.msg() statements to get_media_requests and media_downloaded, and as far as I can tell, get_media_requests is called but media_downloaded is not, because of:
2014-07-23 08:58:20+0400 [xhamster] DEBUG: Retrying <GET http://somesite/video.mp4> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
But I can download this file with a browser.

It turned out I had simply missed a line in the log saying the request was dropped by the spider because of DOWNLOAD_TIMEOUT.