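I run a Scrapy spider from a cron job to crawl a website and store the results in MongoDB. When I run a regular scrapy crawl, it works and saves to MongoDB. When I run it through cron, however, nothing is saved to the database. The log output shows the usual crawl results; they just don't get saved to MongoDB. What am I missing? My guess is that it has something to do with Scrapy's environment, because I can call the Mongo save() inside a single spider and get the desired result, but not when I put it in the pipeline.
Thanks!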
**crontab -e**
PATH=/home/ubuntu/crawlers/env/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
*/15 * * * * /home/ubuntu/crawlers/env/bin/python3 /home/ubuntu/crawlers/spider/evilscrapy/evilscrapy/run.py > /tmp/output
**pipeline**
import datetime

from pymongo import MongoClient
from scrapy.conf import settings


class EvilscrapyPipeline(object):
    def __init__(self):
        # Connect using the host and port from the Scrapy settings.
        connection = MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        self.log_record(item)
        print(item)
        # Only store complete items whose URL is not in the collection yet.
        if item['url']:
            if self.collection.find({"url": item['url']}).count() == 0:
                if item['title']:
                    if item['content']:
                        item['timestamp'] = datetime.datetime.now()
                        self.collection.insert(item)
        return item
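Comparing the output of running '/home/ubuntu/crawlers/env/bin/python3 /home/ubuntu/crawlers/spider/evilscrapy/evilscrapy/run.py > /tmp/output' in my terminal with the output from the cron job shows that, under cron, the process does not get past the MongoDB commands.
Specifically, inside link_spider, the log appears to stop right after the MongoDB call.
My mongo_connector file: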
import json
import os
import sys

from pymongo import MongoClient
from scrapy.conf import settings


def check_mongo(url):
    connection = MongoClient()
    db = connection[settings['MONGODB_DB']]
    collection = db[settings['MONGODB_COLLECTION']]
    if collection.find({"url": url}).count() != 0:
        return False
    else:
        return True
And the settings:
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = 'articles'
MONGODB_COLLECTION = 'articles_data'
In mongod.log:
2017-05-01T21:12:40.926+0000 I CONTROL [main] ***** SERVER RESTARTED *****
2017-05-01T21:12:40.932+0000 I CONTROL [initandlisten] MongoDB starting : pid=4249 port=27017 dbpath=/var/lib/mongodb 64-bit host=ubuntu
2017-05-01T21:12:40.932+0000 I CONTROL [initandlisten] db version v3.2.12
2017-05-01T21:12:40.932+0000 I CONTROL [initandlisten] git version: ef3e1bc78e997f0d9f22f45aeb1d8e3b6ac14a14
2017-05-01T21:12:40.932+0000 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.2g 1 Mar 2016
2017-05-01T21:12:40.932+0000 I CONTROL [initandlisten] allocator: tcmalloc
2017-05-01T21:12:40.932+0000 I CONTROL [initandlisten] modules: none
2017-05-01T21:12:40.932+0000 I CONTROL [initandlisten] build environment:
2017-05-01T21:12:40.932+0000 I CONTROL [initandlisten] distmod: ubuntu1604
2017-05-01T21:12:40.932+0000 I CONTROL [initandlisten] distarch: x86_64
2017-05-01T21:12:40.932+0000 I CONTROL [initandlisten] target_arch: x86_64
2017-05-01T21:12:40.932+0000 I CONTROL [initandlisten] options: { config: "/etc/mongod.conf", net: { bindIp: "127.0.0.1", port: 27017 }, storage: { dbPath: "/var/lib/mongo$
2017-05-01T21:12:40.961+0000 I - [initandlisten] Detected data files in /var/lib/mongodb created by the 'wiredTiger' storage engine, so setting the active storage en$
2017-05-01T21:12:40.961+0000 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=4G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics$
2017-05-01T21:12:41.300+0000 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/var/lib/mongodb/diagnostic.data'
2017-05-01T21:12:41.300+0000 I NETWORK [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2017-05-01T21:12:41.301+0000 I NETWORK [initandlisten] waiting for connections on port 27017
2017-05-02T19:52:06.590+0000 I COMMAND [conn46674] killcursors: found 0 of 1
2017-05-02T19:52:06.590+0000 I COMMAND [conn46674] killcursors: found 0 of 1
2017-05-02T20:08:58.458+0000 I COMMAND [conn46674] killcursors: found 0 of 1
2017-05-02T20:08:58.458+0000 I COMMAND [conn46674] killcursors: found 0 of 1
2017-05-02T20:08:58.458+0000 I COMMAND [conn46674] killcursors: found 0 of 1
2017-05-02T20:21:39.076+0000 I COMMAND [conn46674] killcursors: found 0 of 1
2017-05-02T20:21:39.076+0000 I COMMAND [conn46674] killcursors: found 0 of 1
2017-05-02T20:21:39.076+0000 I COMMAND [conn46674] killcursors: found 0 of 1
2017-05-02T20:21:39.076+0000 I COMMAND [conn46674] killcursors: found 0 of 1
2017-05-02T21:33:09.651+0000 I COMMAND [conn46674] killcursors: found 0 of 1
2017-05-02T21:33:09.651+0000 I COMMAND [conn46674] killcursors: found 0 of 1
2017-05-02T21:33:09.651+0000 I COMMAND [conn46674] killcursors: found 0 of 1
2017-05-02T21:33:09.651+0000 I COMMAND [conn46674] killcursors: found 0 of 1
2017-05-02T22:01:53.036+0000 I COMMAND [conn46674] killcursors: found 0 of 1
2017-05-02T22:01:53.036+0000 I COMMAND [conn46674] killcursors: found 0 of 1
2017-05-02T22:01:53.036+0000 I COMMAND [conn46674] killcursors: found 0 of 1
2017-05-02T22:01:53.036+0000 I COMMAND [conn46674] killcursors: found 0 of 1
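You are right: processes started by crontab get their own minimal environment. This often causes problems when launching a complex process that depends on specific environment variables.
To fix this, try adding '. $HOME/.profile' in front of the command in your crontab.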
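As a rough sketch, the entry from the question might then look like the one below. This assumes the variables the spider depends on are set in ~/.profile, which the question does not confirm. The '2>&1' at the end is my own addition, not part of the original entry: it sends error output to /tmp/output as well, which may show why the run stops at the MongoDB call.

```
# Hypothetical variant of the entry from the question: only the
# ". $HOME/.profile;" prefix and the trailing "2>&1" are new.
*/15 * * * * . $HOME/.profile; /home/ubuntu/crawlers/env/bin/python3 /home/ubuntu/crawlers/spider/evilscrapy/evilscrapy/run.py > /tmp/output 2>&1
```

Cron hands the whole command to /bin/sh, so the profile is sourced in the same shell that then starts the spider.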