I'm using Scrapy as a crawler in Python. My problem is that I can't start several crawl jobs at the same time.
getJobs:
def getJobs(self):
    mysql = MysqlConnector.Mysql()
    db = mysql.getConnection()
    cur = db.cursor()
    cur.execute("SELECT * FROM job WHERE status=0 OR days>0")
    print "Get new jobs"
    # Build a JobModel for each row
    joblist = []
    for row in cur.fetchall():
        job = JobModel.JobModel()
        job.id = row[0]
        job.user_id = row[1]
        job.name = row[2]
        job.url = row[3]
        job.api = row[4]
        job.max_pages = row[5]
        job.crawl_depth = row[6]
        job.processing_patterns = row[7]
        job.status = row[8]
        job.days = row[9]
        job.ajax = row[11]
        joblist.append(job)
    # Process each job in its own thread
    for job in joblist:
        processJob = ProcessJob.ProcessJob()
        th = Thread(target=processJob.processJob, args=(job,))
        th.daemon = True
        th.start()
    db.close()
ProcessJob
getJobs retrieves new jobs from the database every 5 seconds and hands them to processJob. The problem is that as soon as I start more than one crawl job, I get the following exception:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/Users/fabianlurz/c_crawler/c_crawler/jobs/ProcessJob.py", line 31, in processJob
    reactor.run(0)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 682, in startRunning
    raise error.ReactorAlreadyRunning()
I already know that the reactor can't be started twice, but there has to be a way to run multiple crawl instances on one "server". How can I do that?
Got it working
Spawning multiple processes with billiard solved it: each crawl runs in its own process, so each one gets its own Twisted reactor.
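A minimal sketch of that approach. billiard is Celery's fork of multiprocessing with a largely compatible API, so the stdlib `multiprocessing` module is used here to keep the example self-contained; `run_crawl` and `process_jobs` are hypothetical stand-ins for the real `ProcessJob` logic, not code from the question.

```python
# Sketch: run each crawl job in its own OS process instead of a thread.
# Each child process then owns a fresh Twisted reactor, so calling
# reactor.run() per job no longer raises ReactorAlreadyRunning.
from multiprocessing import Process


def run_crawl(job_name):
    # Placeholder for the real work: in the actual crawler this would
    # set up the Scrapy crawl for the given job and start the reactor
    # inside this child process.
    return "crawled %s" % job_name


def process_jobs(job_names):
    # One process per job, mirroring the Thread loop in getJobs.
    procs = [Process(target=run_crawl, args=(name,)) for name in job_names]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # exitcode 0 means the child finished without an exception
    return [p.exitcode for p in procs]


if __name__ == "__main__":
    print(process_jobs(["job1", "job2"]))
```

Unlike daemon threads, the child processes are isolated from each other, which is exactly what the one-reactor-per-process restriction requires.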