ipcluster on Sun Grid Engine only has rank 0

Posted 2024-05-21 00:08:47


I set up an IPython parallel ipcluster to use Sun Grid Engine, and everything appears to start fine:

ipcluster start -n 100 --profile=sge

2016-07-15 14:47:09.749 [IPClusterStart] Starting ipcluster with [daemon=False]
2016-07-15 14:47:09.751 [IPClusterStart] Creating pid file: /home/USERNAME/.ipython/profile_sge/pid/ipcluster.pid
2016-07-15 14:47:09.751 [IPClusterStart] Starting Controller with SGEControllerLauncher
2016-07-15 14:47:09.789 [IPClusterStart] Job submitted with job id: u'6354583'
2016-07-15 14:47:10.790 [IPClusterStart] Starting 100 Engines with SGEEngineSetLauncher
2016-07-15 14:47:10.826 [IPClusterStart] Job submitted with job id: u'6354584'
2016-07-15 14:47:40.856 [IPClusterStart] Engines appear to have started successfully

I then connect to it with

import ipyparallel as ipp
rc = ipp.Client(profile='sge')

But when I use the parallel magic

%%px
from mpi4py import MPI

comm = MPI.COMM_WORLD
nprocs = comm.Get_size()
rank = comm.Get_rank()

print('I am #{} of {} and run on {}'.format(rank,nprocs,MPI.Get_processor_name()))

every process reports only rank 0:

[stdout:0] I am #0 of 1 and run on compute-8-13.local
[stdout:1] I am #0 of 1 and run on compute-8-13.local
[stdout:2] I am #0 of 1 and run on compute-3-3.local
[stdout:3] I am #0 of 1 and run on compute-3-3.local
[stdout:4] I am #0 of 1 and run on compute-3-3.local
...

Here are my setup files:


  • ipcluster_config.py

    c.IPClusterEngines.engine_launcher_class = 'SGEEngineSetLauncher'
    c.IPClusterStart.controller_launcher_class = 'SGEControllerLauncher'
    c.SlurmEngineSetLauncher.batch_template_file = '/home/USERNAME/.ipython/profile_sge/sge.engine.template'
    c.SlurmControllerLauncher.batch_template_file = '/home/USERNAME/.ipython/profile_sge/sge.controller.template'
    
  • ipcontroller_config.py

    c.HubFactory.ip = '*'
    
  • sge.controller.template

    #!/bin/sh
    #$ -S /bin/sh
    #$ -pe orte 1
    #$ -q sThC.q
    #$ -cwd
    #$ -N ipyparallel_controller
    #$ -o ipyparallel_controller.log
    #$ -e ipyparallel_controller.err
    module load gcc/5.3/openmpi 
    source activate parallel
    ipcontroller --profile-dir={profile_dir}
    
  • sge.engine.template

    #!/bin/sh
    #$ -S /bin/sh
    #$ -pe orte {n}
    #$ -q sThC.q
    #$ -cwd
    #$ -N ipyparallel_engines
    #$ -o ipyparallel_engines.log
    #$ -e ipyparallel_engines.err
    
    module load gcc/5.3/openmpi
    source activate parallel
    mpiexec -n {n} ipengine --profile-dir={profile_dir} --timeout=30
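To see what would be submitted, the engine template can be previewed with its placeholders filled in. This is only a sketch: `render_template` is a hypothetical helper, the template is shortened, and it assumes plain `str.format` substitution of `{n}` and `{profile_dir}`, which may differ from how ipcluster formats the template internally.

```python
# Shortened copy of the engine template from above (assumption: only the
# {n} and {profile_dir} placeholders need to be filled in).
ENGINE_TEMPLATE = """\
#$ -S /bin/sh
#$ -pe orte {n}
#$ -cwd
mpiexec -n {n} ipengine --profile-dir={profile_dir} --timeout=30
"""


def render_template(template, n, profile_dir):
    """Hypothetical helper: fill the placeholders with plain str.format."""
    return template.format(n=n, profile_dir=profile_dir)


if __name__ == "__main__":
    print(render_template(ENGINE_TEMPLATE, n=100,
                          profile_dir="/home/USERNAME/.ipython/profile_sge"))
```

If the rendered script contains a single `mpiexec -n 100 ...` line, all engines are meant to start inside one MPI job.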
    

1 answer
User
Answer #1 · Posted 2024-05-21 00:08:47

I found the mistake, and the solution, myself:

In ipcluster_config.py I had forgotten to rename some occurrences of Slurm to SGE, so it should read:

c.IPClusterEngines.engine_launcher_class = 'SGEEngineSetLauncher'
c.IPClusterStart.controller_launcher_class = 'SGEControllerLauncher'
c.SGEEngineSetLauncher.batch_template_file = '/home/USERNAME/.ipython/profile_sge/sge.engine.template'
c.SGEControllerLauncher.batch_template_file = '/home/USERNAME/.ipython/profile_sge/sge.controller.template'

Because of this, ipcluster fell back to some default SGE template, which submitted 100 separate jobs instead of one job containing 100 processes.
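Nothing complained about the stray Slurm lines, which is why the mistake was easy to miss. A small check along the following lines could have caught it. This is a sketch, not part of ipyparallel: `find_unused_launcher_settings` is a hypothetical helper, and it assumes the config file only uses simple one-line `c.<Class>.<option> = '<value>'` assignments with bare class names, as in the files above.

```python
import re

# Matches lines such as: c.SlurmEngineSetLauncher.batch_template_file = '...'
_ASSIGN = re.compile(r"^\s*c\.(\w+)\.(\w+)\s*=\s*(.+?)\s*$")


def find_unused_launcher_settings(config_text):
    """Return launcher classes that are configured but never selected.

    A class counts as "selected" when its name is assigned to a
    *_launcher_class option. Any other c.<X>Launcher.<option> line whose
    class X was not selected is reported.
    """
    selected = set()
    configured = []
    for line in config_text.splitlines():
        match = _ASSIGN.match(line)
        if not match:
            continue  # comments, blank lines, anything more complex
        cls, option, value = match.groups()
        if option.endswith("_launcher_class"):
            selected.add(value.strip("'\""))
        elif cls.endswith("Launcher"):
            configured.append(cls)
    return sorted({cls for cls in configured if cls not in selected})


if __name__ == "__main__":
    broken = (
        "c.IPClusterEngines.engine_launcher_class = 'SGEEngineSetLauncher'\n"
        "c.IPClusterStart.controller_launcher_class = 'SGEControllerLauncher'\n"
        "c.SlurmEngineSetLauncher.batch_template_file = '/path/to/sge.engine.template'\n"
    )
    print(find_unused_launcher_settings(broken))
```

Run against the original ipcluster_config.py, this would list the two Slurm launcher classes; against the corrected file it returns an empty list.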

Now I get what I wanted:

[stdout:0] I am #5 of 100 and run on compute-5-17.local
[stdout:1] I am #9 of 100 and run on compute-5-17.local
[stdout:2] I am #1 of 100 and run on compute-5-17.local
[stdout:3] I am #7 of 100 and run on compute-5-17.local
[stdout:4] I am #2 of 100 and run on compute-5-17.local
...
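The same check can be done from the client without reading the printed output by eye. The sketch below is one way to do it, under assumptions: `summarize_mpi_reports`, `rank_and_size` and `check_cluster` are names made up here, and `check_cluster` needs a running cluster with mpi4py installed on the engines.

```python
def summarize_mpi_reports(reports):
    """Classify a list of (rank, size) tuples, one tuple per engine."""
    sizes = {size for _, size in reports}
    ranks = sorted(rank for rank, _ in reports)
    if sizes == {1} and len(reports) > 1:
        return "separate MPI worlds: every engine is rank 0 of 1"
    if sizes == {len(reports)} and ranks == list(range(len(reports))):
        return "one shared MPI world of size {}".format(len(reports))
    return "inconsistent: sizes={}, ranks={}".format(sorted(sizes), ranks)


def rank_and_size():
    """Runs on an engine; the import happens there, not on the client."""
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    return comm.Get_rank(), comm.Get_size()


def check_cluster(profile="sge"):
    """Ask every engine for its rank and world size (needs a live cluster)."""
    import ipyparallel as ipp
    rc = ipp.Client(profile=profile)
    return summarize_mpi_reports(rc[:].apply_sync(rank_and_size))
```

With the broken config, `check_cluster()` should report separate MPI worlds; after the fix it should report one shared world whose size equals the number of engines.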
