mpi4py freezes when calling Merge() and Disconnect()

Posted 2024-04-25 18:49:33

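Why do Merge() and Disconnect() freeze when I try to use mpi4py on CentOS 7? I'm using Python 2.7.5 and mpi4py 2.0.0, and I have to load the openmpi/gnu/1.8.8 module.

I struggled to get this working under CentOS 6, and the only version of MPI that worked for me there was openmpi/gnu/1.6.5. Unfortunately, I don't see that version in the yum repositories for CentOS 7.

Is there some way to trace what is happening inside mpi4py or MPI? Is there some way to get an older version of MPI on CentOS 7?

Here is the code I'm trying to run: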

# mpi_spawn_test.py
import sys

from time import sleep
from mpi4py import MPI

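WORKER_COMMAND = 'worker'   # argv[1] value that marks a spawned worker process
SHOULD_MERGE = False        # True: merge the intercommunicator on both sides
SHOULD_DISCONNECT = False   # True: call Disconnect() before exiting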

def main():
    command = len(sys.argv) > 1 and sys.argv[1] or '1'
    if command != WORKER_COMMAND:
        worker_count = int(command)
        print('launching {} workers.'.format(worker_count))
        comm = MPI.COMM_SELF.Spawn(sys.executable,
                                   args=[sys.argv[0], WORKER_COMMAND],
                                   maxprocs=worker_count)
        print('launched workers.')
        if SHOULD_MERGE:
            comm = comm.Merge()
            print("Merged workers.")
        for i in range(worker_count):
            msg = comm.recv(source=MPI.ANY_SOURCE)
            print("Manager received {}.".format(msg))
        print("Manager finished with fleet size {}.".format(comm.Get_size()))
    else:
        print('worker launched.')
        comm = MPI.Comm.Get_parent()
        print("Got parent.")
        if SHOULD_MERGE:
            comm = comm.Merge()
            print("Merged parent.")
        size = comm.Get_size()
        rank = comm.Get_rank()
        comm.send(rank, dest=0)

        print("Finished worker: rank {} of {}".format(rank, size))

    if SHOULD_DISCONNECT:
        comm.Disconnect()
        print("Finished with command {}.".format(command))

main()

I launch it with this command:

mpiexec -n 1 python mpi_spawn_test.py 3

Then I see this output:

launching 3 workers.
launched workers.
worker launched.
Got parent.
Finished worker: rank 1 of 3
Manager received 1.
worker launched.
Got parent.
worker launched.
Got parent.
Finished worker: rank 2 of 3
Manager received 0.
Finished worker: rank 0 of 3
Manager received 2.
Manager finished with fleet size 1.

If I set SHOULD_DISCONNECT to True, I see one or two "Finished with command worker." messages, and then the process freezes.

If I set SHOULD_MERGE to True, I see the "launched workers." and "Got parent." messages, and then the process freezes.
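
One possible way to see where each process is stuck at the Python level is a watchdog that dumps the stack if a call takes too long. The sketch below is only an illustration, not part of the original script: run_with_hang_dump is a hypothetical helper name, and it assumes Python 3.3+, where faulthandler is in the standard library (on Python 2.7 a separately installed backport would be needed). It only shows the Python frame that is blocked, not what MPI is doing underneath.

```python
import faulthandler


def run_with_hang_dump(func, timeout_seconds, log_path):
    """Call func() and return its result.

    If func() is still running after timeout_seconds, the Python stack of
    every thread is written to log_path. The call itself is NOT interrupted;
    the dump only records where the process was blocked.
    (Hypothetical helper, not part of mpi4py.)
    """
    with open(log_path, "w") as log_file:
        # The watchdog thread writes directly to the file descriptor, so the
        # dump appears even if the main thread never returns.
        faulthandler.dump_traceback_later(
            timeout_seconds, exit=False, file=log_file)
        try:
            return func()
        finally:
            # Disarm the watchdog when the call finishes in time.
            faulthandler.cancel_dump_traceback_later()
```

In the script above, the suspect calls could be wrapped like run_with_hang_dump(comm.Merge, 30, 'merge_hang.log'), ideally with the process ID in the file name so that the manager and each worker write to separate files.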

I got some hints from the MPI debugging page, but I don't really understand the debugging output. As an example, I tried this launch:

mpiexec -mca btl_base_verbose 1 -mca state_base_verbose 1 -n 1 python mpi_spawn_test.py 3

Here is the verbose output:

[octomore:136217] [[12091,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_rsh_module.c:940
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE INIT_COMPLETE AT base/plm_base_launch_support.c:335
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE PENDING ALLOCATION AT base/plm_base_launch_support.c:346
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:437
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:202
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE ALL DAEMONS REPORTED AT plm_rsh_module.c:1053
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE VM READY AT base/plm_base_launch_support.c:190
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE PENDING MAPPING AT base/plm_base_launch_support.c:227
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE MAP COMPLETE AT base/rmaps_base_map_job.c:535
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE PENDING FINAL SYSTEM PREP AT base/plm_base_launch_support.c:253
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE PENDING APP LAUNCH AT base/plm_base_launch_support.c:476
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,1],0] STATE RUNNING AT base/odls_base_default_fns.c:1565
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE LOCAL LAUNCH COMPLETE AT base/odls_base_default_fns.c:1613
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE RUNNING AT base/state_base_fns.c:487
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,1],0] STATE SYNC REGISTERED AT base/odls_base_default_fns.c:1856
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE SYNC REGISTERED AT base/state_base_fns.c:495
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE READY FOR DEBUGGERS AT base/plm_base_launch_support.c:696
[octomore:136219] mca: bml: Using self btl to [[12091,1],0] on node octomore
launching 3 workers.
[octomore:136217] [[12091,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_rsh_module.c:940
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE INIT_COMPLETE AT base/plm_base_launch_support.c:335
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE PENDING ALLOCATION AT base/plm_base_launch_support.c:346
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:437
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:202
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE ALL DAEMONS REPORTED AT plm_rsh_module.c:1053
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE VM READY AT base/plm_base_launch_support.c:190
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE PENDING MAPPING AT base/plm_base_launch_support.c:227
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE MAP COMPLETE AT base/rmaps_base_map_job.c:535
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE PENDING FINAL SYSTEM PREP AT base/plm_base_launch_support.c:253
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE PENDING APP LAUNCH AT base/plm_base_launch_support.c:476
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],0] STATE RUNNING AT base/odls_base_default_fns.c:1565
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],1] STATE RUNNING AT base/odls_base_default_fns.c:1565
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],2] STATE RUNNING AT base/odls_base_default_fns.c:1565
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE LOCAL LAUNCH COMPLETE AT base/odls_base_default_fns.c:1613
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE RUNNING AT base/state_base_fns.c:487
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],0] STATE SYNC REGISTERED AT base/odls_base_default_fns.c:1856
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],1] STATE SYNC REGISTERED AT base/odls_base_default_fns.c:1856
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],2] STATE SYNC REGISTERED AT base/odls_base_default_fns.c:1856
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE SYNC REGISTERED AT base/state_base_fns.c:495
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE READY FOR DEBUGGERS AT base/plm_base_launch_support.c:696
[octomore:136221] mca: bml: Using self btl to [[12091,2],0] on node octomore
[octomore:136222] mca: bml: Using self btl to [[12091,2],1] on node octomore
[octomore:136223] mca: bml: Using self btl to [[12091,2],2] on node octomore
[octomore:136221] mca: bml: Using vader btl to [[12091,2],1] on node octomore
[octomore:136221] mca: bml: Using vader btl to [[12091,2],2] on node octomore
[octomore:136223] mca: bml: Using vader btl to [[12091,2],0] on node octomore
[octomore:136223] mca: bml: Using vader btl to [[12091,2],1] on node octomore
[octomore:136222] mca: bml: Using vader btl to [[12091,2],0] on node octomore
[octomore:136222] mca: bml: Using vader btl to [[12091,2],2] on node octomore
[octomore:136221] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136223] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136222] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136221] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136223] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136223] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136221] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136222] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136222] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],0] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],1] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],2] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],0] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],1] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],2] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],0] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],1] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],2] on node octomore
launched workers.
worker launched.
Got parent.
worker launched.
Got parent.
worker launched.
Got parent.
^C[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],2] STATE IOF COMPLETE AT iof_hnp_read.c:275
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],0] STATE IOF COMPLETE AT iof_hnp_read.c:275
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],1] STATE IOF COMPLETE AT iof_hnp_read.c:275
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,1],0] STATE IOF COMPLETE AT iof_hnp_read.c:275
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],2] STATE COMMUNICATION FAILURE AT oob_tcp_component.c:941
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],0] STATE COMMUNICATION FAILURE AT oob_tcp_component.c:941
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],1] STATE COMMUNICATION FAILURE AT oob_tcp_component.c:941
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,1],0] STATE COMMUNICATION FAILURE AT oob_tcp_component.c:941
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],2] STATE NORMALLY TERMINATED AT base/state_base_fns.c:510
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],0] STATE NORMALLY TERMINATED AT base/state_base_fns.c:510
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],1] STATE NORMALLY TERMINATED AT base/state_base_fns.c:510
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,1],0] STATE NORMALLY TERMINATED AT base/state_base_fns.c:510
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE NORMALLY TERMINATED AT base/state_base_fns.c:535
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE NORMALLY TERMINATED AT base/state_base_fns.c:535
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE NOTIFY COMPLETED AT base/state_base_fns.c:724
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE NOTIFY COMPLETED AT base/state_base_fns.c:724
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE NORMALLY TERMINATED AT base/state_base_fns.c:443
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE NORMALLY TERMINATED AT base/state_base_fns.c:443
[octomore:136217] [[12091,0],0] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT orted/orted_comm.c:446
[octomore:136217] [[12091,0],0] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT orted/orted_comm.c:446
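
To make a log like this easier to read, the state transitions can be grouped by job or process. The following is a minimal sketch that assumes every state line has the shape seen above ("ACTIVATE JOB|PROC <target> STATE <state> AT <file>:<line>"). That shape is inferred from this output only, not from Open MPI documentation, and state_history is a hypothetical helper name.

```python
import re
from collections import OrderedDict

# Pattern inferred from the verbose output above (an assumption, not a
# documented format). search() is used so a leading "^C" is tolerated.
STATE_LINE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\] "
    r"\[\[\d+,\d+\],\d+\] "
    r"ACTIVATE (?P<kind>JOB|PROC) (?P<target>\S+) "
    r"STATE (?P<state>.+?) AT (?P<where>\S+)\s*$"
)


def state_history(lines):
    """Map (kind, target) to the ordered list of states reported for it.

    Lines that are not state transitions (program output, "mca: bml"
    messages) are skipped.
    """
    history = OrderedDict()
    for line in lines:
        match = STATE_LINE.search(line)
        if match is None:
            continue
        key = (match.group("kind"), match.group("target"))
        history.setdefault(key, []).append(match.group("state"))
    return history


def print_summary(lines):
    """Print one line per job or process with its sequence of states."""
    for (kind, target), states in state_history(lines).items():
        print("{} {}: {}".format(kind, target, " -> ".join(states)))
```

Calling print_summary(open('verbose.log')) on a saved copy of the output (the file name is just an example) lists each job and process with its states in order. In the output above, no state change is reported between READY FOR DEBUGGERS for job [12091,2] and the ^C, which suggests, but does not prove, that the launch phase completed and the hang occurs later.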
