How can I run multiple shell jobs in separate threads, executing them all simultaneously, and then wait for all of them to finish?


I'm writing a script that needs to process a lot of data. I've realized that the parallel component of the script doesn't actually help for instances with a large number of individual data points, so I'm going to create temporary files and run them in parallel instead. I'm running this on qsub, so I'll allocate a set number of threads via -pe threaded $N_JOBS (4 in this small example).

My end goal is to launch each process on one of the threads I've allocated, and then wait for all of the jobs to finish before continuing.

However, I've only ever run shell jobs using process = subprocess.Popen followed by process.communicate(). I've had problems in the past using process.wait() because of zombie processes.

How can I modify my run function so that it launches a job without waiting for it to complete, launches the next one, and then, once all the jobs are running, waits for all of them to finish?

Please let me know if this is unclear and I can explain it better. In the example below (maybe a terrible example?), I'd like to use 4 separate threads (I don't know how to set that up; I've only done simple parallelization with joblib.Parallel), where each thread runs the command echo '$THREAD' && sleep 1. The whole thing should therefore take a little over 1 second, not 4 seconds.

I found this post: Python threading multiple bash subprocesses? but I couldn't figure out how to adapt it to my run function.

import sys, subprocess, time 

# Number of jobs
N_JOBS=4

# Run command
def run(
    cmd,
    popen_kws=dict(),
    ):

    # Run
    f_stdout = subprocess.PIPE
    f_stderr = subprocess.PIPE

    # Execute the process
    process_ = subprocess.Popen(cmd, shell=True, stdout=f_stdout, stderr=f_stderr, **popen_kws) 
    # Wait until process is complete and return stdout/stderr
    stdout_, stderr_ = process_.communicate() # Use this .communicate instead of .wait to avoid zombie process that hangs due to defunct. Removed timeout b/c it's not available in Python 2

    # Return code
    returncode_ = process_.returncode

    return {"process":process_, "stdout":stdout_, "stderr":stderr_, "returncode":returncode_}

# Commands
cmds = list(map(lambda x:"echo '{}' && sleep 1".format(x), range(1, N_JOBS+1)))
# ["echo '1'", "echo '2'", "echo '3'", "echo '4'"]

# Start time 
start_time = time.time()
results = dict()
for thread, cmd in enumerate(cmds, start=1):
    # Run command but don't wait for it to finish (Currently, it's waiting to finish)
    results[thread] = run(cmd)

# Now wait until they are all finished
print("These jobs took {} seconds\n".format(time.time() - start_time))
print("Here's the results:", *results.items(), sep="\n")
print("\nContinue with script. .. ...")

# These jobs took 4.067937850952148 seconds

# Here's the results:
# (1, {'process': <subprocess.Popen object at 0x1320766d8>, 'stdout': b'1\n', 'stderr': b'', 'returncode': 0})
# (2, {'process': <subprocess.Popen object at 0x1320547b8>, 'stdout': b'2\n', 'stderr': b'', 'returncode': 0})
# (3, {'process': <subprocess.Popen object at 0x132076ba8>, 'stdout': b'3\n', 'stderr': b'', 'returncode': 0})
# (4, {'process': <subprocess.Popen object at 0x132076780>, 'stdout': b'4\n', 'stderr': b'', 'returncode': 0})

# Continue with script. .. ...
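One direction I've been considering: since Popen itself returns immediately, maybe run could just hand back the process object and the communicate() calls could be deferred until everything has been launched. Here's a rough sketch of that idea (untested; launch is just my run without the blocking communicate()):

import subprocess, time

N_JOBS = 4

def launch(cmd, popen_kws=dict()):
    # Popen returns immediately; nothing blocks here
    return subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, **popen_kws)

cmds = list(map(lambda x:"echo '{}' && sleep 1".format(x), range(1, N_JOBS+1)))

start_time = time.time()

# Launch everything first, without waiting
processes = dict()
for thread, cmd in enumerate(cmds, start=1):
    processes[thread] = launch(cmd)

# Now wait for all of them; communicate() also reaps each child, so no zombies
results = dict()
for thread, process_ in processes.items():
    stdout_, stderr_ = process_.communicate()
    results[thread] = {"stdout":stdout_, "stderr":stderr_, "returncode":process_.returncode}

print("These jobs took {} seconds".format(time.time() - start_time))
# Expected: a little over 1 second instead of ~4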

I've also tried following the multiprocessing documentation at https://docs.python.org/3/library/multiprocessing.html, but adapting it to my situation has been really confusing:

import subprocess, time, multiprocessing

# Number of jobs
N_JOBS = 4

# Run command
def run(
    cmd,
    errors_ok=False,
    popen_kws=dict(),
    ):

    # Run
    f_stdout = subprocess.PIPE
    f_stderr = subprocess.PIPE

    # Execute the process
    process_ = subprocess.Popen(cmd, shell=True, stdout=f_stdout, stderr=f_stderr, **popen_kws) 

    return process_

# Commands
cmds = list(map(lambda x:"echo '{}' && sleep 0.5".format(x), range(1, N_JOBS+1)))
# ["echo '1'", "echo '2'", "echo '3'", "echo '4'"]

# Start time 
start_time = time.time()
results = dict()
for thread, cmd in enumerate(cmds, start=1):
    # Run command but don't wait for it to finish (Currently, it's waiting to finish)
    p = multiprocessing.Process(target=run, args=(cmd,))
    p.start()
    p.join() # join() blocks until this process exits, so the jobs still run one at a time
    results[thread] = p

1 Answer

You're almost there. The simplest way to handle multiprocessing is to use a multiprocessing.Pool object, as shown in the introduction of the multiprocessing documentation, and then map() or starmap() your function over a set of arguments. The big difference between map() and starmap() is that map() assumes the function takes a single argument (so you can pass a simple iterable), while starmap() expects nested iterables of arguments.
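As a quick illustration of the difference (toy functions, purely for demonstration):

from multiprocessing import Pool

def square(x):            # one argument per call -> map()
    return x * x

def power(base, exp):     # several arguments per call -> starmap()
    return base ** exp

if __name__ == "__main__":
    with Pool(2) as p:
        print(p.map(square, [1, 2, 3]))            # [1, 4, 9]
        print(p.starmap(power, [(2, 3), (3, 2)]))  # [8, 9]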

For your example, this works (the run() function is essentially a stub here, although I changed the signature to take a command plus a list of arguments, since passing a single string to a system call is generally a bad idea):

from multiprocessing import Pool

N_JOBS = 4

def run(cmd, *args):
    return cmd + str(args)

cmds = [
    ('echo', 'hello', 1, 3, 4),
    ('ls', '-l', '-r'),
    ('sleep', 3),
    ('pwd', '-P'),
    ('whoami',),
]

results = []
with Pool(N_JOBS) as p:
    results = p.starmap(run, cmds)

for r in results:
    print(r)

There's no need to have the same number of jobs as commands; the subprocesses in the Pool will be reused to run the function as needed.
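If you want the workers to actually execute the commands rather than just echo them back, a run() along these lines could be swapped in (a minimal sketch using subprocess.run, which handles the communicate()-style capture and the reaping internally; the result fields are just illustrative):

import subprocess
from multiprocessing import Pool

N_JOBS = 4

def run(args):
    # subprocess.run waits for the child and reaps it, so no zombies are left
    completed = subprocess.run(
        [str(a) for a in args],   # the command tuples may contain ints
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return {"args": completed.args, "stdout": completed.stdout,
            "stderr": completed.stderr, "returncode": completed.returncode}

cmds = [
    ("echo", "hello", 1, 3, 4),
    ("ls", "-l", "-r"),
    ("sleep", 3),
    ("pwd", "-P"),
    ("whoami",),
]

if __name__ == "__main__":
    with Pool(N_JOBS) as p:
        results = p.map(run, cmds)   # map() here: each tuple is the single argument
    for r in results:
        print(r)

Note that this version uses map() rather than starmap(), because each command tuple is passed to run() as one argument.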
