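How best to share static data between an ipyparallel client and remote engines?

2 answers

User

Answer 1 · edited 2024-06-01 01:14:53

Sometimes you need to scatter your data grouped by a category, so that you can be sure each subgroup is entirely contained in a single cluster.

This is how I usually do it: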

# Connect to the clusters
import ipyparallel as ipp
client = ipp.Client()
lview = client.load_balanced_view()
lview.block = True
CORES = len(client[:])

# Define the scatter_by function
def scatter_by(df, grouper, name='df'):
    # Group labels ordered from the smallest group to the largest
    sz = df.groupby([grouper]).size().sort_values().index.unique()
    for core in range(CORES):
        # Every CORES-th label, starting at `core`, goes to the same cluster
        ids = sz[core::CORES]
        print("Pushing {0} {1}s into cluster {2}...".format(len(ids), grouper, core))
        client[core].push({name: df[df[grouper].isin(ids)]})

# Scatter the dataframe df (an existing pandas DataFrame), grouping by `year`
scatter_by(df, 'year')

Note that the scatter function I am suggesting makes sure each cluster hosts a similar number of observations, which is usually a good idea.
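
The balancing idea behind scatter_by can be tried without a running cluster. The following is a minimal sketch, assuming plain Python lists in place of a DataFrame; assign_groups is a hypothetical helper written for illustration and is not part of ipyparallel or pandas. It sorts the groups by size and deals them out with the same stride slicing, so the resulting loads are similar but not guaranteed to be equal.

```python
from collections import Counter


def assign_groups(labels, n_parts):
    """Deal the distinct labels out to n_parts partitions, ordered by group size."""
    sizes = Counter(labels)
    # Smallest group first, mirroring size().sort_values() in scatter_by.
    # Ties are broken by the label itself so the result is deterministic.
    ordered = sorted(sizes, key=lambda g: (sizes[g], g))
    # Stride slicing: partition k gets groups k, k + n_parts, k + 2 * n_parts, ...
    return [ordered[k::n_parts] for k in range(n_parts)]


# Illustrative data: four "years" with 5, 1, 3 and 7 observations.
years = [2019] * 5 + [2020] * 1 + [2021] * 3 + [2022] * 7

parts = assign_groups(years, 2)
# Number of observations each partition would receive.
loads = [sum(years.count(g) for g in part) for part in parts]

for k, (part, load) in enumerate(zip(parts, loads)):
    print("partition {0}: groups {1}, {2} observations".format(k, part, load))
```

Every group lands in exactly one partition, which is the property the answer relies on. The later partitions always receive the slightly larger groups, so the split is approximate rather than optimal.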

User

Answer 2 · edited 2024-06-01 01:14:53

I used this logic in some code a few years ago, and I got started using this. My code was something like this:

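import ipyparallel as ipp

# Assumption: `engines` is the ipyparallel Client connected to the cluster
engines = ipp.Client()

shared_dict = {
    # big dict with ~10k keys, each with a list of dicts
}

balancer = engines.load_balanced_view()

with engines[:].sync_imports():  # your 'view' variable
    import pandas as pd
    import ujson as json

# push() takes a {name: value} mapping, so wrap the dict to make it
# available on every engine under the name shared_dict
engines[:].push({'shared_dict': shared_dict})

# `id` is the iterable of task inputs and my_func the task function;
# both are defined elsewhere, and my_func must exist on the engines
results = balancer.map(lambda i: (i, my_func(i)), id)
results_data = results.get()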

"If simulation counts are small (~50), then it takes a while to get started, but I start to see progress print statements. Strangely, multiple tasks will get assigned to the same engine and I don't see a response until all of those assigned tasks are completed for that engine. I would expect to see a response from enumerate(ar) every time a single simulation task completes."

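In my case, my_func() was a complex method in which I wrote lots of logging messages to a file, so that is where my print statements went.

As for task assignment, since I used load_balanced_view(), I left it to the library to find its own way, and it did a good job.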

"If simulation counts are large (~1000), it takes a long time to get started, I see the CPUs throttle up on all engines, but no progress print statements are seen until a long time (~40 mins), and when I do see progress, it appears a large block (>100) of tasks went to the same engine, and awaited completion from that one engine before providing some progress. When that one engine did complete, I saw the ar object provided new responses every 4 secs - this may have been the time delay to write the output pickle files."

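I have not experienced anything like that in a long time, so I can't say anything about it.

I hope this sheds some light on your problem.

PS: As I said in the comments, you could try multiprocessing.Pool. I don't think I have ever tried sharing a big, read-only piece of data as a global variable with it. I would give it a try, because it seems to work.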
