Expanding a list-like column across multiple columns in a dask DataFrame

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'a': [np.random.randint(100, size=4) for _ in range(20)]})
dask_df = dd.from_pandas(df, chunksize=10)
dask_df['a'].compute()

0     [52, 38, 59, 78]
1     [79, 71, 13, 63]
2     [15, 81, 79, 76]
3      [53, 4, 94, 62]
4     [91, 34, 26, 92]
5      [96, 1, 69, 27]
6     [84, 91, 96, 68]
7     [93, 56, 45, 40]
8      [54, 1, 96, 76]
9      [27, 11, 79, 7]
10    [27, 60, 78, 23]
11    [56, 61, 88, 68]
12    [81, 10, 79, 65]
13     [34, 49, 30, 3]
14    [32, 46, 53, 62]
15    [20, 46, 87, 31]
16      [89, 9, 11, 4]
17    [26, 46, 19, 27]
18    [79, 44, 45, 56]
19    [22, 18, 31, 90]
Name: a, dtype: object

print(type(dask_df.values), type(x))
<class 'dask.array.core.Array'> <class 'dask.array.core.Array'>

print(type(dask_df.values.compute()[0]), type(x.compute()[0]))
<class 'numpy.ndarray'> <class 'numpy.ndarray'>

dask_groups = dask_df.explode('a').reset_index().groupby('index')
final_df = []
for idx in dask_df.index.values.compute():
    group = dask_groups.get_group(idx).drop(columns='index').compute()
    group_size = list(range(len(group)))
    row = group.transpose()
    row.columns = group_size
    row['index'] = idx
    final_df.append(dd.from_pandas(row, chunksize=10))
final_df = dd.concat(final_df).set_index('index')

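3 Answers

User

#1 · edited 2024-06-02 09:07:35

I have a working solution. My original function created a list, which is what produced the list column shown above. Changing the applied function to return a dask bag instead seems to do the trick: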

import random

import numpy as np
import pandas as pd
import dask.bag as db
import dask.dataframe as dd

def create_df_row(x):
    vals = np.random.randint(2, size=4)
    return db.from_sequence([vals], partition_size=2).to_dataframe()

test_df = dd.from_pandas(pd.DataFrame({'a':[random.choice(['a', 'b', 'c']) for _ in range(20)]}), chunksize=10)
test_df.head()

mini_dfs = [*test_df.groupby('a')['a'].apply(lambda x: create_df_row(x))]
result = dd.concat(mini_dfs)
result.compute().head()

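However, I'm not sure this solves the memory problem, since I am now holding a list of groupby results.

User

#2 · edited 2024-06-02 09:07:35

Here is how to manually expand a list-like column across multiple columns: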

dask_df["a0"] = dask_df["a"].str[0]
dask_df["a1"] = dask_df["a"].str[1]
dask_df["a2"] = dask_df["a"].str[2]
dask_df["a3"] = dask_df["a"].str[3]

print(dask_df.head())

                  a  a0  a1  a2  a3
0   [71, 16, 0, 10]  71  16   0  10
1  [59, 65, 99, 74]  59  65  99  74
2  [83, 26, 33, 38]  83  26  33  38
3   [70, 5, 19, 37]  70   5  19  37
4    [0, 59, 4, 80]   0  59   4  80

Sultan Orazbayev's answer seems more elegant.

User

#3 · edited 2024-06-02 09:07:35

In this case dask does not know what to expect from the result, so it is best to specify meta explicitly:


# this is a shortcut that reuses the existing pandas df;
# in actual code it is sufficient to provide an empty
# DataFrame with the expected columns and dtypes
meta = df['a'].apply(pd.Series)

new_dask_df = dask_df['a'].apply(pd.Series, meta=meta)
new_dask_df.compute()
