How do I specify the 'meta' argument in .apply() on a Dask DataFrame?

Asked 2025-04-14 18:03

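I have a Dask DataFrame that records the favourite snacks of three pets: Olive, George and Maggie. Three of the four columns contain repeated values; only the fourth column, snack, is unique per row. The age column is an integer and the others are strings.

Input: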

  pet_name  species  age           snack
0    Olive      cat    7          yogurt
1    Olive      cat    7         chicken
2   George  hamster    1      strawberry
3   George  hamster    1  sunflower seed
4   George  hamster    1        cucumber
5   Maggie      dog   12   peanut butter

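I want to group by the first three columns, aggregate the snack column into a list, sort by age, and reset the index, so that I end up with one row per pet containing its list of favourite snacks, like this:

Expected output: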

  pet_name  species  age                               snack
0   George  hamster    1  strawberry,sunflower seed,cucumber
1    Olive      cat    7                      yogurt,chicken
2   Maggie      dog   12                       peanut butter

I am using groupby.apply(), which mostly works, but I am having trouble writing Dask's meta argument.

ddf = ddf.groupby(by=group_cols)['snack'].apply(','.join, meta=??).reset_index()

I am using Dask 2024.2.1 and Pandas 2.2.1.

Input:

# import packages
import dask
# silence the warning that recommends installing dask-expr
dask.config.set({'dataframe.query-planning-warning': False}) 
import dask.dataframe as dd
import pandas as pd

# create toy Pandas df
d = {'pet_name': ['Olive', 'Olive', 'George', 'George','George','Maggie'], 'species': ['cat', 'cat', 'hamster', 'hamster', 'hamster', 'dog'], 'age': [7,7,1,1,1,12], 'snack': ['yogurt', 'chicken', 'strawberry', 'sunflower seed', 'cucumber', 'peanut butter']}
df = pd.DataFrame(data=d)

# import to Dask df
ddf = dd.from_pandas(df, npartitions=3)

# group by all columns except 'snack', join 'snack' values into one comma-separated string, and reset index
group_cols = [x for x in ddf.columns if x!='snack']
ddf = ddf.groupby(by=group_cols)['snack'].apply(','.join).reset_index()

# sort by 'age'
ddf = ddf.sort_values("age")

# print result
print(ddf.compute())

Expected output:

  pet_name  species  age                               snack
0   George  hamster    1  strawberry,sunflower seed,cucumber
1    Olive      cat    7                      yogurt,chicken
2   Maggie      dog   12                       peanut butter

Actual output:

runfile('/home/madeline/.config/spyder-py3/temp.py', wdir='/home/madeline/.config/spyder-py3')
/home/madeline/.config/spyder-py3/temp.py:24: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  ddf = ddf.groupby(by=group_cols)['snack'].apply(','.join).reset_index()
  pet_name  species  age                               snack
0   George  hamster    1  cucumber,strawberry,sunflower seed
0    Olive      cat    7                      yogurt,chicken
0   Maggie      dog   12                       peanut butter

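In this simple example the output is acceptable, apart from the index being all zeros, but I get a warning that meta was not specified. How should I specify meta?

What I have tried:

These are the ways I have tried to specify meta:

Attempts 1 and 2: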

Meta 1:

meta=pd.DataFrame({'pet_name': str, 'species': str, 'age': int, 'snack': str}, index=[0])

Meta 2:

meta={'pet_name': 'f8', 'species': 'f8', 'age': 'f8', 'snack': 'f8'}

Error from attempts 1 and 2:

runfile('/home/madeline/.config/spyder-py3/temp.py', wdir='/home/madeline/.config/spyder-py3')
Traceback (most recent call last):

  File "/home/madeline/.config/spyder-py3/temp.py", line 26, in <module>
    ddf = ddf.sort_values("age")

ValueError: cannot insert name, already exists

Attempt 3:

Meta 3:

meta = pd.DataFrame(columns=['pet_name', 'species', 'age', 'snack'], dtype=object)

Error from attempt 3:

runfile('/home/madeline/.config/spyder-py3/temp.py', wdir='/home/madeline/.config/spyder-py3')
Traceback (most recent call last):

  File "/home/madeline/.config/spyder-py3/temp.py", line 30, in <module>
    ddf = ddf.sort_values("age")

AttributeError: 'DataFrame' object has no attribute 'name'

1 Answer


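Because you are calling apply on a Series, meta should be a Series or a tuple. The catch is that reset_index() turns the result back into a DataFrame, and I have not found a way to tell Dask what that DataFrame will contain. As a result, the sort_values step does not work on the Dask DataFrame; in the code below it runs on the pandas result after compute():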

# import packages
import dask
# silence the warning that recommends installing dask-expr
dask.config.set({'dataframe.query-planning-warning': False}) 
import dask.dataframe as dd
import pandas as pd

# create toy Pandas df
d = {'pet_name': ['Olive', 'Olive', 'George', 'George','George','Maggie'], 'species': ['cat', 'cat', 'hamster', 'hamster', 'hamster', 'dog'], 'age': [7,7,1,1,1,12], 'snack': ['yogurt', 'chicken', 'strawberry', 'sunflower seed', 'cucumber', 'peanut butter']}
df = pd.DataFrame(data=d)

# import to Dask df
ddf = dd.from_pandas(df, npartitions=3)

# group by all columns except 'snack', join 'snack' values into one comma-separated string, and reset index
group_cols = [x for x in ddf.columns if x!='snack']
ddf = ddf.groupby(by=group_cols)['snack'].apply(','.join, meta=('snack', 'object')).reset_index()

# print result
print(ddf.compute().sort_values("age"))
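
One way to work out a suitable meta is to run the same groupby/apply on a small pandas sample and look at what comes back. The sketch below is pandas-only and assumes the sample is representative of the full data. The names sample, joined, meta_as_tuple, meta_as_series and tidy are illustrative and do not come from the code above. It also shows one possible way to tidy the computed result so the index runs from 0 to n-1 instead of repeating 0.

```python
import pandas as pd

# Small pandas sample with the same columns as the question's data.
sample = pd.DataFrame({
    "pet_name": ["Olive", "Olive", "George", "George", "George", "Maggie"],
    "species": ["cat", "cat", "hamster", "hamster", "hamster", "dog"],
    "age": [7, 7, 1, 1, 1, 12],
    "snack": ["yogurt", "chicken", "strawberry", "sunflower seed",
              "cucumber", "peanut butter"],
})
group_cols = [c for c in sample.columns if c != "snack"]

# Run the same groupby/apply in pandas and inspect the result.
joined = sample.groupby(group_cols)["snack"].apply(",".join)
print(type(joined).__name__, joined.name, joined.dtype)

# The result is a Series, so meta only needs one name and one dtype.
meta_as_tuple = (joined.name, joined.dtype)
# The same information expressed as an empty Series.
meta_as_series = pd.Series(name=joined.name, dtype=joined.dtype)

# Once the data is back in pandas (as it is after .compute()),
# sort by age and renumber the index from 0.
tidy = joined.reset_index().sort_values("age").reset_index(drop=True)
print(tidy)
```

Either meta_as_tuple or meta_as_series could then be passed as meta= in the Dask call, in place of the hand-written ('snack', 'object'). The four-column DataFrame metas from the question describe the frame that exists only after reset_index(), not the Series that apply itself returns, which is probably why they did not fit.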
