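How do I specify the 'meta' argument in .apply() on a Dask dataframe?
I have a Dask dataframe that records the favorite snacks of three pets: Olive, George, and Maggie. Three of the four columns contain repeated values; only the fourth column, snack, is unique. The age column holds integers and all the other columns hold strings.
Input: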
pet_name species age snack
0 Olive cat 7 yogurt
1 Olive cat 7 chicken
2 George hamster 1 strawberry
3 George hamster 1 sunflower seed
4 George hamster 1 cucumber
5 Maggie dog 12 peanut butter
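I want to group by the first three columns, collect the snack column into a comma-separated list, sort by age, and reset the index, so that I end up with one row per pet containing the list of its favorite snacks, like this:
Expected output: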
pet_name species age snack
0 George hamster 1 strawberry,sunflower seed,cucumber
1 Olive cat 7 yogurt,chicken
2 Maggie dog 12 peanut butter
I am using groupby.apply(), which works for the most part, but I am having trouble writing the meta argument for Dask.
ddf = ddf.groupby(by=group_cols)['snack'].apply(','.join, meta=??).reset_index()
I am using Dask 2024.2.1 and Pandas 2.2.1.
Input:
# import packages
import dask
# silence recommending of dask-exp install
dask.config.set({'dataframe.query-planning-warning': False})
import dask.dataframe as dd
import pandas as pd
# create toy Pandas df
d = {'pet_name': ['Olive', 'Olive', 'George', 'George','George','Maggie'], 'species': ['cat', 'cat', 'hamster', 'hamster', 'hamster', 'dog'], 'age': [7,7,1,1,1,12], 'snack': ['yogurt', 'chicken', 'strawberry', 'sunflower seed', 'cucumber', 'peanut butter']}
df = pd.DataFrame(data=d)
# import to Dask df
ddf = dd.from_pandas(df, npartitions=3)
# groupby all columns except 'snack', make 'snack' into list, and reset index
group_cols = [x for x in ddf.columns if x!='snack']
ddf = ddf.groupby(by=group_cols)['snack'].apply(','.join).reset_index()
# sort by 'age'
ddf = ddf.sort_values("age")
# print result
print(ddf.compute())
Expected output:
pet_name species age snack
0 George hamster 1 strawberry,sunflower seed,cucumber
1 Olive cat 7 yogurt,chicken
2 Maggie dog 12 peanut butter
Actual output:
runfile('/home/madeline/.config/spyder-py3/temp.py', wdir='/home/madeline/.config/spyder-py3')
/home/madeline/.config/spyder-py3/temp.py:24: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
ddf = ddf.groupby(by=group_cols)['snack'].apply(','.join).reset_index()
pet_name species age snack
0 George hamster 1 cucumber,strawberry,sunflower seed
0 Olive cat 7 yogurt,chicken
0 Maggie dog 12 peanut butter
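In this simple example the output is fine, except that every index value is zero, but I get a warning that I did not specify the meta argument. So how do I specify meta?
What I have tried:
I have tried specifying meta in the following ways:
Attempts 1 and 2: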
Meta 1:
meta=pd.DataFrame({'pet_name': str, 'species': str, 'age': int, 'snack': str}, index=[0])
Meta 2:
meta={'pet_name': 'f8', 'species': 'f8', 'age': 'f8', 'snack': 'f8'}
Error for attempts 1 and 2:
runfile('/home/madeline/.config/spyder-py3/temp.py', wdir='/home/madeline/.config/spyder-py3')
Traceback (most recent call last):
File "/home/madeline/.config/spyder-py3/temp.py", line 26, in <module>
ddf = ddf.sort_values("age")
ValueError: cannot insert name, already exists
Attempt 3:
Meta 3:
meta = pd.DataFrame(columns=['pet_name', 'species', 'age', 'snack'], dtype=object)
Error for attempt 3:
runfile('/home/madeline/.config/spyder-py3/temp.py', wdir='/home/madeline/.config/spyder-py3')
Traceback (most recent call last):
File "/home/madeline/.config/spyder-py3/temp.py", line 30, in <module>
ddf = ddf.sort_values("age")
AttributeError: 'DataFrame' object has no attribute 'name'
1 Answer
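Because you are calling apply on a Series, meta should describe a Series: pass either a Series object or a (name, dtype) tuple. The catch is that reset_index() takes you back to a DataFrame, and I did not find a way to tell Dask what that DataFrame will contain, so the sort_values step did not work on the Dask dataframe. In the code below, sort_values is therefore applied to the pandas result after compute():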
# import packages
import dask
# silence recommending of dask-exp install
dask.config.set({'dataframe.query-planning-warning': False})
import dask.dataframe as dd
import pandas as pd
# create toy Pandas df
d = {'pet_name': ['Olive', 'Olive', 'George', 'George','George','Maggie'], 'species': ['cat', 'cat', 'hamster', 'hamster', 'hamster', 'dog'], 'age': [7,7,1,1,1,12], 'snack': ['yogurt', 'chicken', 'strawberry', 'sunflower seed', 'cucumber', 'peanut butter']}
df = pd.DataFrame(data=d)
# import to Dask df
ddf = dd.from_pandas(df, npartitions=3)
# groupby all columns except 'snack', make 'snack' into list, and reset index
group_cols = [x for x in ddf.columns if x!='snack']
ddf = ddf.groupby(by=group_cols)['snack'].apply(','.join, meta=('snack', 'object')).reset_index()
# print result
print(ddf.compute().sort_values("age"))
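To see why a Series-shaped meta fits here, it can help to run the same groupby in plain pandas and inspect what the apply step returns before reset_index(). The following is a minimal sketch using only pandas and the toy data from the question; the names joined, meta_series, and result are illustrative and not part of the original code. It assumes that the pandas result has the same shape as what each Dask partition produces.

```python
import pandas as pd

# Same toy data as in the question.
df = pd.DataFrame({
    'pet_name': ['Olive', 'Olive', 'George', 'George', 'George', 'Maggie'],
    'species': ['cat', 'cat', 'hamster', 'hamster', 'hamster', 'dog'],
    'age': [7, 7, 1, 1, 1, 12],
    'snack': ['yogurt', 'chicken', 'strawberry', 'sunflower seed',
              'cucumber', 'peanut butter'],
})
group_cols = [c for c in df.columns if c != 'snack']

# Plain pandas: what does the apply step return before reset_index()?
joined = df.groupby(by=group_cols)['snack'].apply(','.join)
print(type(joined).__name__)     # a Series, not a DataFrame
print(joined.name)               # the Series keeps the name 'snack'
print(list(joined.index.names))  # the group columns live in the index

# meta only has to describe that Series: its name and dtype.
# An empty Series is an alternative spelling of meta=('snack', 'object').
meta_series = pd.Series(name='snack', dtype='object')

# reset_index() is what turns the index levels back into columns.
result = joined.reset_index().sort_values('age').reset_index(drop=True)
print(result)
```

Since the group columns are index levels rather than columns at the apply step, a DataFrame-shaped meta that already lists pet_name, species, and age as columns is likely what collides with reset_index() in attempts 1 and 2. Either meta_series or the tuple ('snack', 'object') can be passed as meta in the Dask call above.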