Using 'apply' in Pandas (with an externally defined function)

1 vote

3 answers

4176 views

Asked 2025-04-18 07:34

I have a DataFrame called table that looks like this:

year name     prop     sex  soundex
1880 John     0.081541 boy  J500
1880 William  0.080511 boy  W450
....
2008 Elianna  0.000127 girl E450

I want to group table by the 'year' column and select certain positions from the 'name' column within each group.

My code is as follows (assume special_indices is already defined):

def get_indices_func(x):
    name = [x['name'].iloc[y] for y in special_indices]
    return pd.Series(name)


table.groupby(by='year').apply(get_indices_func)

But I get the following error:

/Users/***/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/index.pyc in get_value(self, series, key)
    722         """
    723         try:
--> 724             return self._engine.get_value(series, key)
    725         except KeyError, e1:
    726             if len(self) > 0 and self.inferred_type == 'integer':

KeyError: 1000

What am I doing wrong? I suspect I don't fully understand how apply (and the related aggregate and agg) works. I'd be very grateful if someone could explain!


3 Answers

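Try this: define a function that sorts each group by prop in descending order and returns the first row of the sorted result, i.e. the row with the highest prop. sort_values returns a new DataFrame, so the group itself is left unmodified and no explicit copy is needed. Then group the data by year and pass the function to .apply, which hands each group to the function as a whole DataFrame (.agg typically calls the function on one column at a time, where sorting by 'prop' is not possible).

def get_most_popular(x):
    # sort_values returns a sorted copy; highest prop comes first
    y = x.sort_values('prop', ascending=False)
    # first row of the sorted group = most popular name of that year
    return y.iloc[0]

df.groupby('year').apply(get_most_popular)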

Answered 2025-04-18 by Python大师


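I want to find the most popular name for each year. Is there a clever way to do this?

There is in fact a way to do this without sorting. Suppose you have a DataFrame like this: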

In [5]: df
Out[5]: 
   year     name      prop   sex soundex
0  1880     John  0.081541   boy    J500
1  1880  William  0.080511   boy    W450
2  2008  Elianna  0.000127  girl    E450

[3 rows x 5 columns]

You can group by year, pull out the 'prop' column, use idxmax to find the index label of the maximum in each group, and then select those rows with loc:

In [15]: df.loc[df.groupby('year')['prop'].idxmax()]
Out[15]: 
   year     name      prop   sex soundex
0  1880     John  0.081541   boy    J500
2  2008  Elianna  0.000127  girl    E450

[2 rows x 5 columns]

In [19]: df['name'].loc[df.groupby('year')['prop'].idxmax()]
Out[19]: 
0       John
2    Elianna
Name: name, dtype: object

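Note that this label lookup with loc only works if df has a unique index. If it does not, make the index unique first (reset_index returns a new DataFrame, so assign the result back):

df = df.reset_index(drop=True)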
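
To make the unique-index requirement concrete, here is a minimal sketch. The frame and its repeated index labels are made up for illustration (for example, what you might get after concatenating two tables), and the names best_rows and top_names are hypothetical.

```python
import pandas as pd

# Hypothetical frame whose index labels repeat (label 0 appears twice).
df = pd.DataFrame(
    {
        "year": [1880, 1880, 2008],
        "name": ["John", "William", "Elianna"],
        "prop": [0.081541, 0.080511, 0.000127],
    },
    index=[0, 0, 1],
)

# Label 0 is ambiguous here, so rebuild a unique 0..n-1 index first.
df = df.reset_index(drop=True)

# idxmax returns the index label of the largest prop within each year;
# loc then selects exactly one row per label.
best_rows = df.loc[df.groupby("year")["prop"].idxmax()]
top_names = best_rows["name"].tolist()
```

Passing drop=True discards the old, ambiguous labels instead of keeping them as an extra column.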

Also note that finding a maximum is O(n), whereas sorting is O(n log n). Even on small DataFrames the difference in speed is noticeable:

In [125]: %timeit df[['year', 'name']].loc[df.groupby('year')['prop'].idxmax()]
1000 loops, best of 3: 1.07 ms per loop

In [126]: %timeit df.groupby('year').apply(lambda x: x.sort_values('prop', ascending=False).iloc[0]['name'])
100 loops, best of 3: 2.14 ms per loop

The benchmark was run on this DataFrame:

In [131]: df
Out[131]: 
   year     name      prop   sex soundex
0  2008        A  0.000027  girl    E450
1  1880     John  0.081541   boy    J500
2  2008        B  0.000027  girl    E450
3  2008  Elianna  0.000127  girl    E450
4  1880  William  0.080511   boy    W450
5  2008        C  0.000027  girl    E450
6  1880        D  0.080511   boy    W450

[7 rows x 5 columns]
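
If you want to check that the two approaches agree before timing them yourself, the following sketch rebuilds the benchmark table above (without the sex and soundex columns, which neither approach uses) and compares the results. The variable names via_idxmax and via_sort are only illustrative.

```python
import pandas as pd

# The seven-row benchmark table, reduced to the columns that matter here.
df = pd.DataFrame({
    "year": [2008, 1880, 2008, 2008, 1880, 2008, 1880],
    "name": ["A", "John", "B", "Elianna", "William", "C", "D"],
    "prop": [0.000027, 0.081541, 0.000027, 0.000127,
             0.080511, 0.000027, 0.080511],
})

# Linear scan per group: look up the label of the maximum directly.
via_idxmax = (
    df.loc[df.groupby("year")["prop"].idxmax()]
      .set_index("year")["name"]
)

# Sort per group, then take the name in the first row.
via_sort = df.groupby("year")[["name", "prop"]].apply(
    lambda g: g.sort_values("prop", ascending=False).iloc[0]["name"]
)

same_answer = via_idxmax.to_dict() == via_sort.to_dict()
```

Both expressions can be wrapped in %timeit (or the timeit module) to repeat the comparison on your own data; absolute timings will depend on your pandas version and hardware.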

Answered 2025-04-18 by Python大师


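Another solution:

df.groupby('year').apply(lambda x: x.sort_values('prop', ascending=False).iloc[0]['name'])

What is happening here?

First, as Woody did, we group by the right column. apply() then passes each group's data to the function you supply. To make this easier to follow, I could also have written it as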

def takeAGroupAndGiveBackMax(group):
    # year-level data: sort it by prop, descending (returns a new frame)
    ordered = group.sort_values('prop', ascending=False)
    # now return the value of 'name' in the first entry
    return ordered.iloc[0]['name']

# The following gives you a result indexed by whatever you grouped on (here: year), with one column for each value you return.
df.groupby('year').apply(takeAGroupAndGiveBackMax)

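To get a feel for this, play around with the function. Try returning several columns or several rows and see what apply() gives you back. It really is a powerful tool that pandas gives you.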
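
As a starting point for that experiment, here is a small sketch showing how the shape of the result depends on what the function returns. The three-row frame mirrors the example table used earlier; the names as_scalar, as_series, as_frame, top_name and n_rows are made up for illustration. Selecting the non-key columns before apply is a choice made here so that each group is a plain two-column DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [1880, 1880, 2008],
    "name": ["John", "William", "Elianna"],
    "prop": [0.081541, 0.080511, 0.000127],
})

# Each group handed to the function is a DataFrame with these two columns.
groups = df.groupby("year")[["name", "prop"]]

# Scalar per group -> Series indexed by year.
as_scalar = groups.apply(lambda g: g["prop"].max())

# Series per group -> DataFrame with one row per year and
# one column per entry of the returned Series.
as_series = groups.apply(
    lambda g: pd.Series({
        "top_name": g.loc[g["prop"].idxmax(), "name"],
        "n_rows": len(g),
    })
)

# DataFrame per group -> rows stacked under a (year, original index) index.
as_frame = groups.apply(lambda g: g.head(1))
```

Printing each of the three results side by side is a quick way to see how apply() reassembles the per-group outputs.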

Answered 2025-04-18 by Python大师

