Getting a column average in a groupby clause

Posted 2024-06-07 19:17:31


I have a dataset of actors and directors, along with the popularity of the movies they worked on together.

print (actors_director_df.head(3))

                 actor         director  popularity counter
0          Chris Pratt  Colin Trevorrow   32.985763       0
1  Bryce Dallas Howard  Colin Trevorrow   32.985763       0
2          Irrfan Khan  Colin Trevorrow   32.985763       0

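I want to group by actor and director, because an actor/director pair can work together on several movies. I managed to do this with the query below.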

actor_director_grouped = actors_director_df.groupby(['actor','director']) \
                         .size() \
                         .reset_index(name='count') \
                         .sort_values(['count'], ascending=False) \
                         .head(10)

print (actor_director_grouped)

                      actor            director  count
3619         Clint Eastwood      Clint Eastwood     14
19272           Woody Allen         Woody Allen     12
9606            Johnny Depp          Tim Burton      8

But in this DataFrame the popularity column is gone.

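What I want is an average popularity column after the groupby, so that each actor/director pair is shown with its average popularity and the number of movies they made together.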

In other words, my ideal output would look like this.

                      actor            director  popularity  count
3619         Clint Eastwood      Clint Eastwood   32.985763     14
19272           Woody Allen         Woody Allen   5.1231231     12
9606            Johnny Depp          Tim Burton   3.1231231      8

2 Answers

Looking at the DataFrame, the counter column seems unnecessary. Let's use the popularity column instead and create a mean column and a count column:

import pandas as pd
import numpy as np

np.random.seed(444)

names = [
    'Robert Baratheon',
    'Jon Snow',
    'Daenerys Targaryen',
    'Theon Greyjoy',
    'Tyrion Lannister'
]

df = pd.DataFrame({
    'actor': np.random.choice(names, size=10, p = [0.2,0.2,0.2,0.1,0.3]),
    'director': np.random.choice(names, size=10, p = [0.4,0.1,0.1,0.1,0.3]),
    'popularity': np.random.randint(0,100, size=10),
    'counter': 0
})

df2 = df.groupby(['actor','director'])['popularity']\
        .agg(['count', 'mean'])\
        .reset_index()\
        .sort_values(by='mean', ascending=False)

print(df2)

Returns:

              actor          director  count  mean
0          Jon Snow  Robert Baratheon      2  53.5
5  Tyrion Lannister  Tyrion Lannister      2  49.0
2  Robert Baratheon  Tyrion Lannister      2  48.5
1  Robert Baratheon          Jon Snow      2  40.5
4     Theon Greyjoy  Tyrion Lannister      1  13.0
3     Theon Greyjoy  Robert Baratheon      1   7.0
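To get column names that match the layout asked for in the question (popularity and count), one option is pandas named aggregation. The following is a minimal sketch, assuming pandas 0.25 or later; the sample rows and the name summary are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    'actor': ['A', 'A', 'B', 'A'],
    'director': ['X', 'X', 'X', 'Y'],
    'popularity': [10.0, 20.0, 5.0, 7.0],
})

# Named aggregation: each keyword becomes an output column,
# built from an (input column, aggregation function) pair
summary = (
    df.groupby(['actor', 'director'])
      .agg(popularity=('popularity', 'mean'),
           count=('popularity', 'size'))
      .reset_index()
      .sort_values('count', ascending=False)
      .head(10)
)

print(summary)
```

Sorting by count and taking head(10) mirrors the query in the question; sort by popularity instead if the most popular pairs are what you want.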

I took the liberty of adding some dummy data, which should make the groupby clause easier to understand.

print(df)

Output:

                   actor           director  popularity  counter
0           Chris Pratt    Colin Trevorrow   32.985763        0
1   Bryce Dallas Howard    Colin Trevorrow   32.985763        0
2           Irrfan Khan    Colin Trevorrow   32.985763        0
3           Irrfan Khan    Colin Trevorrow   60.000000       12
4           Irrfan Khan       John Markson   10.000000       10
5           Irrfan Khan       Mark Johnson  100.000000        4

Then you need to group by actor and director, take the mean of popularity, and use the size of each group as count.

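# Group once, then derive the row count and the mean popularity per pair
g = df.groupby(['actor', 'director'])
count = g.size().reset_index(name='count')
grp = g['popularity'].mean().reset_index()
# merge joins on the shared columns (actor, director)
print(grp.merge(count))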

输出:

                 actor         director  popularity  count
0  Bryce Dallas Howard  Colin Trevorrow   32.985763      1
1          Chris Pratt  Colin Trevorrow   32.985763      1
2          Irrfan Khan  Colin Trevorrow   46.492881      2
3          Irrfan Khan     John Markson   10.000000      1
4          Irrfan Khan     Mark Johnson  100.000000      1
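One detail worth keeping in mind when choosing between size and count: size counts every row in a group, while count skips missing values. The sketch below uses made-up data with one NaN popularity to show the difference; it is an illustration, not part of either answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing popularity value
df = pd.DataFrame({
    'actor': ['A', 'A', 'B'],
    'director': ['X', 'X', 'X'],
    'popularity': [10.0, np.nan, 5.0],
})

g = df.groupby(['actor', 'director'])['popularity']

# 'size' includes the NaN row, 'count' and 'mean' ignore it
stats = g.agg(['size', 'count', 'mean']).reset_index()

print(stats)
```

If the number of movies should include rows where popularity is missing, size is the safer choice; if only rated movies should be counted, use count.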
