Getting total event counts for different groups in a large pandas DataFrame

Posted 2024-04-20 04:09:51


I have a structure like this:

country product     date_install    date_purchase   user_id
BR      yearly      2020-11-01      2020-11-01      10660236
CA      monthly     2020-11-01      2020-11-01      10649441
US      yearly      2020-11-01      trialed         10660272
IT      monthly     2020-11-01      2020-11-01      10657634
AE      monthly     2020-11-01      2020-11-01      10661442
IT      monthly     2020-11-01      trialed         10657634
AE      monthly     2020-11-01      trialed         10661442

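I am trying to get the purchase/install ratio, along with the actual number of installs and purchases, for each country, product, and date. date_install is the install date. date_purchase holds the purchase date when a purchase took place; the value trialed in date_purchase means that the user with that user_id made no purchase.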

The desired output should look like this:

country product     date_install        installs    purchases       ratio
US      daily       2021-02-05          100         20              0.2
US      monthly     2021-02-05          100         50              0.5
US      yearly      2021-02-05          100         50              0.5             
US      trialed     2021-02-05          100         0               0    
# the next day
US      daily       2021-02-06          500         50              0.1
US      monthly     2021-02-06          500         100             0.2
US      yearly      2021-02-06          500         250             0.5             
US      trialed     2021-02-06          500         0               0    
# the rest of the countries & the rest of the days

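installs is a single number: the total installs for that day, country & product. purchases is the total number of date_purchase events for that day, country & product.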

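The idea is that, for a given country and day X, people installed an app; some of them purchased a product and others did not. Those who purchased have a date in date_purchase, and those who did not have the value trialed. The total number of people who installed the app, however, is the count of date_install for each country, product & install date.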

What I tried:

exp = (df.groupby(['country','product','date_install']).count()
         .sort_values('date_install',ascending=False).reset_index())

exp.groupby(['country','product','date_install'])['date_purchase'].sum().reset_index()
exp['total_installs'] = exp.groupby(['country','product','date_install'])['date_purchase'].sum().reset_index()

But I get an error:

ValueError: Wrong number of items passed 4, placement implies 1
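
The error most likely comes from the last line of the attempt: groupby(...)[...].sum().reset_index() returns a DataFrame with four columns (the three keys plus the summed value), and that whole frame is assigned to the single column exp['total_installs']. A minimal sketch on made-up rows (the data here is illustrative, not the question's dataset); newer pandas versions may word the message differently but still raise ValueError:

```python
import pandas as pd

# Illustrative rows only (hypothetical data mirroring the question's columns)
df = pd.DataFrame({
    "country": ["US", "US", "IT"],
    "product": ["yearly", "yearly", "monthly"],
    "date_install": ["2020-11-01"] * 3,
    "date_purchase": ["2020-11-01", "trialed", "2020-11-01"],
    "user_id": [1, 2, 3],
})

keys = ["country", "product", "date_install"]
exp = df.groupby(keys).count().reset_index()

# reset_index() turns the three group keys back into columns,
# so the result has four columns rather than one
summed = exp.groupby(keys)["date_purchase"].sum().reset_index()
print(summed.shape)

# Assigning a four-column DataFrame to a single new column raises ValueError
try:
    exp["total_installs"] = summed
    raised = False
except ValueError as err:
    raised = True
    print(err)
```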

I don't think the way I am going about this is right. What is the best approach/logic to get the expected result?

Update

After using @jezrael's answer:

df['date_purchase'] = df['date_purchase'].replace('trialed', np.nan)

exp = (df.groupby(['country','product','date_install']).agg(installs = ('date_purchase','size'), purchases = ('date_purchase','count')))
exp['ratio'] = exp['purchases'].div(exp['installs'])
exp = exp.reset_index()
exp[(exp['date_install']=='2020-11-18') & (exp['country']=='US')]

This returns:

country     product         date_install        installs    purchases   ratio
US          catalog30US     2020-11-18          1           1           1.0
US          trialed         2020-11-18          4924        0           0.0
US          renders.100     2020-11-18          2           2           1.0
US          renders.20      2020-11-18          3           3           1.0
US          monthly         2020-11-18          37          37          1.0
US          yearly          2020-11-18          6           6           1.0
US          textures        2020-11-18          1           1           1.0

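This is not correct, because installs in every row should be the total number of installs for the given country & date_install combination.

In the output above, the installs value needs to be the sum of all installs for that country & day. In this case every installs value should be 1+4924+2+3+37+6+1 = 4974, which is the true number of installs for that country & day; then the ratio would make sense. Right now installs == purchases for every purchased product, which is not correct. The question I am trying to answer is: for a given day and country, how many people installed & purchased different products, and what is their ratio?

I need it to be: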

country     product         date_install        installs    purchases   ratio
US          catalog30US     2020-11-18          4974        1           1 / 4974
US          trialed         2020-11-18          4974        0           0.0
US          renders.100     2020-11-18          4974        2           2 / 4974
US          renders.20      2020-11-18          4974        3           3 / 4974
US          monthly         2020-11-18          4974        37          37 / 4974
US          yearly          2020-11-18          4974        6           6 / 4974
US          textures        2020-11-18          4974        1           1 / 4974
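
As a quick check of the arithmetic above, the per-product counts from the returned table can be summed per country and day and broadcast back onto each row. This is only a sketch: the frame is rebuilt by hand from the numbers shown in the update, not from the real data.

```python
import pandas as pd

# Per-product counts copied from the table returned in the update
exp = pd.DataFrame({
    "country": ["US"] * 7,
    "product": ["catalog30US", "trialed", "renders.100", "renders.20",
                "monthly", "yearly", "textures"],
    "date_install": ["2020-11-18"] * 7,
    "installs": [1, 4924, 2, 3, 37, 6, 1],
    "purchases": [1, 0, 2, 3, 37, 6, 1],
})

# Broadcast the country/day total of installs onto every product row
exp["installs"] = (
    exp.groupby(["country", "date_install"])["installs"].transform("sum")
)
exp["ratio"] = exp["purchases"] / exp["installs"]
print(exp)
```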

1 answer
User
#1 · Posted 2024-04-20 04:09:51

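I think you need to aggregate with size for a count that includes missing values and with count for a count that excludes missing values, then divide the columns: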

import numpy as np

# mark "no purchase" rows as missing so that count skips them
df['date_purchase'] = df['date_purchase'].replace('trialed', np.nan)

exp = (df.groupby(['country','product','date_install'])
         .agg(installs = ('date_purchase','size'), purchases = ('date_purchase','count')))

#sum per country and install date
exp['installs'] = exp.groupby(['country','date_install'])['installs'].transform('sum')
exp['ratio'] = exp['purchases'].div(exp['installs'])

exp = exp.reset_index()
print(exp)
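
For anyone who wants to run the approach end to end, here is a self-contained sketch. It uses the sample rows from the question plus one hypothetical extra row (US, monthly, marked in the code) so that the per-country total visibly differs from the per-product count. It follows the same size/count/transform steps as above and is not tested against the full dataset.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; the last row is a hypothetical addition
df = pd.DataFrame({
    "country": ["BR", "CA", "US", "IT", "AE", "IT", "AE", "US"],
    "product": ["yearly", "monthly", "yearly", "monthly",
                "monthly", "monthly", "monthly", "monthly"],
    "date_install": ["2020-11-01"] * 8,
    "date_purchase": ["2020-11-01", "2020-11-01", "trialed", "2020-11-01",
                      "2020-11-01", "trialed", "trialed", "2020-11-01"],
    "user_id": [10660236, 10649441, 10660272, 10657634,
                10661442, 10657634, 10661442, 99999999],
})

# "trialed" means no purchase, so treat it as a missing value
df["date_purchase"] = df["date_purchase"].replace("trialed", np.nan)

# size counts all rows (installs); count skips NaN (purchases)
exp = df.groupby(["country", "product", "date_install"]).agg(
    installs=("date_purchase", "size"),
    purchases=("date_purchase", "count"),
)

# Replace per-product installs with the total per country and install date
exp["installs"] = (
    exp.groupby(["country", "date_install"])["installs"].transform("sum")
)
exp["ratio"] = exp["purchases"].div(exp["installs"])

exp = exp.reset_index()
print(exp)
```

In this sample, both US rows end up with installs equal to 2 (one yearly row and one monthly row), so the monthly ratio is 1 / 2 and the yearly ratio is 0.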
