I have data structured like this:
country product date_install date_purchase user_id
BR yearly 2020-11-01 2020-11-01 10660236
CA monthly 2020-11-01 2020-11-01 10649441
US yearly 2020-11-01 trialed 10660272
IT monthly 2020-11-01 2020-11-01 10657634
AE monthly 2020-11-01 2020-11-01 10661442
IT monthly 2020-11-01 trialed 10657634
AE monthly 2020-11-01 trialed 10661442
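I'm trying to get the purchases/installs ratio, along with the actual number of installs and purchases, for each country, product and date. date_install is the install date. date_purchase holds the purchase date when a purchase took place; a value of trialed in date_purchase means that the user with that user_id made no purchase.
The desired output should look like this: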
country product date_install installs purchases ratio
US daily 2021-02-05 100 20 0.2
US monthly 2021-02-05 100 50 0.5
US yearly 2021-02-05 100 50 0.5
US trialed 2021-02-05 100 0 0
# the next day
US daily 2021-02-06 500 50 0.1
US monthly 2021-02-06 500 100 0.2
US yearly 2021-02-06 500 250 0.5
US trialed 2021-02-06 500 0 0
# the rest of the countries & the rest of the days
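purchases would be the total number of date_purchase events for each day, country and product.
The idea is that, for a given country on day X, people installed an app; some of them purchased a product and others did not. Those who purchased have a date in date_purchase, while those who did not have the value trialed. The total number of people who installed the app, however, is the count of date_install for each country, product and install date.
What I tried: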
exp = (df.groupby(['country','product','date_install']).count()
       .sort_values('date_install',ascending=False).reset_index())
exp.groupby(['country','product','date_install'])['date_purchase'].sum().reset_index()
exp['total_installs'] = exp.groupby(['country','product','date_install'])['date_purchase'].sum().reset_index()
But I get an error:
ValueError: Wrong number of items passed 4, placement implies 1
I don't think the way I'm trying to achieve this is right. What is the best way/logic to get the expected result?
Update
After using @jezrael's answer:
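The error appears to come from the last line of the attempt: after `reset_index()`, the right-hand side is a DataFrame with four columns (the three grouping keys plus date_purchase), and it is being assigned to a single column. A minimal sketch on toy data (the values below are made up for illustration; only the column names come from the question) reproduces the shape mismatch:

```python
import pandas as pd

# Toy data with the question's column names; the values are illustrative only.
df = pd.DataFrame({
    "country": ["US", "US", "IT"],
    "product": ["yearly", "monthly", "monthly"],
    "date_install": ["2020-11-01", "2020-11-01", "2020-11-01"],
    "date_purchase": ["2020-11-01", "trialed", "2020-11-01"],
    "user_id": [1, 2, 3],
})

keys = ["country", "product", "date_install"]
exp = df.groupby(keys).count().reset_index()

# The right-hand side of the failing assignment is a DataFrame, not a Series.
rhs = exp.groupby(keys)["date_purchase"].sum().reset_index()
n_cols = rhs.shape[1]  # three key columns plus date_purchase

# Assigning a multi-column DataFrame to one column raises ValueError
# (the exact message text differs between pandas versions).
try:
    exp["total_installs"] = rhs
    raised = False
except ValueError:
    raised = True
```

Note also that `count()` already counts the string trialed as a value, so even a correctly shaped assignment would not separate purchases from trials.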
df['date_purchase'] = df['date_purchase'].replace('trialed', np.nan)
exp = (df.groupby(['country','product','date_install']).agg(installs = ('date_purchase','size'), purchases = ('date_purchase','count')))
exp['ratio'] = exp['purchases'].div(exp['installs'])
exp = exp.reset_index()
exp[(exp['date_install']=='2020-11-18') & (exp['country']=='US')]
This returns:
country product date_install installs purchases ratio
US catalog30US 2020-11-18 1 1 1.0
US trialed 2020-11-18 4924 0 0.0
US renders.100 2020-11-18 2 2 1.0
US renders.20 2020-11-18 3 3 1.0
US monthly 2020-11-18 37 37 1.0
US yearly 2020-11-18 6 6 1.0
US textures 2020-11-18 1 1 1.0
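This is not correct, because installs in each row should be the sum of all installs for the given country & date_install combination.
In the returned result, the installs value for a country & day needs to be the sum of all installs for that country & day. In this case, every installs value needs to be 1+4924+2+3+37+6+1, which is the true number of installs for the given country & day; only then would the ratio make sense. Right now installs == purchases, which is not correct. I'm trying to answer: for a given day and country, how many people installed & purchased different products and what is their ratio
I need it to be: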
country product date_install installs purchases ratio
US catalog30US 2020-11-18 4974 1 1 / 4974
US trialed 2020-11-18 4974 0 0.0
US renders.100 2020-11-18 4974 2 2 / 4974
US renders.20 2020-11-18 4974 3 3 / 4974
US monthly 2020-11-18 4974 37 37 / 4974
US yearly 2020-11-18 4974 6 6 / 4974
US textures 2020-11-18 4974 1 1 / 4974
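One possible way to get this shape is to count rows per product first and then broadcast the per-country, per-day total onto every row with `transform("sum")`. The sketch below assumes that every row is one install and that non-purchasers carry product == "trialed" with a missing date_purchase, as the output above suggests; the sample rows and the helper column name `rows` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for one country and day; non-purchasers carry
# product == "trialed" and a missing date_purchase.
df = pd.DataFrame({
    "country": ["US"] * 10,
    "product": ["monthly"] * 3 + ["yearly"] + ["trialed"] * 6,
    "date_install": ["2020-11-18"] * 10,
    "date_purchase": ["2020-11-18"] * 4 + [np.nan] * 6,
})

exp = (
    df.groupby(["country", "product", "date_install"])
      .agg(rows=("date_purchase", "size"), purchases=("date_purchase", "count"))
      .reset_index()
)

# Broadcast the per-(country, day) row total onto every product row.
exp["installs"] = exp.groupby(["country", "date_install"])["rows"].transform("sum")
exp["ratio"] = exp["purchases"] / exp["installs"]
exp = exp.drop(columns="rows")
```

With these ten rows, every product row gets installs equal to 10, so the ratios become 3/10 for monthly, 1/10 for yearly and 0 for trialed.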
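I think you need to aggregate with size for counts that include missing values and with count for counts that exclude missing values (the two aggregations used in the code quoted in the update above), and then divide the columns.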
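The answer's own code is not reproduced here. As a rough sketch of what the sentence describes, using the seven sample rows from the question and first turning trialed into a missing value, the aggregation might look like this:

```python
import numpy as np
import pandas as pd

# The seven sample rows from the question.
df = pd.DataFrame({
    "country": ["BR", "CA", "US", "IT", "AE", "IT", "AE"],
    "product": ["yearly", "monthly", "yearly", "monthly",
                "monthly", "monthly", "monthly"],
    "date_install": ["2020-11-01"] * 7,
    "date_purchase": ["2020-11-01", "2020-11-01", "trialed", "2020-11-01",
                      "2020-11-01", "trialed", "trialed"],
    "user_id": [10660236, 10649441, 10660272, 10657634,
                10661442, 10657634, 10661442],
})

# Mark "no purchase" as a missing value so that count skips it.
df["date_purchase"] = df["date_purchase"].replace("trialed", np.nan)

out = (
    df.groupby(["country", "product", "date_install"])
      .agg(
          installs=("date_purchase", "size"),    # all rows, NaN included
          purchases=("date_purchase", "count"),  # non-missing rows only
      )
      .reset_index()
)
out["ratio"] = out["purchases"] / out["installs"]
```

For the AE monthly group this gives installs of 2 and purchases of 1, so the ratio is 0.5; for US yearly, where the only row is a trial, the ratio is 0.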