Python:计算一段时间内数据帧中的累积量

2024-04-18 23:49:03 发布

您现在位置:Python中文网/ 问答频道 /正文

目标:计算自2020-01-01年以来的累计收入

我有一个python字典,如下所示

data = [{"game_id":"Racing","user_id":"ABC123","amt":5,"date":"2020-01-01"},
    {"game_id":"Racing","user_id":"ABC123","amt":1,"date":"2020-01-04"},
    {"game_id":"Racing","user_id":"CDE123","amt":1,"date":"2020-01-04"},
    {"game_id":"DH","user_id":"CDE123","amt":100,"date":"2020-01-03"},
    {"game_id":"DH","user_id":"CDE456","amt":10,"date":"2020-01-02"},
    {"game_id":"DH","user_id":"CDE789","amt":5,"date":"2020-01-02"},
    {"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"},
    {"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"}]

上面的同一本字典看起来像一张桌子

   game_id   user_id  amt  activity date
  'Racing', 'ABC123', 5,   '2020-01-01'
  'Racing', 'ABC123', 1,   '2020-01-04'
  'Racing', 'CDE123', 1,   '2020-01-04'
  'DH',     'CDE123', 100, '2020-01-03'
  'DH',     'CDE456', 10,  '2020-01-02'
  'DH', '    CDE789', 5,   '2020-01-02'
  'DH',     'CDE456', 1,   '2020-01-03'
  'DH',     'CDE456', 1,   '2020-01-03'

年龄计算为交易日期与2020-01-01之间的差异。总付款人计数是每个游戏中付款人的数量

我正在尝试创建一个数据帧,该数据帧具有从第一个事务的当天到事务的第二天的每天的累积结果。对于game_id赛车,我们从2020-01-01的5开始,所以年龄是0。在2020年1月2日,金额仍然是5,因为我们当天没有交易。2020年1月3日,金额为5。但在2020-01-04,金额为7,因为我们在这一天有2笔交易

预期产出

Game    Age    Cum_rev    Total_unique_payers_per_game
Racing  0      5          2
Racing  1      5          2
Racing  2      5          2
Racing  3      7          2
DH      0      0          3
DH      1      15         3
DH      2      117        3
DH      3      117        3

如何在python中使用窗口函数,就像我们在SQL中使用窗口函数一样。有没有更好的办法来解决这个问题


Tags: idgamedate字典交易金额dh年龄
1条回答
网友
1楼 · 发布于 2024-04-18 23:49:03

这里非常复杂的部分是填写日期。我用了一个应用程序,但我不确定这是最好的方式

import pandas as pd

data = [{"game_id":"Racing","user_id":"ABC123","amt":5,"date":"2020-01-01"},
        {"game_id":"Racing","user_id":"ABC123","amt":1,"date":"2020-01-04"},
        {"game_id":"Racing","user_id":"CDE123","amt":1,"date":"2020-01-04"},
        {"game_id":"DH","user_id":"CDE123","amt":100,"date":"2020-01-03"},
        {"game_id":"DH","user_id":"CDE456","amt":10,"date":"2020-01-02"},
        {"game_id":"DH","user_id":"CDE789","amt":5,"date":"2020-01-02"},
        {"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"},
        {"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"}]

df = pd.DataFrame(data)
# we want datetime not object
df["date"] = df["date"].astype("M8[us]")

# we will need to merge this at the end
grp = df.groupby("game_id")['user_id']\
        .nunique()\
        .reset_index(name="Total_unique_payers_per_game")

# sum amt per game_id date
df = df.groupby(["game_id", "date"])["amt"].sum().reset_index()

# dates from 2020-01-01 till the max date in df
dates = pd.DataFrame({"date": pd.date_range("2020-01-01", df["date"].max())})

# add missing dates
def expand_dates(x):
    x = pd.merge(dates, x.drop("game_id", axis=1), how="left")
    x["amt"] = x["amt"].fillna(0)
    return x

df = df.groupby("game_id")\
       .apply(expand_dates)\
       .reset_index().drop("level_1", axis=1)

df["Cum_rev"] = df.groupby("game_id")['amt'].transform("cumsum")

# this is equivalent as long as data is sorted
# df["Cum_rev"] = df.groupby("game_id")['amt'].cumsum()

# merge unique payers per game
df = pd.merge(df, grp, how="left")

# dates difference
df["Age"] = "2020-01-01"
df["Age"] = df["Age"].astype("M8[us]")
df["Age"] = (df["date"]-df["Age"]).dt.days


# then you can eventually filter
df = df[["game_id", "Age", 
         "Cum_rev", "Total_unique_payers_per_game"]]\
       .rename(columns={"game_id":"Game"})

相关问题 更多 >