Conditionally aggregating grouped data in Python/Pandas across many columns, with a different operation per column

Published 2024-06-07 05:46:14


Consider the following simplified example DataFrame `df`:

   Department  CustomerID          Date  Price  MenswearDemand  HomeDemand
0    Menswear      418089    2019-04-18    199             199           0
1    Menswear      613573    2019-04-24    199             199           0
2    Menswear      161840    2019-04-25    199             199           0
3    Menswear     2134926    2019-04-29    199             199           0
4    Menswear      984801    2019-04-30     19              19           0
5        Home      398555    2019-01-27     52               0          52
6    Menswear      682906    2019-02-03     97              97           0
7    Menswear      682906    2019-02-03     97              97           0
8    Menswear      923491    2019-02-09     80              80           0
9    Menswear     1098782    2019-02-25    258             258           0
10   Menswear      721696    2019-03-25     12              12           0
11   Menswear      695706    2019-04-10    129             129           0
12  Underwear      637026    2019-01-18    349               0           0
13  Underwear      205997    2019-01-25    279               0           0
14  Underwear      787984    2019-02-01     27               0           0
15  Underwear      318256    2019-02-01    279               0           0
16  Underwear      570454    2019-02-14    262               0           0
17  Underwear     1239118    2019-02-28    279               0           0
18       Home     1680791    2019-04-04   1398               0        1398

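I want to group this data by 'CustomerID' and then:

  1. Convert the purchase date 'Date' into the number of days before a cutoff date, given as '2021-01-01'. This is simply the time from the customer's most recent purchase to the cutoff. (The expected output and the code below actually use a cutoff of one day after the latest purchase in the data, i.e. 2019-05-01.)
  2. Sum all remaining demand columns, which in this example are only 'MenswearDemand' and 'HomeDemand'.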

The result I should get is:

            Date  MenswearDemand  HomeDemand
CustomerID                                  
161840         6             199           0
205997        96               0           0
318256        89               0           0
398555        94               0          52
418089        13             199           0
570454        76               0           0
613573         7             199           0
637026       103               0           0
682906        87             194           0
695706        21             129           0
721696        37              12           0
787984        89               0           0
923491        81              80           0
984801         1              19           0
1098782       65             258           0
1239118       62               0           0
1680791       27               0        1398
2134926        2             199           0

This is how I managed to solve it:

import datetime as dt

import pandas as pd

# Parse the dates, then set the cutoff to one day after the latest purchase
df['Date'] = pd.to_datetime(df['Date'])
cutoffDate = df['Date'].max() + dt.timedelta(days = 1)
newdf = df.groupby('CustomerID').agg({'Date': lambda x: (cutoffDate - x.max()).days,
                                      'MenswearDemand': lambda x: x.sum(),
                                      'HomeDemand': lambda x: x.sum()})

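In reality, however, I have about 15 million rows and 30 demand columns. I really don't want to write 'DemandColumn': lambda x: x.sum() for every one of them in the aggregation function, since they should all simply be summed. Is there a better way, such as passing in an array with the subset of columns that a given operation should be applied to?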


2 Answers

Just offering an alternative based on convtools:

from datetime import datetime, timedelta

from convtools import conversion as c
from convtools.contrib.tables import Table


# this way you can define multiple meaningful metrics
metric_to_config = {
    "sum_MenswearDemand": {
        "column": "MenswearDemand",
        "reducer": c.ReduceFuncs.Sum,
    },
    "sum_HomeDemand": {"column": "HomeDemand", "reducer": c.ReduceFuncs.Sum},
    "median_Age": {"column": "Age", "reducer": c.ReduceFuncs.Median},
}

# pass required metric names as input
required_metrics = ["sum_MenswearDemand", "sum_HomeDemand"]

# prepare aggregation config
parse_date = c.call_func(datetime.strptime, c.this(), "%Y-%m-%d").call_method(
    "date"
)
aggregate_config = {
    "CustomerID": c.item("CustomerID"),
    "Date": c.ReduceFuncs.Max(c.item("Date")).pipe(parse_date),
}
for metric in required_metrics:
    config = metric_to_config[metric]
    reducer = config["reducer"]
    column = config["column"]
    aggregate_config[metric] = reducer(c.item(column))

# this is where code generation happens
converter = (
    c.group_by(c.item("CustomerID"))
    .aggregate(aggregate_config)
    .pipe(
        # total max is calculated below and saved under "max_date" label
        # here we replace "Date" with day diffs
        c.iter_mut(
            c.Mut.set_item(
                "Date", (c.label("max_date") - c.item("Date")).attr("days")
            )
        ),
        # calculate max date from aggregation results
        label_input={
            "max_date": (
                c.call_func(max, c.iter(c.item("Date")))
                + timedelta(days=1)
            )
        },
    )
    .gen_converter()
)

# reading required columns from input csv file
rows = (
    Table.from_csv(
        "tmp/input_1.csv",
        header=True,
        dialect=Table.csv_dialect(delimiter="\t"),
    )
    .take(
        "CustomerID",
        "Date",
        *{metric_to_config[metric]["column"] for metric in required_metrics},
    )
    .into_iter_rows(dict)
)
# aggregating input rows
iterable_of_results = converter(rows)

# outputting to csv file if needed
Table.from_rows(iterable_of_results).into_csv(
    "tmp/out.csv",
    dialect=Table.csv_dialect(delimiter="\t"),
)

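The upside of this library is that it is lightweight, has no dependencies, and allows stream processing. Because the generated code is simple, it is sometimes no slower than pandas/polars.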

If the column names are known in advance, you can build the dictionary before passing it to the agg function:

...
cutoffDate = df['Date'].max() + dt.timedelta(days=1)

agg_dict = {'Date': lambda x: (cutoffDate - x.max()).days}

DemandColumns = ['MenswearDemand', 'HomeDemand']
f = lambda x: x.sum()
agg_dict.update({col_name: f for col_name in DemandColumns})

newdf = df.groupby('CustomerID').agg(agg_dict)
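
A possible variation on the dictionary approach, as a minimal sketch: with around 15 million rows, the built-in aggregation names 'max' and 'sum' are typically much faster than Python lambdas, and the day difference can be computed once on the aggregated result. The three-row `df` and the names `cutoff_date` and `demand_columns` below are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical miniature of the question's data (three rows only).
df = pd.DataFrame({
    "CustomerID": [682906, 682906, 984801],
    "Date": pd.to_datetime(["2019-02-03", "2019-02-03", "2019-04-30"]),
    "MenswearDemand": [97, 97, 19],
    "HomeDemand": [0, 0, 0],
})

# Cutoff as in the question's code: one day after the latest purchase.
cutoff_date = df["Date"].max() + pd.Timedelta(days=1)

demand_columns = ["MenswearDemand", "HomeDemand"]

# String names dispatch to pandas' built-in grouped implementations.
agg_dict = {"Date": "max", **{col: "sum" for col in demand_columns}}
newdf = df.groupby("CustomerID").agg(agg_dict)

# Turn each customer's latest purchase date into days before the cutoff.
newdf["Date"] = (cutoff_date - newdf["Date"]).dt.days

print(newdf)
```

With 30 demand columns, only the `demand_columns` list would need to change; the dictionary comprehension covers the rest.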

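Another option (still knowing the column names, i.e. DemandColumns from the previous example) is to first compute the Date column with the agg function, and then use the `filter` function, passing the list of desired columns as the items argument so that only those exact columns are kept: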

...
cutoffDate = df['Date'].max() + dt.timedelta(days=1)
groups = df.groupby('CustomerID')
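# Restrict the date aggregation to the Date column; on other columns the subtraction would fail
newdf = groups['Date'].agg(lambda x: (cutoffDate - x.max()).days)
newdf = pd.concat([newdf, groups.apply(lambda x: x.filter(items=DemandColumns).agg('sum'))], axis=1)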
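
As a hedged alternative to `apply`, which calls a Python function once per group, the demand columns can be selected up front with `DataFrame.filter(items=...)` and summed in a single grouped call. This is only a sketch: the three-row `df` is made up for illustration, and `cutoff_date`, `demand_columns`, `days` and `sums` are names introduced here.

```python
import pandas as pd

# Hypothetical sample rows following the question's schema.
df = pd.DataFrame({
    "Department": ["Menswear", "Home", "Menswear"],
    "CustomerID": [161840, 398555, 161840],
    "Date": pd.to_datetime(["2019-04-25", "2019-01-27", "2019-04-20"]),
    "Price": [199, 52, 10],
    "MenswearDemand": [199, 0, 10],
    "HomeDemand": [0, 52, 0],
})

cutoff_date = df["Date"].max() + pd.Timedelta(days=1)
demand_columns = ["MenswearDemand", "HomeDemand"]

# Days from each customer's latest purchase to the cutoff.
days = (cutoff_date - df.groupby("CustomerID")["Date"].max()).dt.days

# filter(items=...) keeps exactly the listed columns. Grouping by the
# CustomerID Series works although it is not a column of `demand`,
# because the two objects share the same row index.
demand = df.filter(items=demand_columns)
sums = demand.groupby(df["CustomerID"]).sum()

# Both pieces are indexed by CustomerID, so they line up column-wise.
newdf = pd.concat([days, sums], axis=1)
print(newdf)
```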

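If the required columns (DemandColumns) follow a given pattern, you can skip building the list and use the filter function with its regex argument instead. In this case, the regex '.*Demand$' returns all columns whose names end with the string Demand: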

newdf = pd.concat([newdf, groups.apply(lambda x: x.filter(regex='.*Demand$').agg('sum'))], axis=1)
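
To check what the pattern selects before using it in the aggregation, a small sketch may help. The frame below is hypothetical, and `DemandScore` is an invented column included only to show a name that should not match. Since `filter(regex=...)` searches anywhere in the label, the shorter pattern 'Demand$' selects the same columns as '.*Demand$'.

```python
import pandas as pd

# Hypothetical frame; DemandScore contains "Demand" but does not end with it.
df = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "Price": [10, 20, 30],
    "MenswearDemand": [10, 20, 0],
    "HomeDemand": [0, 0, 30],
    "DemandScore": [1, 2, 3],
})

# Keep only the columns whose names end with "Demand".
selected = df.filter(regex="Demand$")
print(list(selected.columns))

# Sum the selected columns per customer in one grouped call.
sums = selected.groupby(df["CustomerID"]).sum()
print(sums)
```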
