Conditionally aggregating grouped data across many columns with different operations in Python/Pandas

    Department  CustomerID        Date  Price  MenswearDemand  HomeDemand
 0    Menswear      418089  2019-04-18    199             199           0
 1    Menswear      613573  2019-04-24    199             199           0
 2    Menswear      161840  2019-04-25    199             199           0
 3    Menswear     2134926  2019-04-29    199             199           0
 4    Menswear      984801  2019-04-30     19              19           0
 5        Home      398555  2019-01-27     52               0          52
 6    Menswear      682906  2019-02-03     97              97           0
 7    Menswear      682906  2019-02-03     97              97           0
 8    Menswear      923491  2019-02-09     80              80           0
 9    Menswear     1098782  2019-02-25    258             258           0
10    Menswear      721696  2019-03-25     12              12           0
11    Menswear      695706  2019-04-10    129             129           0
12   Underwear      637026  2019-01-18    349               0           0
13   Underwear      205997  2019-01-25    279               0           0
14   Underwear      787984  2019-02-01     27               0           0
15   Underwear      318256  2019-02-01    279               0           0
16   Underwear      570454  2019-02-14    262               0           0
17   Underwear     1239118  2019-02-28    279               0           0
18        Home     1680791  2019-04-04   1398               0        1398

            Date  MenswearDemand  HomeDemand
CustomerID
161840         6             199           0
205997        96               0           0
318256        89               0           0
398555        94               0          52
418089        13             199           0
570454        76               0           0
613573         7             199           0
637026       103               0           0
682906        87             194           0
695706        21             129           0
721696        37              12           0
787984        89               0           0
923491        81              80           0
984801         1              19           0
1098782       65             258           0
1239118       62               0           0
1680791       27               0        1398
2134926        2             199           0

import datetime as dt
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
cutoffDate = df['Date'].max() + dt.timedelta(days=1)
newdf = df.groupby('CustomerID').agg({'Date': lambda x: (cutoffDate - x.max()).days,
                                      'MenswearDemand': lambda x: x.sum(),
                                      'HomeDemand': lambda x: x.sum()})

2 answers

User

#1 · edited 2024-06-07 05:46:14

Just offering an alternative based on convtools:

from datetime import datetime, timedelta

from convtools import conversion as c
from convtools.contrib.tables import Table


# this way you can define multiple meaningful metrics
metric_to_config = {
    "sum_MenswearDemand": {
        "column": "MenswearDemand",
        "reducer": c.ReduceFuncs.Sum,
    },
    "sum_HomeDemand": {"column": "HomeDemand", "reducer": c.ReduceFuncs.Sum},
    "median_Age": {"column": "Age", "reducer": c.ReduceFuncs.Median},
}

# pass required metric names as input
required_metrics = ["sum_MenswearDemand", "sum_HomeDemand"]

# prepare aggregation config
parse_date = c.call_func(datetime.strptime, c.this(), "%Y-%m-%d").call_method(
    "date"
)
aggregate_config = {
    "CustomerID": c.item("CustomerID"),
    "Date": c.ReduceFuncs.Max(c.item("Date")).pipe(parse_date),
}
for metric in required_metrics:
    config = metric_to_config[metric]
    reducer = config["reducer"]
    column = config["column"]
    aggregate_config[metric] = reducer(c.item(column))

# this is where code generation happens
converter = (
    c.group_by(c.item("CustomerID"))
    .aggregate(aggregate_config)
    .pipe(
        # total max is calculated below and saved under "max_date" label
        # here we replace "Date" with day diffs
        c.iter_mut(
            c.Mut.set_item(
                "Date", (c.label("max_date") - c.item("Date")).attr("days")
            )
        ),
        # calculate max date from aggregation results
        label_input={
            "max_date": (
                c.call_func(max, c.iter(c.item("Date")))
                + timedelta(days=1)
            )
        },
    )
    .gen_converter()
)

# reading required columns from input csv file
rows = (
    Table.from_csv(
        "tmp/input_1.csv",
        header=True,
        dialect=Table.csv_dialect(delimiter="\t"),
    )
    .take(
        "CustomerID",
        "Date",
        *{metric_to_config[metric]["column"] for metric in required_metrics},
    )
    .into_iter_rows(dict)
)
# aggregating input rows
iterable_of_results = converter(rows)

# outputting to csv file if needed
Table.from_rows(iterable_of_results).into_csv(
    "tmp/out.csv",
    dialect=Table.csv_dialect(delimiter="\t"),
)

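The upside of this library is that it is lightweight, has no dependencies, and supports stream processing. Because the generated code is simple, it is sometimes no slower than pandas/polars.

User

#2 · edited 2024-06-07 05:46:14

If the column names are known in advance, you can build the dictionary before passing it to the agg function: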

...
cutoffDate = df['Date'].max() + dt.timedelta(days=1)

agg_dict = {'Date': lambda x: (cutoffDate - x.max()).days}

DemandColumns = ['MenswearDemand', 'HomeDemand']
f = lambda x: x.sum()
agg_dict.update({col_name: f for col_name in DemandColumns})

newdf = df.groupby('CustomerID').agg(agg_dict)
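
For anyone who wants to try the dictionary approach end to end, here is a minimal self-contained sketch. The four rows are copied from the question's sample data; the snake_case names (cutoff_date, demand_columns) and the use of the string 'sum' in place of the lambda are my own choices, not part of the answer above.

```python
import datetime as dt

import pandas as pd

# A few rows copied from the question's sample data (subset, for illustration).
df = pd.DataFrame(
    {
        "CustomerID": [682906, 682906, 398555, 1680791],
        "Date": ["2019-02-03", "2019-02-03", "2019-01-27", "2019-04-04"],
        "MenswearDemand": [97, 97, 0, 0],
        "HomeDemand": [0, 0, 52, 1398],
    }
)
df["Date"] = pd.to_datetime(df["Date"])

# One day after the latest purchase in this subset.
cutoff_date = df["Date"].max() + dt.timedelta(days=1)

demand_columns = ["MenswearDemand", "HomeDemand"]

# Days since last purchase for Date, a plain sum for every demand column.
agg_dict = {"Date": lambda x: (cutoff_date - x.max()).days}
agg_dict.update({col: "sum" for col in demand_columns})

newdf = df.groupby("CustomerID").agg(agg_dict)
print(newdf)
```

Adding another department then only means appending its column name to demand_columns. Note that the day counts differ from the question's output because the cutoff date here comes from the four-row subset only.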

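Another option (with known column names, DemandColumns from the previous example) is to first compute the Date column with the agg function, then use the filter function, passing the list of desired columns as the items argument, to keep only those exact columns: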

...
cutoffDate = df['Date'].max() + dt.timedelta(days=1)
groups = df.groupby('CustomerID')
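# restrict the recency lambda to the Date column; applied to every column it fails on the non-date ones
newdf = groups.agg({'Date': lambda x: (cutoffDate - x.max()).days})
newdf = pd.concat([newdf, groups.apply(lambda x: x.filter(items=DemandColumns).agg('sum'))], axis=1)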

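If the desired columns (DemandColumns) follow a given pattern, you can skip creating the list and use the filter function with the regex argument instead. In this case, the regex '.*Demand$' returns all columns whose names end with the string Demand: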

newdf = pd.concat([newdf, groups.apply(lambda x: x.filter(regex='.*Demand$').agg('sum'))], axis=1)
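
A possible variant of the regex idea, sketched under my own assumptions: instead of calling filter inside groups.apply, pick the matching column names once with DataFrame.filter and hand that list to an ordinary groupby sum. This is a substitution on my part, not the code from the answer, and the sample rows are again a subset of the question's data.

```python
import datetime as dt

import pandas as pd

# Subset of the question's sample data (for illustration only).
df = pd.DataFrame(
    {
        "Department": ["Menswear", "Menswear", "Home", "Home"],
        "CustomerID": [682906, 682906, 398555, 1680791],
        "Date": pd.to_datetime(
            ["2019-02-03", "2019-02-03", "2019-01-27", "2019-04-04"]
        ),
        "MenswearDemand": [97, 97, 0, 0],
        "HomeDemand": [0, 0, 52, 1398],
    }
)
cutoff_date = df["Date"].max() + dt.timedelta(days=1)

# Column names ending in "Demand", resolved once up front.
demand_columns = df.filter(regex="Demand$").columns.tolist()

groups = df.groupby("CustomerID")

# Days between the cutoff and each customer's latest purchase.
recency = (cutoff_date - groups["Date"].max()).dt.days

# Per-customer totals for every matched demand column.
sums = groups[demand_columns].sum()

newdf = pd.concat([recency, sums], axis=1)
print(newdf)
```

The result has the same shape as the concat version above (Date plus one column per match). Resolving the names up front keeps the aggregation on the built-in groupby sum rather than a Python-level apply.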
