
Published 2024-06-07 05:46:14


Consider the following simplified example DataFrame df:

Department     CustomerID    Date          Price     MenswearDemand  HomeDemand
0    Menswear      418089    2019-04-18    199             199           0
1    Menswear      613573    2019-04-24    199             199           0
2    Menswear      161840    2019-04-25    199             199           0
3    Menswear     2134926    2019-04-29    199             199           0
4    Menswear      984801    2019-04-30     19              19           0
5        Home      398555    2019-01-27     52               0          52
6    Menswear      682906    2019-02-03     97              97           0
7    Menswear      682906    2019-02-03     97              97           0
8    Menswear      923491    2019-02-09     80              80           0
9    Menswear     1098782    2019-02-25    258             258           0
10   Menswear      721696    2019-03-25     12              12           0
11   Menswear      695706    2019-04-10    129             129           0
12  Underwear      637026    2019-01-18    349               0           0
13  Underwear      205997    2019-01-25    279               0           0
14  Underwear      787984    2019-02-01     27               0           0
15  Underwear      318256    2019-02-01    279               0           0
16  Underwear      570454    2019-02-14    262               0           0
17  Underwear     1239118    2019-02-28    279               0           0
18       Home     1680791    2019-04-04   1398               0        1398


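For each 'CustomerID' I would like to:

  1. Convert the purchase date 'Date' into the number of days before a cutoff date, i.e. '2021-01-01'. This is exactly the time from the customer's most recent purchase to '2021-01-01'
  2. Sum all remaining demand columns, in this example only 'MenswearDemand' and 'HomeDemand'

The result should look like this (the numbers shown use the day after the latest purchase date, 2019-05-01, as the cutoff, which is what the code further down computes):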


            Date  MenswearDemand  HomeDemand
161840         6             199           0
205997        96               0           0
318256        89               0           0
398555        94               0          52
418089        13             199           0
570454        76               0           0
613573         7             199           0
637026       103               0           0
682906        87             194           0
695706        21             129           0
721696        37              12           0
787984        89               0           0
923491        81              80           0
984801         1              19           0
1098782       65             258           0
1239118       62               0           0
1680791       27               0        1398
2134926        2             199           0


The following works for this example:

import datetime as dt

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
cutoffDate = df['Date'].max() + dt.timedelta(days=1)
newdf = df.groupby('CustomerID').agg({'Date': lambda x: (cutoffDate - x.max()).days,
                                      'MenswearDemand': lambda x: x.sum(),
                                      'HomeDemand': lambda x: x.sum()})

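In reality, however, I have about 15 million rows and 30 demand columns. I really don't want to write 'DemandColumn': lambda x: x.sum() for every one of them in the aggregation function, since they should all simply be summed. Is there a better way, such as passing in an array of the columns that a given operation should apply to?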



from datetime import datetime, timedelta

from convtools import conversion as c
from convtools.contrib.tables import Table

# this way you can define multiple meaningful metrics
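metric_to_config = {
    "sum_MenswearDemand": {
        "column": "MenswearDemand",
        "reducer": c.ReduceFuncs.Sum,
    },
    "sum_HomeDemand": {"column": "HomeDemand", "reducer": c.ReduceFuncs.Sum},
    "median_Age": {"column": "Age", "reducer": c.ReduceFuncs.Median},
}

# pass required metric names as input
required_metrics = ["sum_MenswearDemand", "sum_HomeDemand"]

# prepare aggregation config
parse_date = c.call_func(datetime.strptime, c.this(), "%Y-%m-%d").call_method(
    "date"
)
aggregate_config = {
    "CustomerID": c.item("CustomerID"),
    "Date": c.ReduceFuncs.Max(c.item("Date")).pipe(parse_date),
}
for metric in required_metrics:
    config = metric_to_config[metric]
    reducer = config["reducer"]
    column = config["column"]
    # values read from a CSV file are strings, so cast before reducing
    # (assumes integer-valued columns)
    aggregate_config[metric] = reducer(c.item(column).as_type(int))

# this is where code generation happens
# (the group_by/aggregate/pipe wiring is an assumed reconstruction of lines
# lost from this page; check it against the convtools documentation)
converter = (
    c.group_by(c.item("CustomerID"))
    .aggregate(aggregate_config)
    .pipe(
        # total max is calculated below and saved under "max_date" label
        # here we replace "Date" with day diffs
        c.iter_mut(
            c.Mut.set_item(
                "Date", (c.label("max_date") - c.item("Date")).attr("days")
            )
        ),
        # calculate max date from aggregation results
        label_input={
            "max_date": (
                c.call_func(max, c.iter(c.item("Date")))
                + timedelta(days=1)
            )
        },
    )
    .gen_converter()
)

# reading required columns from input csv file
# ("input.csv" is a placeholder path, not taken from the original answer)
rows = (
    Table.from_csv("input.csv", header=True)
    .take(
        "CustomerID",
        "Date",
        *{metric_to_config[metric]["column"] for metric in required_metrics},
    )
    .into_iter_rows(dict)
)

# aggregating input rows
iterable_of_results = converter(rows)

# outputting to csv file if needed
# ("output.csv" is a placeholder path as well)
Table.from_rows(iterable_of_results, header=True).into_csv("output.csv")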


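# build the aggregation dict programmatically instead of writing each entry by hand
cutoffDate = df['Date'].max() + dt.timedelta(days=1)

agg_dict = {'Date': lambda x: (cutoffDate - x.max()).days}

# every column listed here is summed; the string 'sum' uses pandas' built-in
# aggregation, which avoids calling a Python lambda once per group
DemandColumns = ['MenswearDemand', 'HomeDemand']
agg_dict.update({col_name: 'sum' for col_name in DemandColumns})

newdf = df.groupby('CustomerID').agg(agg_dict)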
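
With 30 demand columns, even the list of names can be derived from the frame rather than typed out. The following is a minimal, self-contained sketch of the same dict-building idea on a few rows of the sample data. It assumes that every column to be summed ends in "Demand"; that naming rule, and the names cutoff_date and demand_columns, are illustrative assumptions rather than part of the answer.

```python
import pandas as pd

# A few rows of the sample data from the question
df = pd.DataFrame({
    "CustomerID": [682906, 682906, 398555, 984801],
    "Date": ["2019-02-03", "2019-02-03", "2019-01-27", "2019-04-30"],
    "Price": [97, 97, 52, 19],
    "MenswearDemand": [97, 97, 0, 19],
    "HomeDemand": [0, 0, 52, 0],
})
df["Date"] = pd.to_datetime(df["Date"])

# Cutoff is one day after the latest purchase, as in the question's code
cutoff_date = df["Date"].max() + pd.Timedelta(days=1)

# Assumption: the columns to sum are exactly those whose name ends in "Demand"
demand_columns = [col for col in df.columns if col.endswith("Demand")]

# One custom entry for 'Date', plus a generated 'sum' entry per demand column
agg_dict = {"Date": lambda x: (cutoff_date - x.max()).days}
agg_dict.update({col: "sum" for col in demand_columns})

newdf = df.groupby("CustomerID").agg(agg_dict)
print(newdf)
```

If the demand columns do not share a suffix, the list can be built some other way (for example, every column except a known set of non-demand columns); the rest of the sketch stays the same.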


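# alternative: aggregate 'Date' and the demand columns separately, then join the results
cutoffDate = df['Date'].max() + dt.timedelta(days=1)
groups = df.groupby('CustomerID')
# restrict the day-difference lambda to 'Date'; applied to every column it fails on non-date columns
newdf = groups['Date'].agg(lambda x: (cutoffDate - x.max()).days)
newdf = pd.concat([newdf, groups.apply(lambda x: x.filter(items=DemandColumns).sum())], axis=1)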


# or, in place of the previous concat line, select the demand columns by a regex on their names
newdf = pd.concat([newdf, groups.apply(lambda x: x.filter(regex='.*Demand$').sum())], axis=1)
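
Calling groups.apply with a lambda runs a Python function once per customer, which may be slow with about 15 million rows. A possible variant of the same "aggregate separately, then concat" idea that avoids per-group Python calls is sketched below. It is not taken from the answers; the names cutoff_date, demand_columns and days_since_last are illustrative, and it again assumes the demand columns can be found with the regex 'Demand$'.

```python
import pandas as pd

# A few rows of the sample data from the question
df = pd.DataFrame({
    "CustomerID": [682906, 682906, 398555, 984801],
    "Date": ["2019-02-03", "2019-02-03", "2019-01-27", "2019-04-30"],
    "Price": [97, 97, 52, 19],
    "MenswearDemand": [97, 97, 0, 19],
    "HomeDemand": [0, 0, 52, 0],
})
df["Date"] = pd.to_datetime(df["Date"])
cutoff_date = df["Date"].max() + pd.Timedelta(days=1)

# Assumption: demand columns are those whose name ends in "Demand"
demand_columns = df.filter(regex="Demand$").columns.tolist()

groups = df.groupby("CustomerID")

# Latest purchase per customer, then the day difference for all customers at once
days_since_last = (cutoff_date - groups["Date"].max()).dt.days

# Built-in grouped sum over just the demand columns
demand_sums = groups[demand_columns].sum()

newdf = pd.concat([days_since_last, demand_sums], axis=1)
print(newdf)
```

Both pieces are indexed by CustomerID, so pd.concat with axis=1 lines them up without an explicit merge.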
