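Pandas: SQL self-join with date criteria

3 Answers

User

Answer 1 · edited 2024-06-06 16:14:30

It looks like you need GroupBy + rolling. Implementing the logic exactly as it is written in SQL is likely to be expensive, because it would involve repeated loops. Take this dataframe as an example: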

        Date  ID  Var1
0 2015-01-01   1     0
1 2015-02-01   1     1
2 2015-03-01   1     2
3 2015-04-01   1     3
4 2015-05-01   1     4
5 2015-01-01   2     5
6 2015-02-01   2     6
7 2015-03-01   2     7
8 2015-04-01   2     8
9 2015-05-01   2     9

You can add a column that, per group, looks back over a fixed period and sums a variable. First define a function using pd.Series.rolling:

[Function definition not preserved in this copy.]
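A minimal sketch of such a function, consistent with the 70-day window and the .astype(int) call used later in this answer. The docstring and the exact body are assumptions, not the answer's original text:

```python
import pandas as pd

def lookbacker(x):
    """Sum over the trailing 70 days (window assumed from the '70D' used later)."""
    return x.rolling('70D').sum().astype(int)

# Example frame matching the one shown above
df = pd.DataFrame({
    'Date': list(pd.date_range('2015-01-01', periods=5, freq='MS')) * 2,
    'ID': [1] * 5 + [2] * 5,
    'Var1': range(10),
})

# The function expects a Series indexed by date, one ID at a time
ser_id1 = df[df['ID'] == 1].set_index('Date')['Var1']
ser_id2 = df[df['ID'] == 2].set_index('Date')['Var1']
print(lookbacker(ser_id1).tolist())
print(lookbacker(ser_id2).tolist())
```

A time-based window such as '70D' covers the half-open interval (t - 70 days, t], so each row sums its own value plus any earlier rows within the previous 70 days.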

Then apply it to the GroupBy object and extract the values for assignment:

df['Lookback_Sum'] = df.set_index('Date').groupby('ID')['Var1'].apply(lookbacker).values

print(df)

        Date  ID  Var1  Lookback_Sum
0 2015-01-01   1     0             0
1 2015-02-01   1     1             1
2 2015-03-01   1     2             3
3 2015-04-01   1     3             6
4 2015-05-01   1     4             9
5 2015-01-01   2     5             5
6 2015-02-01   2     6            11
7 2015-03-01   2     7            18
8 2015-04-01   2     8            21
9 2015-05-01   2     9            24

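pd.Series.rolling does not appear to work with month-based windows: using '2M' (2 months) instead of '70D' (70 days) raises a ValueError because the frequency is not fixed. This makes sense, since a "month" is ambiguous: months have different numbers of days.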
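A small sketch of that difference. The offset object pd.offsets.MonthEnd(2) is used here in place of the '2M' string as an assumption, to avoid depending on how a given pandas version spells the month alias:

```python
import pandas as pd

ser = pd.Series(range(5),
                index=pd.date_range('2015-01-01', periods=5, freq='MS'))

# A fixed-length window (70 days) is accepted
fixed = ser.rolling('70D').sum()

# A month-based window has no fixed length, so pandas rejects it
try:
    ser.rolling(pd.offsets.MonthEnd(2)).sum()
    month_window_ok = True
except ValueError as err:
    month_window_ok = False
    print(err)
```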

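Another point worth mentioning: you can use GroupBy + rolling directly and bypass apply, which may be more efficient, but this requires the index to be monotonic. For example, via sort_index: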

df['Lookback_Sum'] = df.set_index('Date').sort_index()\
                       .groupby('ID')['Var1'].rolling('70D').sum()\
                       .astype(int).values
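Assigning through .values relies on the rows of df already being in the same order as the GroupBy result. As a hedged alternative sketch (not part of the original answer), the result can be attached by key with a merge, assuming each (ID, Date) pair is unique:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': list(pd.date_range('2015-01-01', periods=5, freq='MS')) * 2,
    'ID': [1] * 5 + [2] * 5,
    'Var1': range(10),
})
# Shuffle the rows to show that the result does not depend on input order
df = df.sample(frac=1, random_state=0).reset_index(drop=True)

rolled = (df.sort_values(['ID', 'Date'])   # dates increasing within each ID
            .set_index('Date')
            .groupby('ID')['Var1']
            .rolling('70D')
            .sum()
            .astype(int)
            .rename('Lookback_Sum')
            .reset_index())                # columns: ID, Date, Lookback_Sum

# Attach by key, so the assignment does not rely on row order
out = df.merge(rolled, on=['ID', 'Date'], how='left')
print(out.sort_values(['ID', 'Date']))
```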

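User

Answer 2 · edited 2024-06-06 16:14:30

I don't think pandas.DataFrame.rolling() supports rolling-window aggregation over a number of calendar months; currently you must specify a fixed number of days or another fixed-length period.

But as @jpp mentioned, you can use Python loops to perform a rolling aggregation over a window specified in calendar months, where the number of days in each window varies depending on which part of the calendar you are rolling over.

The following approach is based on this SO answer and on @jpp's answer: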

import pandas as pd

# Build some example data:
# 3 unique IDs, each with 365 samples, one sample per day throughout 2015
df = pd.DataFrame({'Date': pd.date_range('2015-01-01', '2015-12-31', freq='D'),
                   'Var1': list(range(365))})
df = pd.concat([df] * 3)
df['ID'] = [1]*365 + [2]*365 + [3]*365
df.head()
        Date  Var1  ID
0 2015-01-01     0   1
1 2015-01-02     1   1
2 2015-01-03     2   1
3 2015-01-04     3   1
4 2015-01-05     4   1

# Define a lookback function that mimics rolling aggregation,
# but uses DateOffset() slicing, rather than a window of fixed size.
# Use .count() here as a sanity check; you will need .sum()
def lookbacker(ser): 
    return pd.Series([ser.loc[d - pd.offsets.DateOffset(months=3):d].count() 
                      for d in ser.index])

# By default, groupby.agg output is sorted by key. So make sure to 
# sort df by (ID, Date) before inserting the flattened groupby result 
# into a new column
df.sort_values(['ID', 'Date'], inplace=True)
df.set_index('Date', inplace=True)
df['window_size'] = df.groupby('ID')['Var1'].apply(lookbacker).values

# Manually check the resulting window sizes
df.head()
            Var1  ID  window_size
Date                             
2015-01-01     0   1            1
2015-01-02     1   1            2
2015-01-03     2   1            3
2015-01-04     3   1            4
2015-01-05     4   1            5

df.tail()
            Var1  ID  window_size
Date                             
2015-12-27   360   3           92
2015-12-28   361   3           92
2015-12-29   362   3           92
2015-12-30   363   3           92
2015-12-31   364   3           93

df[df.ID == 1].loc['2015-05-25':'2015-06-05']
            Var1  ID  window_size
Date                             
2015-05-25   144   1           90
2015-05-26   145   1           90
2015-05-27   146   1           90
2015-05-28   147   1           90
2015-05-29   148   1           91
2015-05-30   149   1           92
2015-05-31   150   1           93
2015-06-01   151   1           93
2015-06-02   152   1           93
2015-06-03   153   1           93
2015-06-04   154   1           93
2015-06-05   155   1           93

The last column gives the size of the lookback window in days, looking back from that date and including both the start and end dates.
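The function above uses .count() only as a sanity check. A sketch of the .sum() variant its comment calls for, written as an explicit loop over groups; the helper name add_lookback_sum, the Lookback_Sum column, and the small monthly frame are illustrative assumptions:

```python
import pandas as pd

def add_lookback_sum(df, months=3):
    """Add a per-ID sum of Var1 over [Date - months, Date], both ends included."""
    offset = pd.offsets.DateOffset(months=months)
    parts = []
    for _, grp in df.groupby('ID'):
        grp = grp.sort_values('Date')
        ser = pd.Series(grp['Var1'].to_numpy(),
                        index=pd.DatetimeIndex(grp['Date']))
        sums = [ser.loc[d - offset:d].sum() for d in ser.index]
        # Reuse the original row labels so the result aligns back onto df
        parts.append(pd.Series(sums, index=grp.index))
    out = df.copy()
    out['Lookback_Sum'] = pd.concat(parts)
    return out

df_small = pd.DataFrame({
    'Date': list(pd.date_range('2015-01-01', periods=5, freq='MS')) * 2,
    'ID': [1] * 5 + [2] * 5,
    'Var1': range(10),
})
result = add_lookback_sum(df_small)
print(result)
```

Because the slice includes both endpoints, the row for 2015-05-01 sums the values from 2015-02-01 through 2015-05-01. This sketch assumes the frame has a unique row index so the per-group results can be aligned back.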

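Looking "3 months" back from 2015-05-31 would land on 2015-02-31, but February 2015 has only 28 days. As the sequence 90, 91, 92, 93 in the sanity check above shows, this DateOffset approach maps the last four days of May to the last day of February: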

[Code snippet not preserved in this copy.]
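A small check of the behaviour just described, as a sketch; the variable names are illustrative:

```python
import pandas as pd

offset = pd.offsets.DateOffset(months=3)
days = ['2015-05-28', '2015-05-29', '2015-05-30', '2015-05-31']

# Subtracting three calendar months clips to the last valid day of February
starts = [pd.Timestamp(day) - offset for day in days]
for day, start in zip(days, starts):
    print(day, '->', start.date())
```

All four dates map to 2015-02-28, which is why the window grows by one day at a time over the last days of May.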

I don't know whether this matches SQL's behaviour, but either way you will want to test it and decide whether it makes sense in your case.

User

Answer 3 · edited 2024-06-06 16:14:30

You can do this with a lambda.

table1['sum_var1'] = table1.apply(lambda row: findSum(row), axis=1)

We need to write a function for findSum.

The complete example is:

[Complete example not preserved in this copy.]
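As a purely hypothetical sketch of what findSum could look like: it assumes table1 has columns ID, Date and Var1, and that the self-join sums Var1 over rows with the same ID dated within the three months up to and including each row's date. The question's actual SQL is not shown here, so the window and the function body are assumptions:

```python
import pandas as pd

table1 = pd.DataFrame({
    'Date': list(pd.date_range('2015-01-01', periods=5, freq='MS')) * 2,
    'ID': [1] * 5 + [2] * 5,
    'Var1': range(10),
})

def findSum(row, months=3):
    """Hypothetical helper: sum Var1 over same-ID rows in [row.Date - months, row.Date]."""
    start = row['Date'] - pd.DateOffset(months=months)
    mask = ((table1['ID'] == row['ID'])
            & (table1['Date'] >= start)
            & (table1['Date'] <= row['Date']))
    return table1.loc[mask, 'Var1'].sum()

table1['sum_var1'] = table1.apply(lambda row: findSum(row), axis=1)
print(table1)
```

This scans the whole table once per row, so it mirrors the self-join directly but scales quadratically with the number of rows.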
