Group by a column and a resampled date to get a rolling sum of another column

Posted 2024-03-29 10:58:24


I have the following data:

(Pdb) df1 = pd.DataFrame({'id': ['SE0000195570','SE0000195570','SE0000195570','SE0000195570','SE0000191827','SE0000191827','SE0000191827','SE0000191827', 'SE0000191827'],'val': ['1','2','3','4','5','6','7','8', '9'],'date': pd.to_datetime(['2014-10-23','2014-07-16','2014-04-29','2014-01-31','2018-10-19','2018-07-11','2018-04-20','2018-02-16','2018-12-29'])})

(Pdb) df1
             id val       date
0  SE0000195570   1 2014-10-23
1  SE0000195570   2 2014-07-16
2  SE0000195570   3 2014-04-29
3  SE0000195570   4 2014-01-31
4  SE0000191827   5 2018-10-19
5  SE0000191827   6 2018-07-11
6  SE0000191827   7 2018-04-20
7  SE0000191827   8 2018-02-16
8  SE0000191827   9 2018-12-29
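
Note that `val` is built from strings in the snippet above, so the column has a non-numeric dtype, and a rolling sum generally needs numeric input. A minimal sketch of converting it first, assuming the values are meant to be integers (the conversion step is an addition, not part of the original session):

```python
import pandas as pd

# Same sample data as above, written compactly.
df1 = pd.DataFrame({
    "id": ["SE0000195570"] * 4 + ["SE0000191827"] * 5,
    "val": ["1", "2", "3", "4", "5", "6", "7", "8", "9"],
    "date": pd.to_datetime([
        "2014-10-23", "2014-07-16", "2014-04-29", "2014-01-31",
        "2018-10-19", "2018-07-11", "2018-04-20", "2018-02-16", "2018-12-29",
    ]),
})

# Assumption: val should be numeric; pd.to_numeric parses the strings.
df1["val"] = pd.to_numeric(df1["val"])
print(df1.dtypes)
```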

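Update: Following @user3483203's suggestion I have made some further progress, but I am not quite there yet. I have amended the sample data above with a new row to illustrate the problem better.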

(Pdb) df2.assign(calc=(df2.dropna()['val'].groupby(level=0).rolling(4).sum().shift(-3).reset_index(0, drop=True)))
                                   id  val       date  calc
id           date                                          
SE0000191827 2018-02-28  SE0000191827    8 2018-02-16  26.0
             2018-03-31           NaN  NaN        NaT   NaN
             2018-04-30  SE0000191827    7 2018-04-20  27.0
             2018-05-31           NaN  NaN        NaT   NaN
             2018-06-30           NaN  NaN        NaT   NaN
             2018-07-31  SE0000191827    6 2018-07-11   NaN
             2018-08-31           NaN  NaN        NaT   NaN
             2018-09-30           NaN  NaN        NaT   NaN
             2018-10-31  SE0000191827    5 2018-10-19   NaN
             2018-11-30           NaN  NaN        NaT   NaN
             2018-12-31  SE0000191827    9 2018-12-29   NaN
SE0000195570 2014-01-31  SE0000195570    4 2014-01-31  10.0
             2014-02-28           NaN  NaN        NaT   NaN
             2014-03-31           NaN  NaN        NaT   NaN
             2014-04-30  SE0000195570    3 2014-04-29   NaN
             2014-05-31           NaN  NaN        NaT   NaN
             2014-06-30           NaN  NaN        NaT   NaN
             2014-07-31  SE0000195570    2 2014-07-16   NaN
             2014-08-31           NaN  NaN        NaT   NaN
             2014-09-30           NaN  NaN        NaT   NaN
             2014-10-31  SE0000195570    1 2014-10-23   NaN

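According to my requirements, the row (SE0000191827, 2018-03-31) should have a calculated value, because it is followed by four consecutive rows that have a value. At the moment that row is removed by the dropna call, and I can't work out how to solve this.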
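
One possible way around the dropna problem, sketched for a single id: compute the forward-looking sum on the observed rows only, align the result back onto the full monthly index, then back-fill so that a gap month takes the value of the next observed month. This assumes that is the intended value for a gap month (the question does not state it), and the series below is typed in by hand from the SE0000191827 rows rather than produced by resample.

```python
import pandas as pd

# Monthly index for one id, with NaN for months that have no observation.
idx = pd.date_range("2018-02-28", "2018-12-31", freq=pd.offsets.MonthEnd())
val = pd.Series([8, None, 7, None, None, 6, None, None, 5, None, 9], index=idx)

# Sum of each observed value and the next three observed values.
observed = val.dropna()
fwd = observed.rolling(4).sum().shift(-3)

# Put the sums back on the monthly index; gap months borrow the next
# observed month's sum (assumption about the desired behaviour).
calc = fwd.reindex(val.index).bfill()
print(pd.DataFrame({"val": val, "calc": calc}))
```

For several ids the same steps would need to run per group (for example with `groupby(level=0)` before the rolling, shift, and back-fill steps) so that values never leak from one id into another.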


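What I need

Calculation: The dates in my initial data are quarterly. However, I need to turn this data into monthly rows spanning the range between the first and last date of each id, and for each month compute the sum of the four closest consecutive rows of the input data within that id. That's a mouthful. This is what led me to resample. See the expected output below. I need the data grouped by id and by monthly date.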

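Performance: The data I am testing on now is only for benchmarking, but I need the solution to be performant. I expect to run it on more than 100,000 unique ids, which could result in roughly 10 million rows. (100,000 ids with dates going back 10 years: 10 years * 12 months = 120 months per id, and 100,000 * 120 = 12 million rows.)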
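
With row counts of that size, per-group Python callbacks tend to become the bottleneck. A sketch of one vectorized alternative for the four-row forward sum, using grouped cumulative sums instead of a rolling window: the sum of rows i..i+3 equals the cumulative sum at i+3 minus the cumulative sum at i-1. It assumes the rows are already sorted by id and date and contain only observed values; `obs`, `csum`, `upper`, and `lower` are illustrative names.

```python
import pandas as pd

# The row-count estimate from the paragraph above.
n_ids, years = 100_000, 10
n_rows = n_ids * years * 12
print(n_rows)

# Observed values only, sorted by id and then by date.
obs = pd.DataFrame({
    "id": ["SE0000191827"] * 5 + ["SE0000195570"] * 4,
    "val": [8, 7, 6, 5, 9, 4, 3, 2, 1],
})

# Running total within each id.
csum = obs.groupby("id")["val"].cumsum()

# Running total three rows ahead, and one row behind (0 at the group start).
upper = csum.groupby(obs["id"]).shift(-3)
lower = csum.groupby(obs["id"]).shift(1).fillna(0)

# Forward sum of four rows; NaN where fewer than four rows remain.
obs["calc"] = upper - lower
print(obs)
```

For floating-point data, differences of cumulative sums can accumulate rounding error, so this trade-off would need checking against the real values.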

What I have tried

(Pdb) res = df.groupby('id').resample('M',on='date')
(Pdb) res.first()
                                   id  val       date
id           date                                    
SE0000191827 2018-02-28  SE0000191827    8 2018-02-16
             2018-03-31           NaN  NaN        NaT
             2018-04-30  SE0000191827    7 2018-04-20
             2018-05-31           NaN  NaN        NaT
             2018-06-30           NaN  NaN        NaT
             2018-07-31  SE0000191827    6 2018-07-11
             2018-08-31           NaN  NaN        NaT
             2018-09-30           NaN  NaN        NaT
             2018-10-31  SE0000191827    5 2018-10-19
SE0000195570 2014-01-31  SE0000195570    4 2014-01-31
             2014-02-28           NaN  NaN        NaT
             2014-03-31           NaN  NaN        NaT
             2014-04-30  SE0000195570    3 2014-04-29
             2014-05-31           NaN  NaN        NaT
             2014-06-30           NaN  NaN        NaT
             2014-07-31  SE0000195570    2 2014-07-16
             2014-08-31           NaN  NaN        NaT
             2014-09-30           NaN  NaN        NaT
             2014-10-31  SE0000195570    1 2014-10-23

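This data looks like a good fit for my case, since it is nicely grouped by id and has the dates lined up by month. From here it seems I could use something like df['val'].rolling(4), make sure it skips the NaN values, and put the result in a new column.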
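
A small illustration of why the NaN rows matter here. With its default `min_periods`, a fixed window of 4 rows only yields a value when all four rows are non-NaN, so on the gapped monthly data every window comes out as NaN; dropping the NaN rows first makes the window span four observed values instead. The series is a hand-typed stand-in for one id's `val` column.

```python
import pandas as pd

# One id's monthly values, with NaN for months without an observation.
s = pd.Series([8, None, 7, None, None, 6, None, None, 5])

# Every 4-row window contains at least one NaN, so every sum is NaN.
with_gaps = s.rolling(4).sum()

# After dropping NaN the window covers four observed values.
dropped = s.dropna().rolling(4).sum()

print(with_gaps)
print(dropped)
```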

Expected output (new column calc):

                                   id  val       date  calc
id           date                                    
SE0000191827 2018-02-28  SE0000191827    8 2018-02-16    26
             2018-03-31           NaN  NaN        NaT
             2018-04-30  SE0000191827    7 2018-04-20   NaN
             2018-05-31           NaN  NaN        NaT
             2018-06-30           NaN  NaN        NaT
             2018-07-31  SE0000191827    6 2018-07-11   NaN
             2018-08-31           NaN  NaN        NaT
             2018-09-30           NaN  NaN        NaT
             2018-10-31  SE0000191827    5 2018-10-19   NaN
SE0000195570 2014-01-31  SE0000195570    4 2014-01-31    10
             2014-02-28           NaN  NaN        NaT
             2014-03-31           NaN  NaN        NaT
             2014-04-30  SE0000195570    3 2014-04-29   NaN
             2014-05-31           NaN  NaN        NaT
             2014-06-30           NaN  NaN        NaT
             2014-07-31  SE0000195570    2 2014-07-16   NaN
             2014-08-31           NaN  NaN        NaT
             2014-09-30           NaN  NaN        NaT
             2014-10-31  SE0000195570    1 2014-10-23   NaN
             2014-11-30           NaN  NaN        NaT
             2014-12-31  SE0000195570    1 2014-10-23   NaN

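Here the result in calc is 26 because it adds the row's own value and the next three (8+7+6+5). The remaining rows are NaN because four values are not available.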

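The problem

Although the data looks as if it is grouped by id and date, it actually seems to be grouped by date only. I don't understand how this works. I need the data grouped by id and date.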

(Pdb) res['val'].get_group(datetime.date(2018,2,28))
7    6.730000e+08
Name: val, dtype: object
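
One way to check how the result is organized is to materialize it and inspect the index, rather than probing the resampler object with get_group. The sketch below assumes a numeric `val` and uses `set_index('date')` with a `pd.offsets.MonthEnd()` rule instead of `resample('M', on='date')`, partly because the month-end alias differs between pandas versions; both choices are substitutions, not the original call.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["SE0000195570"] * 4 + ["SE0000191827"] * 5,
    "val": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "date": pd.to_datetime([
        "2014-10-23", "2014-07-16", "2014-04-29", "2014-01-31",
        "2018-10-19", "2018-07-11", "2018-04-20", "2018-02-16", "2018-12-29",
    ]),
})

# Materialize the resampled values as a Series.
out = (
    df.sort_values("date")
      .set_index("date")
      .groupby("id")["val"]
      .resample(pd.offsets.MonthEnd())
      .first()
)

# The index levels show what the rows are keyed by.
print(out.index.names)

# Select all monthly rows belonging to a single id.
one_id = out.loc["SE0000191827"]
print(one_id)
```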

The resample call above returns a DatetimeIndexResamplerGroupby, which has no rolling method:

(Pdb) res['val'].rolling(4)
*** AttributeError: 'DatetimeIndexResamplerGroupby' object has no attribute 'rolling'
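
The resampler appears to be a deferred object, much like a GroupBy: window methods become available once an aggregation such as `.first()` has turned it into an ordinary Series. A sketch of that sequence under the same assumptions as before (numeric `val`, `set_index('date')`, and a `MonthEnd` offset in place of `'M'`); note that it computes a trailing window, whereas the update near the top shifts the result by -3 to look forward.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["SE0000195570"] * 4 + ["SE0000191827"] * 5,
    "val": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "date": pd.to_datetime([
        "2014-10-23", "2014-07-16", "2014-04-29", "2014-01-31",
        "2018-10-19", "2018-07-11", "2018-04-20", "2018-02-16", "2018-12-29",
    ]),
})

# Step 1: aggregate the resampler into a Series indexed by (id, date).
monthly = (
    df.sort_values("date")
      .set_index("date")
      .groupby("id")["val"]
      .resample(pd.offsets.MonthEnd())
      .first()
)

# Step 2: keep observed months only, then roll within each id.
observed = monthly.dropna()
rolled = observed.groupby(level="id").transform(lambda s: s.rolling(4).sum())
print(rolled)
```

The lambda runs once per id, which is convenient for a small sample but may be slow with very many ids.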

What should I do? I suspect my approach is wrong, but after going through the documentation I don't know where to start.

