如何在python Pandas中进行条件联接?

2024-04-27 17:25:35 发布

您现在位置:Python中文网/ 问答频道 /正文

我试图根据存储在单独表中的日期值计算熊猫中基于时间的聚合。

第一张桌子的顶部是这样的:

    COMPANY_ID  DATE            MEASURE
    1   2010-01-01 00:00:00     10
    1   2010-01-02 00:00:00     10
    1   2010-01-03 00:00:00     10
    1   2010-01-04 00:00:00     10
    1   2010-01-05 00:00:00     10

下面是创建表的代码:

    table_a = pd.concat(\
    [pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),\
    'COMPANY_ID': 1 , 'MEASURE': 10}),\
    pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),\
    'COMPANY_ID': 2 , 'MEASURE': 10})])

第二张桌子,桌子b看起来是这样的:

        COMPANY     END_DATE
        1   2010-03-01 00:00:00
        1   2010-06-02 00:00:00
        2   2010-03-01 00:00:00
        2   2010-06-02 00:00:00

创建它的代码是:

    table_b = pd.DataFrame({'END_DATE':pd.to_datetime(['03/01/2010','06/02/2010','03/01/2010','06/02/2010']),\
                    'COMPANY':(1,1,2,2)})

我希望能够在表b中的结束日期之前的每30天内获取每个公司ID的度量值列的总和

这(我认为)相当于SQL:

      select
 b.COMPANY_ID,
 b.DATE
 sum(a.MEASURE) AS MEASURE_TO_END_DATE
 from table_a a, table_b b
 where a.COMPANY = b.COMPANY and
       a.DATE < b.DATE and
       a.DATE > b.DATE - 30  
 group by b.COMPANY;

谢谢你的帮助


Tags: and代码iddataframedate时间tablerange
1条回答
网友
1楼 · 发布于 2024-04-27 17:25:35

好吧,我可以想出一些办法。(1) 基本上是在company上合并来放大数据帧,然后在合并后的30天窗口上过滤。这应该很快,但可能会占用大量内存。(2) 将30天窗口上的合并和筛选移动到groupby中。这将导致每个组的合并,因此速度会慢一些,但应该使用较少的内存

选项1

假设您的数据如下所示(我扩展了您的示例数据):

print df

    company       date  measure
0         0 2010-01-01       10
1         0 2010-01-15       10
2         0 2010-02-01       10
3         0 2010-02-15       10
4         0 2010-03-01       10
5         0 2010-03-15       10
6         0 2010-04-01       10
7         1 2010-03-01        5
8         1 2010-03-15        5
9         1 2010-04-01        5
10        1 2010-04-15        5
11        1 2010-05-01        5
12        1 2010-05-15        5

print windows

   company   end_date
0        0 2010-02-01
1        0 2010-03-15
2        1 2010-04-01
3        1 2010-05-15

为30天窗口创建开始日期:

windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                       np.timedelta64(30,'D'))
print windows

   company   end_date   beg_date
0        0 2010-02-01 2010-01-02
1        0 2010-03-15 2010-02-13
2        1 2010-04-01 2010-03-02
3        1 2010-05-15 2010-04-15

现在进行合并,然后根据date是否在beg_dateend_date范围内进行选择:

df = df.merge(windows,on='company',how='left')
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df

    company       date  measure   end_date   beg_date
2         0 2010-01-15       10 2010-02-01 2010-01-02
4         0 2010-02-01       10 2010-02-01 2010-01-02
7         0 2010-02-15       10 2010-03-15 2010-02-13
9         0 2010-03-01       10 2010-03-15 2010-02-13
11        0 2010-03-15       10 2010-03-15 2010-02-13
16        1 2010-03-15        5 2010-04-01 2010-03-02
18        1 2010-04-01        5 2010-04-01 2010-03-02
21        1 2010-04-15        5 2010-05-15 2010-04-15
23        1 2010-05-01        5 2010-05-15 2010-04-15
25        1 2010-05-15        5 2010-05-15 2010-04-15

您可以通过对companyend_date分组来计算30天的窗口和:

print df.groupby(['company','end_date']).sum()

                    measure
company end_date           
0       2010-02-01       20
        2010-03-15       30
1       2010-04-01       10
        2010-05-15       15

选项#2将所有合并移动到groupby中。这对记忆应该更好,但我想得慢得多:

windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                       np.timedelta64(30,'D'))

def cond_merge(g,windows):
    g = g.merge(windows,on='company',how='left')
    g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
    return g.groupby('end_date')['measure'].sum()

print df.groupby('company').apply(cond_merge,windows)

company  end_date  
0        2010-02-01    20
         2010-03-15    30
1        2010-04-01    10
         2010-05-15    15

另一个选项现在,如果您的窗口从未重叠(如示例数据中所示),您可以执行以下操作,作为不会炸毁数据帧但速度非常快的替代方法:

windows['date'] = windows['end_date']

df = df.merge(windows,on=['company','date'],how='outer')
print df

    company       date  measure   end_date
0         0 2010-01-01       10        NaT
1         0 2010-01-15       10        NaT
2         0 2010-02-01       10 2010-02-01
3         0 2010-02-15       10        NaT
4         0 2010-03-01       10        NaT
5         0 2010-03-15       10 2010-03-15
6         0 2010-04-01       10        NaT
7         1 2010-03-01        5        NaT
8         1 2010-03-15        5        NaT
9         1 2010-04-01        5 2010-04-01
10        1 2010-04-15        5        NaT
11        1 2010-05-01        5        NaT
12        1 2010-05-15        5 2010-05-15

此合并实质上是将窗口结束日期插入到数据框中,然后(按组)填充结束日期,这将为您提供一个结构,以便轻松创建汇总窗口:

df['end_date'] = df.groupby('company')['end_date'].apply(lambda x: x.bfill())

print df

    company       date  measure   end_date
0         0 2010-01-01       10 2010-02-01
1         0 2010-01-15       10 2010-02-01
2         0 2010-02-01       10 2010-02-01
3         0 2010-02-15       10 2010-03-15
4         0 2010-03-01       10 2010-03-15
5         0 2010-03-15       10 2010-03-15
6         0 2010-04-01       10        NaT
7         1 2010-03-01        5 2010-04-01
8         1 2010-03-15        5 2010-04-01
9         1 2010-04-01        5 2010-04-01
10        1 2010-04-15        5 2010-05-15
11        1 2010-05-01        5 2010-05-15
12        1 2010-05-15        5 2010-05-15

df = df[df.end_date.notnull()]
df['beg_date'] = (df['end_date'].values.astype('datetime64[D]') -
                   np.timedelta64(30,'D'))

print df

   company       date  measure   end_date   beg_date
0         0 2010-01-01       10 2010-02-01 2010-01-02
1         0 2010-01-15       10 2010-02-01 2010-01-02
2         0 2010-02-01       10 2010-02-01 2010-01-02
3         0 2010-02-15       10 2010-03-15 2010-02-13
4         0 2010-03-01       10 2010-03-15 2010-02-13
5         0 2010-03-15       10 2010-03-15 2010-02-13
7         1 2010-03-01        5 2010-04-01 2010-03-02
8         1 2010-03-15        5 2010-04-01 2010-03-02
9         1 2010-04-01        5 2010-04-01 2010-03-02
10        1 2010-04-15        5 2010-05-15 2010-04-15
11        1 2010-05-01        5 2010-05-15 2010-04-15
12        1 2010-05-15        5 2010-05-15 2010-04-15

df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df.groupby(['company','end_date']).sum()

                    measure
company end_date           
0       2010-02-01       20
        2010-03-15       30
1       2010-04-01       10
        2010-05-15       15

另一种方法是将第一个数据帧重新采样为每日数据,然后使用30天的窗口计算滚动和;并在您感兴趣的结尾选择日期。这也可能是相当密集的记忆。

相关问题 更多 >