Using a boolean mask on a pandas Panel

7 votes
1 answer
8400 views
Asked 2025-04-17 14:27

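I am having some trouble masking a Panel the way I would mask a DataFrame. I think this should be simple, but I could not find a way to do it in the documentation or on online forums. Here is a simple example: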

import pandas
import numpy as np
import datetime
start_date = datetime.datetime(2009,3,1,6,29,59)
r = pandas.date_range(start_date, periods=12)
cols_1 = ['AAPL', 'AAPL', 'GOOG', 'GOOG', 'GS', 'GS']
cols_2 = ['close', 'rate', 'close', 'rate', 'close', 'rate']
dat = np.random.randn(12, 6)

dftst = pandas.DataFrame(dat, columns=pandas.MultiIndex.from_arrays([cols_1, cols_2], names=['ticker','field']), index=r)
pn = dftst.T.to_panel().transpose(2,0,1)
print pn

Out[14]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 12 (major_axis) x 3 (minor_axis)
Items axis: close to rate
Major_axis axis: 2009-03-01 06:29:59 to 2009-03-12 06:29:59
Minor_axis axis: AAPL to GS

Now that I have a Panel object, slicing along the items axis gives me a DataFrame.

close_p = pn['close']
print close_p

Out[16]: 
ticker                   AAPL      GOOG        GS
2009-03-01 06:29:59 -0.082203 -0.286354  1.227193
2009-03-02 06:29:59  0.340005 -0.688933 -1.505137
2009-03-03 06:29:59 -0.525567  0.321858 -0.035047
2009-03-04 06:29:59 -0.123549 -0.841781 -0.616523
2009-03-05 06:29:59 -0.407504  0.188372  1.311262
2009-03-06 06:29:59  0.272883  0.817179  0.584664
2009-03-07 06:29:59 -1.767227  1.168876  0.443096
2009-03-08 06:29:59 -0.685501 -0.534373 -0.063906
2009-03-09 06:29:59  0.851820  0.068740  0.566537
2009-03-10 06:29:59  0.390678 -0.012422 -0.152375
2009-03-11 06:29:59 -0.985585 -0.917705 -0.585091
2009-03-12 06:29:59  0.067498 -0.764343  0.497270

I can filter this data in two ways:

1) I create a mask and apply it to the data like this:

msk = close_p > 0
close_p = close_p.mask(msk)

2) I can also slice directly with the boolean expression used to build the mask above.

close_p = close_p[close_p > 0]
Out[28]: 
ticker                   AAPL      GOOG        GS
2009-03-01 06:29:59       NaN       NaN  1.227193
2009-03-02 06:29:59  0.340005       NaN       NaN
2009-03-03 06:29:59       NaN  0.321858       NaN
2009-03-04 06:29:59       NaN       NaN       NaN
2009-03-05 06:29:59       NaN  0.188372  1.311262
2009-03-06 06:29:59  0.272883  0.817179  0.584664
2009-03-07 06:29:59       NaN  1.168876  0.443096
2009-03-08 06:29:59       NaN       NaN       NaN
2009-03-09 06:29:59  0.851820  0.068740  0.566537
2009-03-10 06:29:59  0.390678       NaN       NaN
2009-03-11 06:29:59       NaN       NaN       NaN
2009-03-12 06:29:59  0.067498       NaN  0.497270
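
A note on the two approaches: they are not interchangeable. `DataFrame.mask` replaces values where the condition is True, while indexing with a boolean frame (like `DataFrame.where`) keeps them, which is why the output above shows only the positive values. Here is a minimal sketch of the difference; the small frame and its values are made up for illustration and only stand in for `close_p`.

```python
import numpy as np
import pandas as pd

# Small hand-written frame standing in for close_p (illustrative values only).
close_p = pd.DataFrame(
    {"AAPL": [-0.08, 0.34], "GOOG": [-0.29, -0.69], "GS": [1.23, -1.51]},
    index=pd.date_range("2009-03-01", periods=2),
)
msk = close_p > 0

hidden = close_p.mask(msk)   # NaN where msk is True: positive values are removed
kept = close_p.where(msk)    # NaN where msk is False: positive values are kept
sliced = close_p[msk]        # indexing with a boolean frame behaves like .where

print(hidden)
print(kept)
```

So approach 1 hides the positive values and approach 2 hides everything else; use `close_p.mask(~msk)` or `close_p.where(msk)` if both should give the same result.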

What I cannot figure out is how to filter all of my data with a mask without using a loop. I can do this:

msk = (pn['rate'] > 0) & (pn['close'] > 0)
def mask_panel(pan, msk):
    for item in pan.items:
        pan[item] = pan[item].mask(msk)
    return pan
print pn['close']

Out[32]: 
ticker                   AAPL      GOOG        GS
2009-03-01 06:29:59 -0.082203 -0.286354  1.227193
2009-03-02 06:29:59  0.340005 -0.688933 -1.505137
2009-03-03 06:29:59 -0.525567  0.321858 -0.035047
2009-03-04 06:29:59 -0.123549 -0.841781 -0.616523
2009-03-05 06:29:59 -0.407504  0.188372  1.311262
2009-03-06 06:29:59  0.272883  0.817179  0.584664
2009-03-07 06:29:59 -1.767227  1.168876  0.443096
2009-03-08 06:29:59 -0.685501 -0.534373 -0.063906
2009-03-09 06:29:59  0.851820  0.068740  0.566537
2009-03-10 06:29:59  0.390678 -0.012422 -0.152375
2009-03-11 06:29:59 -0.985585 -0.917705 -0.585091
2009-03-12 06:29:59  0.067498 -0.764343  0.497270

mask_panel(pn, msk)

print pn['close']

Out[34]: 
ticker                   AAPL      GOOG        GS
2009-03-01 06:29:59 -0.082203 -0.286354       NaN
2009-03-02 06:29:59       NaN -0.688933 -1.505137
2009-03-03 06:29:59 -0.525567       NaN -0.035047
2009-03-04 06:29:59 -0.123549 -0.841781 -0.616523
2009-03-05 06:29:59 -0.407504       NaN       NaN
2009-03-06 06:29:59       NaN       NaN       NaN
2009-03-07 06:29:59 -1.767227       NaN       NaN
2009-03-08 06:29:59 -0.685501 -0.534373 -0.063906
2009-03-09 06:29:59       NaN       NaN       NaN
2009-03-10 06:29:59       NaN -0.012422 -0.152375
2009-03-11 06:29:59 -0.985585 -0.917705 -0.585091
2009-03-12 06:29:59       NaN -0.764343       NaN

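The loop above works. I know there is a faster way to do this with an ndarray, but I have not worked it out yet. This also seems like something that should be built into pandas. If I have missed a method, any suggestions would be much appreciated.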

1 Answer

9 votes

I think this works (and it is what Panel.where should do), but it is a bit involved because it has to handle a lot of different cases.

# assumes `import numpy as np` and `import pandas as pd`, with pn built as in the question
# construct the mask in 2-d (a frame)
In [36]: mask = (pn['close']>0) & (pn['rate']>0)

In [37]: mask
Out[37]: 
ticker                AAPL   GOOG     GS
2009-03-01 06:29:59  False  False  False
2009-03-02 06:29:59  False  False   True
....

# here's the key, this broadcasts, setting the values which 
# don't meet the condition to nan
In [38]: masked_values = np.where(mask,pn.values,np.nan)

# reconstruct the panel (_construct_axes_dict is an internal function that returns
# a dict of the axes, e.g. items -> the items, major_axis -> .....)
In [42]: x = pd.Panel(masked_values,**pn._construct_axes_dict())

In [43]: x
Out[43]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 12 (major_axis) x 3 (minor_axis)
Items axis: close to rate
Major_axis axis: 2009-03-01 06:29:59 to 2009-03-12 06:29:59
Minor_axis axis: AAPL to GS

# the values
In [44]: x.values
Out[44]: 
array([[[        nan,         nan,         nan],
    [        nan,         nan,  0.09575723],
    [        nan,         nan,         nan],
    [        nan,         nan,         nan],
    [        nan,  2.07229823,  0.04347515],
    [        nan,         nan,         nan],
    [        nan,         nan,  2.18342239],
    [        nan,         nan,  1.73674381],
    [        nan,  2.01173087,         nan],
    [ 0.24109645,  0.94583072,         nan],
    [ 0.36953467,         nan,  0.18044432],
    [ 1.74164222,  1.02314752,  1.73736033]],

   [[        nan,         nan,         nan],
    [        nan,         nan,  0.06960387],
    [        nan,         nan,         nan],
    [        nan,         nan,         nan],
    [        nan,  0.63202199,  0.56724391],
    [        nan,         nan,         nan],
    [        nan,         nan,  0.71964824],
    [        nan,         nan,  1.03482927],
    [        nan,  0.18256148,         nan],
    [ 1.29451667,  0.49804327,         nan],
    [ 2.04726538,         nan,  0.12883128],
    [ 0.70647885,  0.7277734 ,  0.77844475]]])
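
Since `Panel` was deprecated in pandas 0.20 and removed in 0.25, the session above cannot be run on a current install. The broadcasting step itself only needs NumPy, so here is a rough sketch of the same idea without `Panel`. The names `values`, `cond`, `frames` and `result` are mine, the data is random, and holding the result in a column-wise concatenation of per-item DataFrames is just one possible stand-in for a Panel, not an official replacement.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
index = pd.date_range("2009-03-01 06:29:59", periods=12)
tickers = ["AAPL", "GOOG", "GS"]
fields = ["close", "rate"]

# 3-D block shaped (items, major_axis, minor_axis), like pn.values above.
values = rng.standard_normal((len(fields), len(index), len(tickers)))

# 2-D condition shaped (major_axis, minor_axis): both fields are positive.
cond = (values[0] > 0) & (values[1] > 0)

# Broadcasting stretches the 2-D condition across the leading items axis,
# so every item is filtered by the same condition in one call.
masked = np.where(cond, values, np.nan)

# One DataFrame per item, joined side by side under an outer "field" level.
frames = {
    field: pd.DataFrame(masked[i], index=index, columns=tickers)
    for i, field in enumerate(fields)
}
result = pd.concat(frames, axis=1)

print(result["close"].head())
```

Note that `np.where(cond, values, np.nan)` keeps the entries where the condition holds. The `mask_panel` loop in the question does the opposite (it blanks those entries), so swap the last two arguments, `np.where(cond, np.nan, values)`, to reproduce the loop's output.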
