Masking Pandas DataFrame rows based on the whole row's data

3 votes

2 answers

6254 views

Asked 2025-04-18 07:16

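Background:

I'm working with 8-band multispectral satellite imagery and estimating water depth from the reflectance values. Using statsmodels, I built an ordinary least squares (OLS) model that predicts the depth for each pixel from its 8 reflectance values. To work with the OLS model easily, I put all the pixel reflectance values into a pandas dataframe formatted like the example below, where each row represents a pixel and each column is a spectral band of the multispectral image.

Due to some preprocessing steps, all the onshore pixels have been set to all zeros. I don't want to predict the "depth" of those pixels, so I want to restrict the OLS model's predictions to the rows that are not all zeros.

I will need to reshape the results back to the row x column dimensions of the original image, so I can't simply drop the all-zero rows.

Specific question:

I have a Pandas dataframe. Some rows contain only zeros. I want to mask those rows for some calculations, but I need to keep them in place. I can't figure out how to mask every entry in the rows that are all zeros.

For example: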

In [1]: import pandas as pd
In [2]: import numpy as np
        # my actual data has about 16 million rows so
        # I'll simulate some data for the example. 
In [3]: cols = ['band1','band2','band3','band4','band5','band6','band7','band8']
In [4]: rdf = pd.DataFrame(np.random.randint(0,10,80).reshape(10,8),columns=cols)
In [5]: zdf = pd.DataFrame(np.zeros( (3,8) ),columns=cols)
In [6]: df = pd.concat((rdf,zdf)).reset_index(drop=True)
In [7]: df
Out[7]: 
        band1  band2  band3  band4  band5  band6  band7  band8
    0       9      9      8      7      2      7      5      6
    1       7      7      5      6      3      0      9      8
    2       5      4      3      6      0      3      8      8
    3       6      4      5      0      5      7      4      5
    4       8      3      2      4      1      3      2      5
    5       9      7      6      3      8      7      8      4
    6       6      2      8      2      2      6      9      8
    7       9      4      0      2      7      6      4      8
    8       1      3      5      3      3      3      0      1
    9       4      2      9      7      3      5      5      0
    10      0      0      0      0      0      0      0      0
    11      0      0      0      0      0      0      0      0
    12      0      0      0      0      0      0      0      0

    [13 rows x 8 columns]

I know I can get the rows I'm interested in like this:

In [8]: df[df.any(axis=1)==True]
Out[8]: 
       band1  band2  band3  band4  band5  band6  band7  band8
    0      9      9      8      7      2      7      5      6
    1      7      7      5      6      3      0      9      8
    2      5      4      3      6      0      3      8      8
    3      6      4      5      0      5      7      4      5
    4      8      3      2      4      1      3      2      5
    5      9      7      6      3      8      7      8      4
    6      6      2      8      2      2      6      9      8
    7      9      4      0      2      7      6      4      8
    8      1      3      5      3      3      3      0      1
    9      4      2      9      7      3      5      5      0

   [10 rows x 8 columns]

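But I need to reshape the data later, so those rows have to stay in their original positions. I've tried a number of things, including df.where(df.any(axis=1)==True), but I can't find anything that works.

Fails:

1. df.any(axis=1)==True gives me True for the rows I'm interested in and False for the rows I'd like to mask, but when I try df.where(df.any(axis=1)==True) I just get back the whole dataframe, complete with all the zeros. I want the whole dataframe with the values in the all-zero rows masked, so as I understand it they should show up as NaN, right?
2. I tried getting the indexes of the all-zero rows and masking by row: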
```
mskidxs = df[df.any(axis=1)==False].index
df.mask(df.index.isin(mskidxs))
```
But that didn't work either. It gave me:
```
ValueError: Array conditional must be same shape as self
```
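The .index just returns an Int64Index. I need a boolean array with the same dimensions as my dataframe, and I can't figure out how to get one.

Thanks in advance for your help.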

-Jared


2 Answers

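I'm not sure I understand why you can't just run the calculation on a subset of the rows:

np.average(df[1][:10])

That leaves out the rows whose values are all zero.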

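Or you can run the calculation on a subset of the data and then put the computed values back into the original dataframe:

dfs = df[:10].copy()  # copy so the new column is not written to a view of df
dfs['1_deviation_from_mean'] = (dfs[1] - dfs[1].mean()).abs()
df['deviation_from_mean'] = dfs['1_deviation_from_mean']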
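
Applied to the frame in the question, the same "compute on a subset, then write back" idea might look like the sketch below. It selects rows with a boolean Series instead of a hard-coded slice, so the all-zero rows can sit anywhere. The names `coefs`, `intercept`, `depth` and the 3 x 4 image shape are assumptions made up for illustration; the linear combination is only a stand-in for the fitted statsmodels OLS model.

```python
import numpy as np
import pandas as pd

cols = ['band%d' % i for i in range(1, 9)]
rng = np.random.default_rng(0)
# 10 "water" pixels with values 1..9 (never all zero) plus 2 all-zero "land" pixels
water = pd.DataFrame(rng.integers(1, 10, size=(10, 8)), columns=cols)
land = pd.DataFrame(np.zeros((2, 8), dtype=int), columns=cols)
df = pd.concat((water, land)).reset_index(drop=True)

# True for rows with at least one non-zero band
keep = df.any(axis=1)

# Hypothetical stand-in for the fitted OLS model: a plain linear combination
coefs = np.linspace(0.1, 0.8, 8)
intercept = -1.0

# Start with NaN everywhere, then fill in only the rows that were kept
depth = pd.Series(np.nan, index=df.index)
depth.loc[keep] = df.loc[keep, cols].to_numpy() @ coefs + intercept

# Row order is unchanged, so the result reshapes straight back to the image grid
depth_image = depth.to_numpy().reshape(3, 4)  # hypothetical 3 x 4 image
```

Because nothing is dropped, the skipped pixels simply remain NaN in `depth_image`.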

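Alternatively, you can build a list of the index labels you want to mask and do the calculation with a numpy masked array, using np.ma.masked_where() and specifying the index positions to mask: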

row_for_mask = [row for row in df.index if all(df.loc[row] == 0)]
masked_array = np.ma.masked_where(df[1].index.isin(row_for_mask), df[1])
np.mean(masked_array)

The masked array looks like this:

Name: 1, dtype: float64(data =
0      5
1      0
2      0
3      4
4      4
5      4
6      3
7      1
8      0
9      9
10    --
11    --
12    --
Name: 1, dtype: object,
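
To mask whole rows across every column at once, rather than one column at a time, a 2-D mask can be handed to numpy directly. This is a minimal sketch on a small made-up frame laid out like the one in the question; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

cols = ['band%d' % i for i in range(1, 9)]
df = pd.DataFrame(np.arange(1, 41).reshape(5, 8), columns=cols)
df.iloc[[1, 4]] = 0          # make rows 1 and 4 all zero

arr = df.to_numpy()
zero_rows = ~arr.any(axis=1)                       # shape (5,), True for all-zero rows
# Repeat the row flags across the columns so the mask has the same shape as arr
mask = np.repeat(zero_rows[:, None], arr.shape[1], axis=1)
marr = np.ma.masked_array(arr, mask=mask)

band_means = marr.mean(axis=0)   # per-band mean over the unmasked rows only
```

Reductions such as `mean` skip the masked entries, so the all-zero rows do not drag the per-band statistics toward zero.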

Answered 2025-04-18 by Python大师


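The process of clarifying my question led me, in a roundabout way, to the answer. This question also helped point me in the right direction. Here's what I came up with: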

import numpy as np
import pandas as pd
# Set up my fake test data again. My actual data is described
# in the question.
cols = ['band1','band2','band3','band4','band5','band6','band7','band8']
rdf = pd.DataFrame(np.random.randint(0,10,80).reshape(10,8),columns=cols)
zdf = pd.DataFrame(np.zeros( (3,8) ),columns=cols)
df = pd.concat((zdf,rdf)).reset_index(drop=True)

# View the dataframe. (sorry about the alignment, I don't
# want to spend the time putting in all the spaces)
df

    band1   band2   band3   band4   band5   band6   band7   band8
0   0   0   0   0   0   0   0   0
1   0   0   0   0   0   0   0   0
2   0   0   0   0   0   0   0   0
3   6   3   7   0   1   7   1   8
4   9   2   6   8   7   1   4   3
5   4   2   1   1   3   2   1   9
6   5   3   8   7   3   7   5   2
7   8   2   6   0   7   2   0   7
8   1   3   5   0   7   3   3   5
9   1   8   6   0   1   5   7   7
10  4   2   6   2   2   2   4   9
11  8   7   8   0   9   3   3   0
12  6   1   6   8   2   0   2   5

13 rows × 8 columns

# This is essentially the same as item #2 under Fails
# in my question. It gives me the indexes of the rows
# I want unmasked as True and those I want masked as
# False. However, the result is not the right shape to
# use as a mask.
df.apply( lambda row: any([i != 0 for i in row]),axis=1 )
0     False
1     False
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
dtype: bool

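# This is what actually works. By asking apply to broadcast
# the result (broadcast=True in the pandas of the time, spelled
# result_type='broadcast' in current pandas), I get a result
# that's the right shape to use.
land_rows = df.apply( lambda row: any([i != 0 for i in row]),axis=1,
                      result_type='broadcast' )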

land_rows

Out[92]:
    band1   band2   band3   band4   band5   band6   band7   band8
0   0   0   0   0   0   0   0   0
1   0   0   0   0   0   0   0   0
2   0   0   0   0   0   0   0   0
3   1   1   1   1   1   1   1   1
4   1   1   1   1   1   1   1   1
5   1   1   1   1   1   1   1   1
6   1   1   1   1   1   1   1   1
7   1   1   1   1   1   1   1   1
8   1   1   1   1   1   1   1   1
9   1   1   1   1   1   1   1   1
10  1   1   1   1   1   1   1   1
11  1   1   1   1   1   1   1   1
12  1   1   1   1   1   1   1   1

13 rows × 8 columns

# This produces the result I was looking for (the cast is there
# because where() expects a boolean condition):
df.where(land_rows.astype(bool))

Out[93]:
    band1   band2   band3   band4   band5   band6   band7   band8
0   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
1   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
3   6   3   7   0   1   7   1   8
4   9   2   6   8   7   1   4   3
5   4   2   1   1   3   2   1   9
6   5   3   8   7   3   7   5   2
7   8   2   6   0   7   2   0   7
8   1   3   5   0   7   3   3   5
9   1   8   6   0   1   5   7   7
10  4   2   6   2   2   2   4   9
11  8   7   8   0   9   3   3   0
12  6   1   6   8   2   0   2   5

13 rows × 8 columns

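Thanks again to those who helped. Hopefully the solution I found will be useful to somebody at some point.

I also found a different way to do the same thing. It involves more steps but, according to %timeit, it is about 9 times faster. Here it is: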

def mask_all_zero_rows_numpy(df):
    """
    Take a dataframe, find all the rows that contain only zeros
    and mask them. Return a dataframe of the same shape with all
    Nan rows in place of the all zero rows.
    """
    no_data = -99
    arr = df.to_numpy().astype(np.int16)
    # make a row full of the 'no data' value
    replacement_row = np.full(arr.shape[1], no_data, dtype=np.int16)
    # find out what rows are all zeros
    mask_rows = ~arr.any(axis=1)
    # replace those all zero rows with all 'no_data' rows
    arr[mask_rows] = replacement_row
    # create a masked array with the no_data value masked
    marr = np.ma.masked_where(arr==no_data,arr)
    # turn masked array into a data frame
    mdf = pd.DataFrame(marr,columns=df.columns)
    return mdf

Calling mask_all_zero_rows_numpy(df) should give the same result as Out[93]: above.
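
For comparison, the same masking can probably be written with pandas alone, without a row-wise apply or a sentinel value. The sketch below is an assumption-laden alternative, not a measured one: the helper name `mask_all_zero_rows` is made up here, and it has not been timed against either approach above.

```python
import numpy as np
import pandas as pd

def mask_all_zero_rows(df):
    """Return a float copy of df with every all-zero row replaced by NaN."""
    all_zero = df.eq(0).all(axis=1)      # one boolean per row
    out = df.astype(float)               # astype returns a copy; float can hold NaN
    out.loc[all_zero, :] = np.nan
    return out

# Same layout as the test data above: 3 all-zero rows followed by 10 data rows
cols = ['band1','band2','band3','band4','band5','band6','band7','band8']
rng = np.random.default_rng(0)
rdf = pd.DataFrame(rng.integers(1, 10, size=(10, 8)), columns=cols)
zdf = pd.DataFrame(np.zeros((3, 8)), columns=cols)
df = pd.concat((zdf, rdf)).reset_index(drop=True)

masked = mask_all_zero_rows(df)
```

Unlike the sentinel-based version, this does not cast to int16 and cannot accidentally mask a genuine value that happens to equal the 'no data' marker.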