Logical operators for boolean indexing in Pandas

3 answers

User

Answer 1 · edited 2024-05-15 02:13:52


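It is important to realize that you cannot use any of the Python logical operators (and, or, not) on a pandas.Series or pandas.DataFrame (similarly, you cannot use them on a numpy.array with more than one element). The reason is that they implicitly call bool on their operands, which raises an exception because these data structures have decided that the truth value of an array is ambiguous: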

>>> import numpy as np
>>> import pandas as pd
>>> arr = np.array([1,2,3])
>>> s = pd.Series([1,2,3])
>>> df = pd.DataFrame([1,2,3])
>>> bool(arr)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
>>> bool(s)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> bool(df)
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I have covered this more extensively elsewhere.
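To see that all three Python keywords fail in the same way, here is a minimal sketch that tries and, or and not on each container and records what happens. The dictionary names (objects, checks, outcomes) are only illustrative and not part of any library.

```python
import numpy as np
import pandas as pd

# Illustrative containers; any multi-element array, Series or DataFrame behaves the same.
objects = {
    "ndarray": np.array([1, 2, 3]),
    "Series": pd.Series([1, 2, 3]),
    "DataFrame": pd.DataFrame([1, 2, 3]),
}

# Each keyword implicitly calls bool() on its operand.
checks = {
    "and": lambda x: x and x,
    "or": lambda x: x or x,
    "not": lambda x: not x,
}

outcomes = {}
for obj_name, obj in objects.items():
    for op_name, check in checks.items():
        try:
            check(obj)
            outcomes[(obj_name, op_name)] = "ok"
        except ValueError:
            outcomes[(obj_name, op_name)] = "ValueError"

print(outcomes)
```

Every combination should end up as "ValueError", because the failure comes from the implicit bool() call and not from the operator itself.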

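NumPy's logical functions

However, NumPy provides element-wise equivalents of these operators as functions that can be used on numpy.array, pandas.Series, pandas.DataFrame, or any other (conforming) numpy.array subclass:

and has np.logical_and
or has np.logical_or
not has np.logical_not
np.logical_xor has no Python equivalent, but it is a logical "exclusive or" operation

So, essentially, one should use (assuming df1 and df2 are pandas DataFrames):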

np.logical_and(df1, df2)
np.logical_or(df1, df2)
np.logical_not(df1)
np.logical_xor(df1, df2)
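As a quick illustration of those four calls, the sketch below builds two tiny boolean DataFrames (the contents of df1 and df2 are made up for this example) and applies each function. It assumes a reasonably recent pandas, where NumPy ufuncs applied to DataFrames return DataFrames.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs covering all four True/False combinations.
df1 = pd.DataFrame({"a": [True, True, False, False]})
df2 = pd.DataFrame({"a": [True, False, True, False]})

both = np.logical_and(df1, df2)         # element-wise "and"
either = np.logical_or(df1, df2)        # element-wise "or"
negated = np.logical_not(df1)           # element-wise "not"
exactly_one = np.logical_xor(df1, df2)  # element-wise "exclusive or"

print(both["a"].tolist())         # [True, False, False, False]
print(either["a"].tolist())       # [True, True, True, False]
print(negated["a"].tolist())      # [False, False, True, True]
print(exactly_one["a"].tolist())  # [False, True, True, False]
```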

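Bitwise functions and bitwise operators for booleans

However, if you have a boolean NumPy array, pandas Series, or pandas DataFrame, you can also use the element-wise bitwise functions (for booleans they are, or at least should be, indistinguishable from the logical functions):

bitwise and: np.bitwise_and or the & operator
bitwise or: np.bitwise_or or the | operator
bitwise not: np.invert (or its alias np.bitwise_not) or the ~ operator
bitwise xor: np.bitwise_xor or the ^ operator

Typically the operators are used. However, when combining them with comparison operators, remember to wrap each comparison in parentheses, because the bitwise operators have a higher precedence than the comparison operators: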

(df1 < 10) | (df2 > 10)  # instead of the wrong df1 < 10 | df2 > 10

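This can be irritating because the Python logical operators have a lower precedence than the comparison operators, so you normally write a < 10 and b > 10 (where a and b are, for example, simple integers) without needing parentheses.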
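The precedence problem is easy to reproduce. In this small sketch s1 and s2 are hypothetical Series standing in for df1 and df2; the parenthesized form yields a mask, while the unparenthesized form ends up calling bool() on a Series.

```python
import pandas as pd

# Hypothetical data, chosen only to make the mask easy to read.
s1 = pd.Series([5, 15, 8])
s2 = pd.Series([20, 3, 12])

# Parenthesized comparisons: element-wise OR of two boolean masks.
mask = (s1 < 10) | (s2 > 10)
print(mask.tolist())  # [True, False, True]

# Without parentheses Python reads this as s1 < (10 | s2) > 10,
# a chained comparison that needs the truth value of a Series.
try:
    s1 < 10 | s2 > 10
    unparenthesized = "ok"
except ValueError:
    unparenthesized = "ValueError"

print(unparenthesized)  # ValueError
```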

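Differences between logical and bitwise operations (on non-booleans)

It is really important to stress that bitwise and logical operations are only equivalent for boolean NumPy arrays (and boolean Series and DataFrames). If these do not contain booleans, the operations give different results. I will include examples using NumPy arrays, but the results are similar for the pandas data structures: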

>>> import numpy as np
>>> a1 = np.array([0, 0, 1, 1])
>>> a2 = np.array([0, 1, 0, 1])

>>> np.logical_and(a1, a2)
array([False, False, False,  True])
>>> np.bitwise_and(a1, a2)
array([0, 0, 0, 1], dtype=int32)

Since NumPy (and similarly pandas) treats boolean (Boolean or "mask" index arrays) and integer (Index arrays) indices differently, the results of indexing will also differ:

>>> a3 = np.array([1, 2, 3, 4])

>>> a3[np.logical_and(a1, a2)]
array([4])
>>> a3[np.bitwise_and(a1, a2)]
array([1, 1, 1, 2])
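The 0/1 arrays above happen to agree in truthiness; with other integers the two families disagree even there. A small sketch with made-up arrays x and y shows this, and shows that converting to bool first (one possible remedy, not something the text above prescribes) makes the bitwise result match the logical one.

```python
import numpy as np

# Hypothetical non-boolean inputs: 2 and 1 are both truthy, but share no set bits.
x = np.array([2, 2, 0])
y = np.array([1, 2, 3])

logical = np.logical_and(x, y)  # truthiness of each pair
bitwise = np.bitwise_and(x, y)  # bit pattern of each pair

# Converting to bool first makes the bitwise function behave logically.
bitwise_on_bools = np.bitwise_and(x.astype(bool), y.astype(bool))

print(logical.tolist())           # [True, True, False]
print(bitwise.tolist())           # [0, 2, 0]
print(bitwise_on_bools.tolist())  # [True, True, False]
```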

Summary table

Logical operator | NumPy logical function | NumPy bitwise function | Bitwise operator
-------------------------------------------------------------------------------------
       and       |  np.logical_and        | np.bitwise_and         |        &
-------------------------------------------------------------------------------------
       or        |  np.logical_or         | np.bitwise_or          |        |
-------------------------------------------------------------------------------------
                 |  np.logical_xor        | np.bitwise_xor         |        ^
-------------------------------------------------------------------------------------
       not       |  np.logical_not        | np.invert              |        ~

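Here the logical operators do not work for NumPy arrays, pandas Series, and pandas DataFrames. The others work on these data structures (and on plain Python objects) and operate element-wise. However, be careful with bitwise inversion on plain Python bools, because a bool is interpreted as an integer in this context (for example, ~False returns -1 and ~True returns -2).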

User

Answer 2 · edited 2024-05-15 02:13:52

When you say

(a['x']==1) and (a['y']==10)

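you are implicitly asking Python to convert (a['x']==1) and (a['y']==10) to boolean values.

NumPy arrays (of length greater than 1) and Pandas objects such as Series do not have a boolean value. In other words, they raise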

ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

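when used as a boolean value. That is because it is unclear when it should be True or False. Some users might assume they are True if they have non-zero length, like a Python list. Others might want it to be True only if all of its elements are True. Others might want it to be True if any of its elements are True.

Because there are so many conflicting expectations, the designers of NumPy and Pandas refuse to guess and instead raise a ValueError.

Instead, you must be explicit and indicate which behavior you want, by using the empty attribute or by calling the all() or any() method.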
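For reference, this is roughly what being explicit looks like on a Series. The Series s is made up for the example. Note that in pandas, empty is an attribute rather than a method.

```python
import pandas as pd

# Hypothetical mask with mixed values.
s = pd.Series([True, False, True])

has_any_true = bool(s.any())  # True: at least one element is True
all_true = bool(s.all())      # False: not every element is True
is_empty = s.empty            # False: the Series has elements (attribute, no call)

print(has_any_true, all_true, is_empty)
```

Each of the three answers a different question, which is exactly why pandas does not pick one for you.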

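In this case, however, it looks like you do not want boolean evaluation; you want an element-wise logical and. That is what the & binary operator performs: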

(a['x']==1) & (a['y']==10)

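returns a boolean array.

By the way, as alexpmil notes, the parentheses are mandatory since & has a higher operator precedence than ==. Without the parentheses, a['x']==1 & a['y']==10 would be evaluated as a['x'] == (1 & a['y']) == 10, which in turn is equivalent to the chained comparison (a['x'] == (1 & a['y'])) and ((1 & a['y']) == 10). That is an expression of the form Series and Series. Using and with two Series again triggers the same ValueError as above. That is why the parentheses are mandatory.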
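If you want to check the parse for yourself without running any pandas code, the standard library ast module can show how Python groups the unparenthesized expression. This is only an inspection sketch; the variable names are illustrative.

```python
import ast

# Parse the unparenthesized expression without evaluating it.
tree = ast.parse("a['x'] == 1 & a['y'] == 10", mode="eval").body

kind = type(tree).__name__                         # the outermost node
n_comparisons = len(tree.ops)                      # number of == in the chain
middle = type(tree.comparators[0]).__name__        # the operand between the two ==
middle_op = type(tree.comparators[0].op).__name__  # the operator inside that operand

print(kind, n_comparisons, middle, middle_op)
# Compare 2 BinOp BitAnd
```

The outer node is a single chained comparison with two == operators, and the operand in the middle is the bitwise expression 1 & a['y'].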

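User

Answer 3 · edited 2024-05-15 02:13:52

TLDR: Logical operators in Pandas are &, | and ~, and parentheses (...) are important!

Python's and, or and not logical operators are designed to work with scalars. So Pandas had to go one better and override the bitwise operators to achieve a vectorized (element-wise) version of this functionality.

So the following expressions in Python (where exp1 and exp2 are expressions that evaluate to a boolean result)...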

exp1 and exp2              # Logical AND
exp1 or exp2               # Logical OR
not exp1                   # Logical NOT

...will translate to...

exp1 & exp2                # Element-wise logical AND
exp1 | exp2                # Element-wise logical OR
~exp1                      # Element-wise logical NOT

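for pandas.

If, in the process of performing a logical operation, you get a ValueError, then you need to use parentheses for grouping:

(exp1) op (exp2)

For example,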

(df['col1'] == x) & (df['col2'] == y)

and so on.
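To make the scalar-versus-vectorized correspondence concrete, here is a rough sketch that computes the same condition twice: once with & on whole columns and once with and on one pair of scalars at a time. The frame contents and the values of x and y are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical data and targets.
df = pd.DataFrame({"col1": [1, 1, 2], "col2": [10, 20, 10]})
x, y = 1, 10

# Vectorized: one element-wise AND over the whole columns.
vectorized = (df["col1"] == x) & (df["col2"] == y)

# Scalar: plain `and` works because each operand is a single value.
row_by_row = [bool((a == x) and (b == y)) for a, b in zip(df["col1"], df["col2"])]

print(vectorized.tolist())  # [True, False, False]
print(row_by_row)           # [True, False, False]
```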

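Boolean Indexing: A common operation is to compute boolean masks through logical conditions to filter the data. Pandas provides three operators: & for logical AND, | for logical OR, and ~ for logical NOT.

Consider the following setup: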

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (5, 3)), columns=list('ABC'))
df

   A  B  C
0  5  0  3
1  3  7  9
2  3  5  2
3  4  7  6
4  8  8  1

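Logical AND

For df above, say you would like to return all rows where A < 5 and B > 5. This is done by computing a mask for each condition separately, and ANDing them.

Overloaded bitwise & operator
Before continuing, please take note of this particular excerpt of the docs, which states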

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df.A > 2 & df.B < 3 as df.A > (2 & df.B) < 3, while the desired evaluation order is (df.A > 2) & (df.B < 3).

So, with this in mind, element-wise logical AND can be implemented with the bitwise operator &:

df['A'] < 5

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df['B'] > 5

0    False
1     True
2    False
3     True
4     True
Name: B, dtype: bool

(df['A'] < 5) & (df['B'] > 5)

0    False
1     True
2    False
3     True
4    False
dtype: bool

And the subsequent filtering step is simply,

df[(df['A'] < 5) & (df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

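The parentheses are used to override the default precedence order of the bitwise operators, which have a higher precedence than the conditional operators < and >. See the section on Operator Precedence in the Python docs.

If you do not use parentheses, the expression is evaluated incorrectly. For example, if you accidentally attempt something such as

df['A'] < 5 & df['B'] > 5

it is parsed as

df['A'] < (5 & df['B']) > 5

which becomes

df['A'] < something_you_dont_want > 5

which becomes (see the Python docs on chained operator comparison)

(df['A'] < something_you_dont_want) and (something_you_dont_want > 5)

which becomes

# Both operands are Series...
something_else_you_dont_want1 and something_else_you_dont_want2

which throws

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

So, don't make that mistake!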
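For anyone curious what "something you don't want" actually is, the following sketch rebuilds the same df from literal values (so it does not depend on the random seed) and evaluates the middle term on its own before triggering the error.

```python
import pandas as pd

# Same values as the df shown earlier, written out literally.
df = pd.DataFrame({"A": [5, 3, 3, 4, 8], "B": [0, 7, 5, 7, 8], "C": [3, 9, 2, 6, 1]})

# This is what Python evaluates first: a bitwise AND of 5 with column B.
middle = 5 & df["B"]
print(middle.tolist())  # [0, 5, 5, 5, 0]

# The full unparenthesized expression then needs bool(Series) and fails.
try:
    df["A"] < 5 & df["B"] > 5
    outcome = "ok"
except ValueError:
    outcome = "ValueError"
print(outcome)  # ValueError

# The parenthesized version gives the intended rows.
filtered = df[(df["A"] < 5) & (df["B"] > 5)]
print(filtered.index.tolist())  # [1, 3]
```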

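Avoiding parentheses grouping
The fix is actually quite simple. Most operators have a corresponding bound method for DataFrames. If the individual masks are built using these methods instead of the conditional operators, you no longer need to group with parentheses to specify the evaluation order: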

df['A'].lt(5)

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df['B'].gt(5)

0    False
1     True
2    False
3     True
4     True
Name: B, dtype: bool

df['A'].lt(5) & df['B'].gt(5)

0    False
1     True
2    False
3     True
4    False
dtype: bool

See the section on Flexible Comparisons. To summarize, we have

╒════╤════════════╤════════════╕
│    │ Operator   │ Function   │
╞════╪════════════╪════════════╡
│  0 │ >          │ gt         │
├────┼────────────┼────────────┤
│  1 │ >=         │ ge         │
├────┼────────────┼────────────┤
│  2 │ <          │ lt         │
├────┼────────────┼────────────┤
│  3 │ <=         │ le         │
├────┼────────────┼────────────┤
│  4 │ ==         │ eq         │
├────┼────────────┼────────────┤
│  5 │ !=         │ ne         │
╘════╧════════════╧════════════╛
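The table can be checked mechanically. This sketch pairs each method name with the matching function from the standard operator module and confirms that both spellings give the same mask. The Series s and the threshold 4 are arbitrary choices for the example.

```python
import operator
import pandas as pd

# Column A from the df above, written out literally.
s = pd.Series([5, 3, 3, 4, 8])

# Method name -> equivalent Python operator.
pairs = {
    "gt": operator.gt,
    "ge": operator.ge,
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
}

# s.gt(4) should equal s > 4, and so on for every row of the table.
agree = {name: getattr(s, name)(4).equals(op(s, 4)) for name, op in pairs.items()}
print(agree)
```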

Another option for avoiding parentheses is to use DataFrame.query (or eval):

df.query('A < 5 and B > 5')

   A  B  C
1  3  7  9
3  4  7  6

I have extensively documented query and eval in Dynamic Expression Evaluation in pandas using pd.eval().
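Since eval is mentioned but not shown, here is a brief sketch of the two side by side, again on a literal copy of df. As far as I know, DataFrame.eval returns the boolean mask itself while DataFrame.query returns the filtered rows; treat the snippet as an illustration rather than a complete reference.

```python
import pandas as pd

# Same values as the df shown earlier, written out literally.
df = pd.DataFrame({"A": [5, 3, 3, 4, 8], "B": [0, 7, 5, 7, 8], "C": [3, 9, 2, 6, 1]})

# query filters the rows directly; `and` is allowed inside the expression string.
by_query = df.query("A < 5 and B > 5")

# eval evaluates the same expression to a boolean mask, which can then be used for indexing.
mask = df.eval("A < 5 and B > 5")
by_eval = df[mask]

print(by_query.index.tolist())  # [1, 3]
print(mask.tolist())            # [False, True, False, True, False]
```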

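operator.and_
Allows you to perform this operation in a functional manner. Internally calls Series.__and__, which corresponds to the bitwise operator.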

import operator 

operator.and_(df['A'] < 5, df['B'] > 5)
# Same as,
# (df['A'] < 5).__and__(df['B'] > 5) 

0    False
1     True
2    False
3     True
4    False
dtype: bool

df[operator.and_(df['A'] < 5, df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

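You won't usually need this, but it is useful to know.

Generalizing: np.logical_and (and logical_and.reduce)
Another alternative is using np.logical_and, which also does not need parentheses grouping: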

np.logical_and(df['A'] < 5, df['B'] > 5)

0    False
1     True
2    False
3     True
4    False
Name: A, dtype: bool

df[np.logical_and(df['A'] < 5, df['B'] > 5)]

   A  B  C
1  3  7  9
3  4  7  6

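np.logical_and is a ufunc (Universal Functions), and most ufuncs have a reduce method. This means it is easier to generalize with logical_and if you have multiple masks to AND. For example, to AND masks m1, m2 and m3 with &, you would have to do

m1 & m2 & m3

However, an easier option is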

np.logical_and.reduce([m1, m2, m3])

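This is powerful, because it lets you build on top of this with more complex logic (for example, dynamically generating masks in a list comprehension and ANDing all of them):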

import operator

cols = ['A', 'B']
ops = [np.less, np.greater]
values = [5, 5]

m = np.logical_and.reduce([op(df[c], v) for op, c, v in zip(ops, cols, values)])
m 
# array([False,  True, False,  True, False])

df[m]
   A  B  C
1  3  7  9
3  4  7  6
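One thing to be aware of is that logical_and.reduce hands back a plain NumPy array. If you would rather keep a pandas Series (with its index), one alternative, which is my own suggestion and not part of the approach above, is functools.reduce with operator.and_. The third mask on column C is an extra condition invented for this sketch.

```python
import functools
import operator

import numpy as np
import pandas as pd

# Same values as the df shown earlier, written out literally.
df = pd.DataFrame({"A": [5, 3, 3, 4, 8], "B": [0, 7, 5, 7, 8], "C": [3, 9, 2, 6, 1]})

# Three masks; the condition on C is hypothetical.
masks = [df["A"] < 5, df["B"] > 5, df["C"] > 6]

# Folds the list as masks[0] & masks[1] & masks[2], keeping the Series type.
combined = functools.reduce(operator.and_, masks)

# The ufunc reduction gives the same values as an ndarray.
as_array = np.logical_and.reduce(masks)

print(type(combined).__name__, combined.tolist())  # Series [False, True, False, False, False]
print(as_array.tolist())                           # [False, True, False, False, False]
print(df[combined].index.tolist())                 # [1]
```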

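I know I'm harping on this point, but please bear with me. This is a very common beginner's error, and must be explained very thoroughly.

Logical OR

For the df above, say you would like to return all rows where A == 3 or B == 7.

Overloaded bitwise |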

df['A'] == 3

0    False
1     True
2     True
3    False
4    False
Name: A, dtype: bool

df['B'] == 7

0    False
1     True
2    False
3     True
4    False
Name: B, dtype: bool

(df['A'] == 3) | (df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
dtype: bool

df[(df['A'] == 3) | (df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

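If you haven't yet, please also read the section on Logical AND above; all caveats apply here.

Alternatively, this operation can be specified with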

df[df['A'].eq(3) | df['B'].eq(7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

operator.or_
Calls Series.__or__ under the hood.

operator.or_(df['A'] == 3, df['B'] == 7)
# Same as,
# (df['A'] == 3).__or__(df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
dtype: bool

df[operator.or_(df['A'] == 3, df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

np.logical_or
For two conditions, use logical_or:

np.logical_or(df['A'] == 3, df['B'] == 7)

0    False
1     True
2     True
3     True
4    False
Name: A, dtype: bool

df[np.logical_or(df['A'] == 3, df['B'] == 7)]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6

For multiple masks, use logical_or.reduce:

np.logical_or.reduce([df['A'] == 3, df['B'] == 7])
# array([False,  True,  True,  True, False])

df[np.logical_or.reduce([df['A'] == 3, df['B'] == 7])]

   A  B  C
1  3  7  9
2  3  5  2
3  4  7  6
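As a sanity check, the four OR spellings from this section can be compared in one go. The sketch below uses a literal copy of df; the dictionary names are illustrative.

```python
import operator

import numpy as np
import pandas as pd

# Same values as the df shown earlier, written out literally.
df = pd.DataFrame({"A": [5, 3, 3, 4, 8], "B": [0, 7, 5, 7, 8], "C": [3, 9, 2, 6, 1]})
left, right = df["A"] == 3, df["B"] == 7

spellings = {
    "|": left | right,
    "operator.or_": operator.or_(left, right),
    "np.logical_or": np.logical_or(left, right),
    "np.logical_or.reduce": np.logical_or.reduce([left, right]),
}

# Normalize Series and ndarray results to plain lists for comparison.
as_lists = {name: np.asarray(mask).tolist() for name, mask in spellings.items()}
print(as_lists)
```

All four entries should hold the same list, [False, True, True, True, False].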

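Logical NOT

Given a mask, such as

mask = pd.Series([True, True, False])

If you need to invert every boolean value (so that the end result is [False, False, True]), then you can use any of the methods below.

Bitwise ~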

~mask

0    False
1    False
2     True
dtype: bool

Again, expressions need to be parenthesized.

~(df['A'] == 3)

0     True
1    False
2    False
3     True
4     True
Name: A, dtype: bool

This internally calls

mask.__invert__()

0    False
1    False
2     True
dtype: bool

But don't use it directly.

operator.inv
Internally calls __invert__ on the Series.

operator.inv(mask)

0    False
1    False
2     True
dtype: bool

np.logical_not
This is the NumPy variant.

np.logical_not(mask)

0    False
1    False
2     True
dtype: bool

Note: np.logical_and can be substituted with np.bitwise_and, logical_or with bitwise_or, and logical_not with invert.
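That substitution is only safe when the mask really is boolean. A short sketch with a made-up integer Series named flags shows how ~ and np.logical_not diverge on 0/1 integers, and that casting to bool first brings them back in line.

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 flags stored as integers rather than booleans.
flags = pd.Series([1, 0, 1])

bitwise = ~flags                 # flips the bits of each integer
logical = np.logical_not(flags)  # negates the truthiness of each integer
via_bool = ~flags.astype(bool)   # cast first, then ~ acts as logical NOT

print(bitwise.tolist())   # [-2, -1, -2]
print(logical.tolist())   # [False, True, False]
print(via_bool.tolist())  # [False, True, False]
```

Using the bitwise result as a filter would be a mistake here, since -2 and -1 are not a boolean mask.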
