Efficient way to apply multiple filters to a pandas DataFrame or Series

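Long example

I'll start with an example of what I have currently, filtering only a Series object. Below is the function I'm currently using: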

def apply_relops(series, relops):
    """
    Pass dictionary of relational operators to perform on given series object
    """
    for op, vals in relops.iteritems():
        op_func = ops[op]
        for val in vals:
            filtered = op_func(series, val)
            series = series.reindex(series[filtered])
    return series

The user provides a dictionary with the operations they want to perform:

>>> df = pandas.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
>>> print df
   col1  col2
0     0    10
1     1    11
2     2    12
>>> from operator import le, ge
>>> ops = {'>=': ge, '<=': le}
>>> apply_relops(df['col1'], {'>=': [1]})
col1
1       1
2       2
Name: col1
>>> apply_relops(df['col1'], relops = {'>=': [1], '<=': [1]})
col1
1       1
Name: col1

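The "problem" with my approach above is that I suspect the intermediate steps copy a lot of data unnecessarily.

I would also like to extend this so that the dictionary passed in can name the columns to operate on, and filter an entire DataFrame based on that dictionary. However, I'm assuming that whatever works for the Series can easily be extended to a DataFrame.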

3 Answers

User

Answer 1 · edited 2024-04-26 22:26:57

The simplest solution:

Use:

filtered_df = df[(df['col1'] >= 1) & (df['col1'] <= 5)]

As another example, to filter the DataFrame for rows that belong to February 2018, use the following code:

filtered_df = df[(df['year'] == 2018) & (df['month'] == 2)]
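The parentheses in these expressions are required, because `&` binds more tightly than comparison operators in Python. A minimal sketch, reusing the sample frame from the question (the names `lower_ok`, `upper_ok`, and `mask` are only illustrative), shows that naming each condition first avoids both the precedence trap and overly long lines:

```python
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})

# Build each comparison as its own boolean Series, then combine with &.
lower_ok = df['col1'] >= 1
upper_ok = df['col1'] <= 1
mask = lower_ok & upper_ok

filtered_df = df[mask]

# Without parentheses, `df['col1'] >= 1 & df['col1'] <= 1` is parsed as the
# chained comparison `df['col1'] >= (1 & df['col1']) <= 1`, which raises
# ValueError because the truth value of a Series is ambiguous.
```

Indexing with a single combined mask selects the rows once, rather than once per condition.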

User

Answer 2 · edited 2024-04-26 22:26:57

pandas (and numpy) allow for boolean indexing, which will be much more efficient:

In [11]: df.loc[df['col1'] >= 1, 'col1']
Out[11]: 
1    1
2    2
Name: col1

In [12]: df[df['col1'] >= 1]
Out[12]: 
   col1  col2
1     1    11
2     2    12

In [13]: df[(df['col1'] >= 1) & (df['col1'] <=1 )]
Out[13]: 
   col1  col2
1     1    11

If you want to write helper functions for this, consider something along these lines:

In [14]: def b(x, col, op, n): 
             return op(x[col],n)

In [15]: def f(x, *b):
             # reduce accepts any number of masks; np.logical_and(*b) only works for exactly two
             return x[np.logical_and.reduce(b)]

In [16]: b1 = b(df, 'col1', ge, 1)

In [17]: b2 = b(df, 'col1', le, 1)

In [18]: f(df, b1, b2)
Out[18]: 
   col1  col2
1     1    11
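The question also asks about driving the filter from a dictionary that names the columns. One possible sketch, assuming a nested layout of `{column: {operator: [values]}}` (that layout, the `OPS` table, and the `build_mask` name are assumptions for illustration, not something from the thread), builds a single mask and indexes the frame only once:

```python
from operator import ge, le

import numpy as np
import pandas as pd

# Hypothetical lookup table, mirroring the `ops` dict from the question.
OPS = {'>=': ge, '<=': le}


def build_mask(frame, column_relops):
    """Combine every (column, operator, value) condition into one boolean mask."""
    masks = [
        OPS[op](frame[col], val)
        for col, relops in column_relops.items()
        for op, vals in relops.items()
        for val in vals
    ]
    if not masks:
        # No conditions given: keep every row.
        return np.ones(len(frame), dtype=bool)
    # ufunc.reduce ANDs all masks together, however many there are.
    return np.logical_and.reduce(masks)


df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
result = df[build_mask(df, {'col1': {'>=': [1]}, 'col2': {'<=': [11]}})]
```

Each comparison still scans its whole column, but the frame itself is only sliced once at the end, which avoids the repeated reindexing in the question's `apply_relops`.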

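Update: pandas 0.13 has a query method for these kinds of use cases. Assuming the column names are valid identifiers, the following works (and can be more efficient for large frames, since it uses numexpr behind the scenes):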

In [21]: df.query('col1 <= 1 & 1 <= col1')
Out[21]:
   col1  col2
1     1    11
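Query strings do not have to be fully hard-coded: `DataFrame.query` can refer to Python variables in the calling scope by prefixing them with `@`. A small sketch on the same sample frame (the variable names `lower` and `upper` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})

# Bounds live in ordinary Python variables and are referenced with @.
lower, upper = 1, 1
result = df.query('col1 >= @lower and col1 <= @upper')
```

This keeps the thresholds out of the string itself, although the column names and operators are still written as text.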

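User

Answer 3 · edited 2024-04-26 22:26:57

Chaining conditions creates long lines, which PEP 8 discourages. Using the .query method forces you to use strings, which is powerful but unpythonic and not very dynamic.

Once each of the filters is in place, one approach is: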

import numpy as np
import functools
def conjunction(*conditions):
    return functools.reduce(np.logical_and, conditions)

c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4

data_filtered = data[conjunction(c_1, c_2, c_3)]

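np.logical_and operates element-wise and is fast, but it does not accept more than two input arrays, which is why functools.reduce is used to fold it over all the conditions.

Note that this still has some redundancies: a) short-circuiting does not happen at a global level; b) each individual condition is evaluated on the whole initial data. Still, I expect this to be efficient enough for many applications, and it is very readable.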
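If redundancy (b) matters, one alternative is to pass the conditions as callables and apply them one after another, so that each later condition only sees the rows that survived the earlier ones. The sketch below is an assumption-laden illustration, not part of the answer: `filter_sequentially` is a hypothetical helper and the sample `data` frame is made up.

```python
import pandas as pd


def filter_sequentially(frame, *predicates):
    """Apply each predicate only to the rows that survived the previous ones."""
    for predicate in predicates:
        if frame.empty:
            # Nothing left to filter, so skip the remaining predicates.
            break
        frame = frame[predicate(frame)]
    return frame


# Made-up sample data for illustration.
data = pd.DataFrame({
    'col1': [True, True, False],
    'col2': [10, 70, 20],
    'col3': [1, 2, 4],
})

data_filtered = filter_sequentially(
    data,
    lambda d: d.col1,
    lambda d: d.col2 < 64,
    lambda d: d.col3 != 4,
)
```

The trade-off is that every step materializes an intermediate frame, which is the kind of copying the question set out to avoid, so this is likely only worthwhile when early conditions discard most rows or later conditions are expensive to evaluate.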
