Filter a DataFrame by the index and values of a Series

Asked 2025-04-14 18:25

I am trying to use a Series to select specific rows of a DataFrame. For example:

import pandas as pd
filter_ = pd.Series(
    data = [1, 9, 10],
    index = ['A', 'B', 'C']
    )
filter_.index.name = 'my_id'

df = pd.DataFrame({
    'A': [1, 2, 9, 4],
    'B': [9, 6, 7, 8],
    'C': [10, 91, 32, 13],
    'D': [43, 12, 7, 9],
    'E': [65, 12, 3, 8]
})

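I now want to filter the DataFrame so that I get the row in which the columns and values match the index and values of the Series. In other words, the result should be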

    A   B   C   D   E
0   1   9   10  43  65

As far as I can tell, the .query() method is not an option, because one of the values in my actual DataFrame is a pathlib object.

Something like df.loc[df[filter_.index[0]] == filter_.iloc[0]] works, but it obviously filters only on index A, whereas I want to filter dynamically on every index/value pair in the Series.

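Edit: some of the proposed solutions do not work on my actual DataFrame because some of its columns contain None values. I previously posted a more complete output showing the difference between the full example and the minimal reproducible example.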
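A minimal sketch of why None values get in the way, assuming the behaviour of element-wise comparison in current pandas versions: a vectorized .eq() treats a missing value as unequal to everything, including another missing value. The small three-column frame below is an illustrative cut-down of the example, not the actual data.

```python
import pandas as pd

filter_ = pd.Series([1, 9, None], index=['A', 'B', 'C'], dtype='object')
df = pd.DataFrame({
    'A': [1, 2, 9, 4],
    'B': [9, 6, 7, 8],
    'C': [None, None, None, None],
})

# Element-wise comparison, aligned on column labels.
# The None entries in column C never compare equal to the None in filter_.
mask = df[filter_.index].eq(filter_)
print(mask)

# Because column C is False everywhere, no row passes the all-columns test,
# even though row 0 is the row we are looking for.
naive = df[mask.all(axis=1)]
print(len(naive))
```

Plain Python equality, by contrast, evaluates None == None as True, which is presumably why a row-wise comparison still finds the row in this situation.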

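Timing tests

In summary: Panda Kim's method is the fastest on large DataFrames, while TheHungryCub's answer is the fastest on very small DataFrames but slows down dramatically as the data grows.

I have now timed the example given by Panda Kim below, which includes None values: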

import pandas as pd
filter_ = pd.Series(
    data = [1, 9, None],
    index = ['A', 'B', 'C'], dtype='object'
    )
filter_.index.name = 'my_id'

df = pd.DataFrame({
    'A': [1, 2, 9, 4],
    'B': [9, 6, 7, 8],
    'C': [None, None, None, None],
    'D': [43, 12, 7, 9],
    'E': [65, 12, 3, 8]
})

Using %%timeit, the timings are as follows:

col1 = filter_.index[filter_.notna()]
col2 = filter_.index.difference(col1)
cond = df[col1].eq(filter_[col1]).all(axis=1) & df[col2].isna().all(axis=1)
out = df[cond]

911 µs ± 26.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

out = df[df[filter_.index].fillna(object).eq(filter_.fillna(object)).all(axis=1)]

421 µs ± 6.01 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

out = df[df.apply(lambda row: all(row[index] == value for index, value in filter_.items()), axis=1)]

200 µs ± 5.97 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

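So on this four-row DataFrame, .apply() is actually the fastest. Appending + list(range(1000)) to every column of the DataFrame (10004 rows in total according to the figure given here, although range(1000) alone would yield 1004) changes the picture considerably: the first method takes 1.31 ms, the second 1.53 ms, and the third (.apply()) 43.5 ms. Scaled up to 1 million rows, the first method takes 38.2 ms, the second 125 ms, and the third 4.55 s.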

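3 answers

You can use .apply() together with Python's built-in all() to check, row by row, whether every index/value pair in the Series matches the corresponding column of that row.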

import pandas as pd

filter_ = pd.Series(
    data=[1, 9, 10],
    index=['A', 'B', 'C']
)
filter_.index.name = 'my_id'

df = pd.DataFrame({
    'A': [1, 2, 9, 4],
    'B': [9, 6, 7, 8],
    'C': [10, 91, 32, 13],
    'D': [43, 12, 7, 9],
    'E': [65, 12, 3, 8]
})

filtered_rows = df[df.apply(lambda row: all(row[index] == value for index, value in filter_.items()), axis=1)]

print(filtered_rows)

Output:

   A  B   C   D   E
0  1  9  10  43  65

Answered 2025-04-14 by Python大师

You can slice df with the index of filter_, compare the two, and then use all to build a mask for boolean indexing:

out = df[df[filter_.index].eq(filter_).all(axis=1)]

If your data may contain NaN (missing values), you can replace them with a dummy object (for example object) on both sides before comparing:

out = df[df[filter_.index].fillna(object).eq(filter_.fillna(object)).all(axis=1)]

Output:

   A  B   C   D   E
0  1  9  10  43  65
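To make the one-liner easier to follow, here is the same comparison split into its intermediate steps. The variable names subset, mask and row_ok are illustrative only.

```python
import pandas as pd

filter_ = pd.Series([1, 9, 10], index=['A', 'B', 'C'])
df = pd.DataFrame({
    'A': [1, 2, 9, 4],
    'B': [9, 6, 7, 8],
    'C': [10, 91, 32, 13],
    'D': [43, 12, 7, 9],
    'E': [65, 12, 3, 8],
})

# Keep only the columns that the filter names.
subset = df[filter_.index]

# Compare each column with the filter value carrying the same label.
mask = subset.eq(filter_)

# A row qualifies only if every compared column matched.
row_ok = mask.all(axis=1)

# Boolean indexing on the full frame keeps columns D and E as well.
out = df[row_ok]
print(out)
```

Because .eq() aligns the Series index with the column labels, the order of the entries in filter_ does not have to follow the column order of df.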

Answered 2025-04-14 by Python大师

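I think the problem is the null values. It looks like they have to be handled separately for a vectorized operation to work.

Let's rebuild the example so that it includes null values.

New example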

import pandas as pd
filter_ = pd.Series(
    data = [1, 9, None],
    index = ['A', 'B', 'C'], dtype='object'
)
filter_.index.name = 'my_id'

df = pd.DataFrame({
    'A': [1, 2, 9, 4],
    'B': [9, 6, 7, 8],
    'C': [None, None, None, None],
    'D': [43, 12, 7, 9],
    'E': [65, 12, 3, 8]
})

Code

col1 = filter_.index[filter_.notna()]
col2 = filter_.index.difference(col1)
cond = df[col1].eq(filter_[col1]).all(axis=1) & df[col2].isna().all(axis=1)
out = df[cond]

Output

    A   B   C       D   E
0   1   9   None    43  65
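The same idea can be wrapped in a small reusable function. This is a sketch under assumptions: the name filter_rows is hypothetical, and the explicit length checks are an addition meant to avoid reductions over frames with no columns when the filter has no missing (or no non-missing) entries.

```python
import pandas as pd

def filter_rows(df, filter_):
    """Return rows of df whose columns match every index/value pair of filter_.

    Missing values in filter_ (None/NaN) match missing values in df.
    """
    mask = pd.Series(True, index=df.index)
    present = filter_.index[filter_.notna().to_numpy()]  # labels with a real value
    missing = filter_.index.difference(present)          # labels whose value is missing
    if len(present):
        mask &= df[present].eq(filter_[present]).all(axis=1)
    if len(missing):
        mask &= df[missing].isna().all(axis=1)
    return df[mask]

df_plain = pd.DataFrame({
    'A': [1, 2, 9, 4], 'B': [9, 6, 7, 8], 'C': [10, 91, 32, 13],
    'D': [43, 12, 7, 9], 'E': [65, 12, 3, 8],
})
df_none = df_plain.assign(C=[None, None, None, None])

plain_filter = pd.Series([1, 9, 10], index=['A', 'B', 'C'])
none_filter = pd.Series([1, 9, None], index=['A', 'B', 'C'], dtype='object')

print(filter_rows(df_plain, plain_filter))
print(filter_rows(df_none, none_filter))
```

Both calls are expected to select the first row; the function works on column labels only, so extra columns such as D and E pass through untouched.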

Answered 2025-04-14 by Python大师
