pandas groupby filter: dropping some groups

Asked 2025-04-18 13:25

I have a groupby object:

grouped = df.groupby('name')
for k, group in grouped:
    print(group)

It contains three groups: bar, foo and foobar.

  name  time  
2  bar     5  
3  bar     6  


  name  time  
0  foo     5  
1  foo     2  

  name      time  
4  foobar     20  
5  foobar     1

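I need to filter these groups and drop every group that has no time greater than 5. In my example, the group foo should be dropped. I'd like to do this with filter():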

grouped.filter(lambda x: (x.max()['time']>5))

But x is apparently not just the group in DataFrame format.


3 Answers

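Filter the groups by a condition and return the groups that pass as a list or a dict. For example, return the groups whose length is at least 5.

Return a list of tuples: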

[(name,gdf) for name,gdf in df.groupby('Declarer') if len(gdf) >= 5]

Return a dict:

{name:gdf for name,gdf in df.groupby('Declarer') if len(gdf) >= 5}
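Applied to the data from the question, the same comprehension might look like the sketch below. The condition `gdf["time"].max() > 5` and the final `pd.concat` step are my own adaptation for this question, not part of the snippet above.

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({
    "name": ["foo", "foo", "bar", "bar", "foobar", "foobar"],
    "time": [5, 2, 5, 6, 20, 1],
})

# Keep only the groups whose largest time exceeds 5, keyed by group name.
kept = {name: gdf for name, gdf in df.groupby("name") if gdf["time"].max() > 5}
print(sorted(kept))  # names of the groups that passed the condition

# If a single DataFrame is needed, the surviving groups can be concatenated.
result = pd.concat(list(kept.values())).sort_index()
print(result)
```

This keeps each group as its own DataFrame, which is handy when the groups are processed separately afterwards.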

Answered 2025-04-18 by Python大师


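I'm still not very used to Python, NumPy or pandas. However, I've been looking into a similar problem, so I'd like to use this question as an example to share some of my ideas.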

import pandas as pd

df = pd.DataFrame()
df['name'] = ['foo', 'foo', 'bar', 'bar', 'foobar', 'foobar']
df['time'] = [5, 2, 5, 6, 20, 1]

grouped = df.groupby('name')
for k, group in grouped:
    print(group)

My answer 1:

indexes_should_drop = grouped.filter(lambda x: (x['time'].max() <= 5)).index
result1 = df.drop(index=indexes_should_drop)

My answer 2:

filter_time_max = grouped['time'].max() > 5
groups_should_keep = filter_time_max.loc[filter_time_max].index
result2 = df.loc[df['name'].isin(groups_should_keep)]

My answer 3:

filter_time_max = grouped['time'].max() <= 5
groups_should_drop = filter_time_max.loc[filter_time_max].index
result3 = df.drop(df[df['name'].isin(groups_should_drop)].index)

Result

    name    time
2   bar     5
3   bar     6
4   foobar  20
5   foobar  1
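To check that the three variants really agree, they can be run side by side. This is only a sketch: it repeats the code from above in one script and compares the outputs with `DataFrame.equals`. The name `drop_mask` is mine, used so the two boolean masks don't overwrite each other.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["foo", "foo", "bar", "bar", "foobar", "foobar"],
    "time": [5, 2, 5, 6, 20, 1],
})
grouped = df.groupby("name")

# Variant 1: collect the row labels of the groups to drop, then drop those rows.
indexes_should_drop = grouped.filter(lambda x: x["time"].max() <= 5).index
result1 = df.drop(index=indexes_should_drop)

# Variant 2: find the names of the groups to keep, then select their rows.
filter_time_max = grouped["time"].max() > 5
groups_should_keep = filter_time_max.loc[filter_time_max].index
result2 = df.loc[df["name"].isin(groups_should_keep)]

# Variant 3: find the names of the groups to drop, then drop their rows.
drop_mask = grouped["time"].max() <= 5
groups_should_drop = drop_mask.loc[drop_mask].index
result3 = df.drop(df[df["name"].isin(groups_should_drop)].index)

# All three should hold the same rows in the same order.
print(result1.equals(result2) and result2.equals(result3))
print(result1)
```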

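Key points

My answer 1 does not use group names to drop the groups. If you need the group names, you can get them with df.loc[indexes_should_drop].name.unique().

grouped['time'].max() <= 5 and grouped.apply(lambda x: (x['time'].max() <= 5)) give the same boolean values, indexed by group name.

The index of filter_time_max holds group names, so it cannot be passed directly to drop as row labels.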

name
bar       False
foo        True
foobar    False
Name: time, dtype: bool
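One way around the group-name index is to compute the group maximum per row instead of per group. The sketch below uses `transform("max")`, which is a different technique from the three answers above; the variable name `group_max` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["foo", "foo", "bar", "bar", "foobar", "foobar"],
    "time": [5, 2, 5, 6, 20, 1],
})

# transform("max") returns one value per row (the max time of that row's group),
# aligned with df's own index, so it can be compared and used as a row mask.
group_max = df.groupby("name")["time"].transform("max")

# Keep the rows whose group has a maximum time greater than 5.
result = df[group_max > 5]
print(result)
```

Because the mask shares the row index of df, no translation from group names back to row labels is needed.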

Answered 2025-04-18 by Python大师


Assuming the last line of your code should be >5 rather than >20, you can do this:

grouped.filter(lambda x: (x.time > 5).any())

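As you correctly pointed out, x is actually a DataFrame: it contains all the rows whose name column matches the key k from your loop.

Since you want to filter on whether the time column contains any value greater than 5, you can test that with (x.time > 5).any().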
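For completeness, here is a small runnable sketch of that filter on the question's data. The DataFrame construction is copied from the question; I've written `x["time"]` instead of `x.time`, which should behave the same here.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["foo", "foo", "bar", "bar", "foobar", "foobar"],
    "time": [5, 2, 5, 6, 20, 1],
})

# filter() calls the function once per group and keeps the rows of every
# group for which it returns True; rows keep their original order and index.
result = df.groupby("name").filter(lambda x: (x["time"] > 5).any())
print(result)
```

Only foo is removed, since neither of its times (5 and 2) is greater than 5.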

Answered 2025-04-18 by Python大师
