Pandas groupby on a MultiIndex: ignoring one level?
I am running a groupby on a DataFrame with a MultiIndex, similar to the one below:
                                        0         1 ...
categories features subfeatures
cat1       feature1 subfeature1 -0.224487 -0.227524
                    subfeature2 -0.591399 -0.799228
           feature2 subfeature1  1.190110 -1.365895 ...
                    subfeature2  0.720956 -1.325562
cat2       feature1 subfeature1  1.856932       NaN
                    subfeature2 -1.354258 -0.740473
           feature2 subfeature1  0.234075 -1.362235 ...
                    subfeature2  0.013875  1.309564
cat3       feature1 subfeature1       NaN       NaN
                    subfeature2 -1.260408  1.559721 ...
           feature2 subfeature1  0.419246  0.084386
                    subfeature2  0.969270  1.493417
...        ...      ...
The DataFrame can be generated with the following code:
import pandas as pd, numpy as np
np.random.seed(seed=90)
results = np.random.randn(3,2,2,2)
results[2,0,0,:] = np.nan
results[1,0,0,1] = np.nan
results = results.reshape((-1,2))
index = pd.MultiIndex.from_product([["cat1", "cat2", "cat3"],
                                    ["feature1", "feature2"],
                                    ["subfeature1", "subfeature2"]],
                                   names=["categories", "features", "subfeatures"])
df = pd.DataFrame(results, index=index)
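I want to select the groups in which the maximum difference between the two subfeature arrays exceeds some threshold, but I am having trouble with groupby.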
df.groupby(level=['categories','features'])
This gives me the following groups:
{('cat1', 'feature1'): [('cat1', 'feature1', 'subfeature1'),
('cat1', 'feature1', 'subfeature2')],
('cat1', 'feature2'): [('cat1', 'feature2', 'subfeature1'),
('cat1', 'feature2', 'subfeature2')],
('cat2', 'feature1'): [('cat2', 'feature1', 'subfeature1'),
('cat2', 'feature1', 'subfeature2')],
('cat2', 'feature2'): [('cat2', 'feature2', 'subfeature1'),
('cat2', 'feature2', 'subfeature2')],
('cat3', 'feature1'): [('cat3', 'feature1', 'subfeature1'),
('cat3', 'feature1', 'subfeature2')],
('cat3', 'feature2'): [('cat3', 'feature2', 'subfeature1'),
('cat3', 'feature2', 'subfeature2')]}
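Is there a way to make groupby ignore the subfeatures level when grouping? The reason is that I need subfeature1 and subfeature2 together; taken separately they are meaningless.
So ideally I would like groupby to return something like this: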
{('cat1', 'feature1'): [('cat1', 'feature1')],
('cat1', 'feature2'): [('cat1', 'feature2')],
('cat2', 'feature1'): [('cat2', 'feature1')],
('cat2', 'feature2'): [('cat2', 'feature2')],
('cat3', 'feature1'): [('cat3', 'feature1')],
('cat3', 'feature2'): [('cat3', 'feature2')]}
How can I do this?
2 Answers
0
With Jeff's help, I found a solution that works.
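def f(x):
    # x is one (categories, features) group; 'subfeatures' is a regular column here
    tmp = x.set_index('subfeatures')
    # largest absolute difference between the two subfeature rows, across all columns
    a = (tmp.xs('subfeature1') - tmp.xs('subfeature2')).abs().max()
    return a > 1

df.reset_index('subfeatures').groupby(level=['categories', 'features']).filter(f).set_index('subfeatures', append=True)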
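Essentially, I move subfeatures out of the index before grouping so that it is ignored, then temporarily set it as the index inside the filter function. That index change is lost once filter returns, so I append subfeatures back to the index afterward.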
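A possible variation, offered as a sketch rather than a replacement for the answer above: grouping on level=['categories', 'features'] already keeps both subfeature rows of a group together, so the filter function can select each row with xs(..., level='subfeatures') and skip the reset_index / set_index round trip. The names threshold and exceeds_threshold are made up for illustration; the cutoff of 1 is the one used in f.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
np.random.seed(seed=90)
results = np.random.randn(3, 2, 2, 2)
results[2, 0, 0, :] = np.nan
results[1, 0, 0, 1] = np.nan
results = results.reshape((-1, 2))
index = pd.MultiIndex.from_product(
    [["cat1", "cat2", "cat3"],
     ["feature1", "feature2"],
     ["subfeature1", "subfeature2"]],
    names=["categories", "features", "subfeatures"],
)
df = pd.DataFrame(results, index=index)

threshold = 1  # hypothetical name; same cutoff as in f


def exceeds_threshold(group):
    # Each group still carries the full three-level index,
    # so the two subfeature rows are picked by index level.
    a = group.xs("subfeature1", level="subfeatures")
    b = group.xs("subfeature2", level="subfeatures")
    # Largest absolute difference over all columns; NaN differences are skipped,
    # and an all-NaN group compares as False.
    return (a - b).abs().max().max() > threshold


filtered = df.groupby(level=["categories", "features"]).filter(exceeds_threshold)
```

Because the index is never reset, filtered keeps all three index levels without a final set_index call.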
1
In [20]: df.reset_index(level='subfeatures').groupby(level=['categories','features']).groups
Out[20]:
{('cat1', 'feature1'): [('cat1', 'feature1'), ('cat1', 'feature1')],
('cat1', 'feature2'): [('cat1', 'feature2'), ('cat1', 'feature2')],
('cat2', 'feature1'): [('cat2', 'feature1'), ('cat2', 'feature1')],
('cat2', 'feature2'): [('cat2', 'feature2'), ('cat2', 'feature2')],
('cat3', 'feature1'): [('cat3', 'feature1'), ('cat3', 'feature1')],
('cat3', 'feature2'): [('cat3', 'feature2'), ('cat3', 'feature2')]}
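To confirm that each of these groups holds both subfeature rows, one can inspect the group sizes and pull out a single group. This is a minimal sketch; the frame demo and its values are stand-ins chosen for illustration, and only the index layout matches the question.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["cat1", "cat2", "cat3"],
     ["feature1", "feature2"],
     ["subfeature1", "subfeature2"]],
    names=["categories", "features", "subfeatures"],
)
# Stand-in values (an assumption); only the index structure matters here.
demo = pd.DataFrame(np.arange(24.0).reshape(12, 2), index=index)

# Move 'subfeatures' out of the index, then group on the two remaining levels.
flat = demo.reset_index(level="subfeatures")
grouped = flat.groupby(level=["categories", "features"])

group_sizes = grouped.size()  # number of rows per (categories, features) pair
first_group = grouped.get_group(("cat1", "feature1"))
subfeature_labels = first_group["subfeatures"].tolist()
print(group_sizes)
print(subfeature_labels)
```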