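Filtering top-level categories in a hierarchical Pandas DataFrame using lower-level data
I have a Pandas DataFrame with many categories. Each category has several features, each feature has its own subfeatures, and the subfeatures are grouped in pairs. In simplified form, the structure looks like this: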
                                        0         1  ...
categories features subfeatures
cat1       feature1 subfeature1 -0.224487 -0.227524
                    subfeature2 -0.591399 -0.799228
           feature2 subfeature1  1.190110 -1.365895  ...
                    subfeature2  0.720956 -1.325562
cat2       feature1 subfeature1  1.856932       NaN
                    subfeature2 -1.354258 -0.740473
           feature2 subfeature1  0.234075 -1.362235  ...
                    subfeature2  0.013875  1.309564
cat3       feature1 subfeature1       NaN       NaN
                    subfeature2 -1.260408  1.559721  ...
           feature2 subfeature1  0.419246  0.084386
                    subfeature2  0.969270  1.493417
...                                   ...       ...
It can be generated with the following code:
import pandas as pd
import numpy as np

np.random.seed(seed=90)
# Shape: 3 categories x 2 features x 2 subfeatures x 2 columns
results = np.random.randn(3, 2, 2, 2)
# Invalidate a few values with NaN
results[2, 0, 0, :] = np.nan
results[1, 0, 0, 1] = np.nan
# Flatten to one row per (category, feature, subfeature)
results = results.reshape((-1, 2))
index = pd.MultiIndex.from_product([["cat1", "cat2", "cat3"],
                                    ["feature1", "feature2"],
                                    ["subfeature1", "subfeature2"]],
                                   names=["categories", "features", "subfeatures"])
df = pd.DataFrame(results, index=index)
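Now I want to find the top-level categories (cat1, etc.) for which the absolute difference between subfeature1 and subfeature2 within the same column (0 or 1) exceeds some threshold.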
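For example, with a threshold of 1, I would expect to get cat2 and cat3: for feature1 of cat2, the difference between subfeature1 and subfeature2 in column 0 is 1.856932 - (-1.354258) = 3.21119, which is greater than the threshold of 1. Likewise, for feature2 of cat3, the difference in column 1 is 1.493417 - 0.084386 = 1.409031, which is also greater than 1. cat1, on the other hand, would not be returned, because none of its subfeature pairs differ by more than 1. A NaN value invalidates a subfeature pair, so such pairs are ignored.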
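To make the criterion concrete, here is a minimal sketch that checks a few pairs by hand. The values are copied from the table above; the names pairs and exceeds are only illustrative, and the threshold of 1 is the one from the example.

```python
import numpy as np

threshold = 1  # value used in the example above

# (category, feature, column) -> (subfeature1, subfeature2), copied from the table
pairs = {
    ("cat1", "feature1", 0): (-0.224487, -0.591399),
    ("cat2", "feature1", 0): (1.856932, -1.354258),
    ("cat2", "feature1", 1): (np.nan, -0.740473),
    ("cat3", "feature2", 1): (0.084386, 1.493417),
}

exceeds = {}
for key, (s1, s2) in pairs.items():
    diff = abs(s1 - s2)
    # Any comparison with NaN is False, so an invalid pair never passes.
    exceeds[key] = diff > threshold
    print(key, diff, exceeds[key])
```

Only the cat2/feature1/column 0 and cat3/feature2/column 1 pairs pass here; the pair containing NaN yields a NaN difference and is rejected by the comparison.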
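What I have tried
I have implemented an iterative approach, but I feel it does not make good use of what Pandas offers, and it does not perform well either: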
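threshold = 1
selected = []  # separate list, so the results array above is not reused
for cat in df.index.levels[0]:
    for feature in df.index.levels[1]:
        df2 = df.xs((cat, feature))
        diffs = abs(df2.loc['subfeature1'] - df2.loc['subfeature2'])
        # Series.max() skips NaN, so invalid pairs are ignored
        if diffs.max() > threshold and cat not in selected:
            selected.append(cat)
print(selected)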
Result:
['cat2', 'cat3']
How can I do this using Pandas' built-in vectorized operations?
Edit: after trying Jeff's answer, I noticed something odd:
def f(x):
    a = max(abs(x.xs('subfeature1',level='subfeatures')-x.xs('subfeature2',level='subfeatures')))
    print(a)
    return a > 1

result = df.groupby(level=['categories','features']).filter(f)
print(result)
Output:
0.366912262765
0.571703714569
1
0.469153603312
0.0403331129905
3.2111900125 <------------------------------------------------
nan
0.220200012413
2.67179897269 <---------------------------------------------------
nan
nan
0.550023734074
1.40903094796 <-----------------------------------------------------!!!!!!!!!!!
                                        0         1
categories features subfeatures
cat2       feature1 subfeature1  1.856932       NaN
                    subfeature2 -1.354258 -0.740473
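I have marked every place where, based on the score, the algorithm should include the category. For cat3, however, nothing is included. Could the NaN values be the cause?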
1 Answer
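First group the data by the top two levels. Then apply a filter that keeps only the groups whose maximum difference between the subfeatures you care about exceeds the threshold (0 here).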
In [41]: df.groupby(level=['categories','features']).filter(lambda x: (x.xs('subfeature1',level='subfeatures')-x.xs('subfeature2',level='subfeatures')).max()>0)
Out[41]:
                                        0         1
categories features subfeatures
cat1       feature1 subfeature1 -0.224487 -0.227524
                    subfeature2 -0.591399 -0.799228
           feature2 subfeature1  1.190110 -1.365895
                    subfeature2  0.720956 -1.325562
cat2       feature1 subfeature1  1.856932       NaN
                    subfeature2 -1.354258 -0.740473
           feature2 subfeature1  0.234075 -1.362235
                    subfeature2  0.013875  1.309564
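As a variation on the idea above, the category names can also be obtained without a per-group Python function. The following is only a sketch, not necessarily the most efficient form: it rebuilds the frame from the values printed in the question, and it assumes the question's threshold of 1 and absolute differences. The names score, exceeds and hits are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Values copied from the table in the question (12 rows x 2 columns)
data = np.array([
    [-0.224487, -0.227524], [-0.591399, -0.799228],
    [1.190110, -1.365895], [0.720956, -1.325562],
    [1.856932, nan], [-1.354258, -0.740473],
    [0.234075, -1.362235], [0.013875, 1.309564],
    [nan, nan], [-1.260408, 1.559721],
    [0.419246, 0.084386], [0.969270, 1.493417],
])
index = pd.MultiIndex.from_product(
    [["cat1", "cat2", "cat3"], ["feature1", "feature2"], ["subfeature1", "subfeature2"]],
    names=["categories", "features", "subfeatures"],
)
df = pd.DataFrame(data, index=index)

threshold = 1  # assumed, taken from the question's example

# One row per (category, feature); the subtraction aligns on the remaining levels
sub1 = df.xs("subfeature1", level="subfeatures")
sub2 = df.xs("subfeature2", level="subfeatures")
diffs = (sub1 - sub2).abs()

# Largest difference across columns; max() skips NaN by default
score = diffs.max(axis=1)

# A category qualifies if any of its features exceeds the threshold
exceeds = score > threshold
selected = exceeds.groupby(level="categories").any()
hits = selected[selected].index.tolist()
print(hits)
```

A feature whose pairs are all NaN gets a NaN score, and NaN > threshold evaluates to False, so such a feature never selects its category on its own.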
This is a handy debugging technique for this kind of thing:
def f(x):
    print(x)
    return (x.xs(......))  # e.g. the filter from above

df.groupby(.....).filter(f)
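For illustration, one way such a debugging filter could be filled in is sketched below. This is an assumption about what a complete version might look like, not the exact filter from above: the helper name exceeds_threshold and its threshold parameter are hypothetical, the data is the cat2 and cat3 part of the question's table, and the function reduces each group to a single scalar so that filter receives one boolean per group.

```python
import numpy as np
import pandas as pd

nan = np.nan
# cat2 and cat3 rows copied from the table in the question
data = np.array([
    [1.856932, nan], [-1.354258, -0.740473],
    [0.234075, -1.362235], [0.013875, 1.309564],
    [nan, nan], [-1.260408, 1.559721],
    [0.419246, 0.084386], [0.969270, 1.493417],
])
index = pd.MultiIndex.from_product(
    [["cat2", "cat3"], ["feature1", "feature2"], ["subfeature1", "subfeature2"]],
    names=["categories", "features", "subfeatures"],
)
df = pd.DataFrame(data, index=index)


def exceeds_threshold(group, threshold=1):
    """Hypothetical helper: True if any column's absolute difference exceeds threshold."""
    sub1 = group.xs("subfeature1", level="subfeatures")
    sub2 = group.xs("subfeature2", level="subfeatures")
    # Reduce over rows, then over columns; pandas' max() skips NaN by default
    score = (sub1 - sub2).abs().max().max()
    # Debug output: the (category, feature) key and the score for this group
    print(group.index[0][:2], score)
    return bool(score > threshold)


kept = df.groupby(level=["categories", "features"]).filter(exceeds_threshold)
print(kept)
```

With this data the filter keeps both features of cat2 and feature2 of cat3, and drops feature1 of cat3, whose only complete pair is missing.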