Filtering top-level categories in a hierarchical Pandas DataFrame using lower-level data

1 vote
1 answer
842 views
Asked 2025-04-18 10:30

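I have a Pandas DataFrame with many categories. Each category has several features, and each feature has subfeatures that are grouped in pairs. In short, the structure looks roughly like this: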

                                        0         1    ...
categories features subfeatures                    
cat1       feature1 subfeature1 -0.224487 -0.227524
                    subfeature2 -0.591399 -0.799228
           feature2 subfeature1  1.190110 -1.365895    ...
                    subfeature2  0.720956 -1.325562
cat2       feature1 subfeature1  1.856932       NaN
                    subfeature2 -1.354258 -0.740473
           feature2 subfeature1  0.234075 -1.362235    ...
                    subfeature2  0.013875  1.309564
cat3       feature1 subfeature1       NaN       NaN
                    subfeature2 -1.260408  1.559721    ...
           feature2 subfeature1  0.419246  0.084386
                    subfeature2  0.969270  1.493417

...                    ...               ...

It can be generated with the following code:

import pandas as pd
import numpy as np

np.random.seed(seed=90)
results = np.random.randn(3,2,2,2)
results[2,0,0,:] = np.nan
results[1,0,0,1] = np.nan
results = results.reshape((-1,2))
index = pd.MultiIndex.from_product([["cat1", "cat2", "cat3"],
                                    ["feature1", "feature2"], 
                                    ["subfeature1", "subfeature2"]], 
                                   names=["categories", "features", "subfeatures"])
df = pd.DataFrame(results, index=index)

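Now I want to find the top-level categories (cat1, etc.) for which the difference between subfeature1 and subfeature2 within the same column (0, 1) exceeds a given threshold.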

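For example, with a threshold of 1 I would expect to get cat2 and cat3: for feature1 of cat2, the difference between subfeature1 and subfeature2 in column 0 is 1.856932 - (-1.354258) = 3.21119, which is greater than 1. Likewise, for feature2 of cat3 the difference in column 1 is 1.493417 - 0.084386 = 1.409031, which is also greater than 1. cat1, on the other hand, should not be returned, because none of its subfeature pairs differ by more than 1. A NaN value invalidates a subfeature pair, so that pair is ignored.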
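The NaN rule follows from ordinary floating-point arithmetic in pandas. As a small illustration, here is the cat2 / feature1 pair with values copied from the table (the variable names are only for this sketch):

```python
import numpy as np
import pandas as pd

# cat2 / feature1, columns 0 and 1, copied from the table in the question
sub1 = pd.Series([1.856932, np.nan])
sub2 = pd.Series([-1.354258, -0.740473])

# Subtracting anything from NaN gives NaN, so column 1 has no usable difference
diff = (sub1 - sub2).abs()
print(diff.round(6).tolist())  # [3.21119, nan]

# Any comparison with NaN is False, so the NaN pair never counts as exceeding
exceeds = diff > 1
print(exceeds.tolist())  # [True, False]
```

Column 0 alone is enough to flag cat2, while the NaN in column 1 drops out of the comparison.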

What I've tried

I have implemented an iterative approach, but I feel it doesn't take full advantage of Pandas, and it doesn't perform well either:

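threshold = 1
results = []  # reuse the name as a plain list of matching categories
for cat in df.index.levels[0]:
    for feature in df.index.levels[1]:
        df2 = df.xs((cat, feature))
        diffs = abs(df2.loc['subfeature1'] - df2.loc['subfeature2'])
        # Series.max() skips NaN, unlike the built-in max()
        if diffs.max() > threshold and cat not in results:
            results.append(cat)
print(results)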

The result is:

['cat2', 'cat3']

How can I use Pandas' built-in vectorized capabilities to do this?

Edit: Following Jeff's answer, I noticed something odd:

def f(x):
    a = max(abs(x.xs('subfeature1',level='subfeatures')-x.xs('subfeature2',level='subfeatures')))
    print(a)
    return a > 1

result = df.groupby(level=['categories','features']).filter(f)
print(result)

The result is:

0.366912262765
0.571703714569
1
0.469153603312
0.0403331129905
3.2111900125 <------------------------------------------------
nan
0.220200012413
2.67179897269  <---------------------------------------------------
nan
nan
0.550023734074
1.40903094796  <-----------------------------------------------------!!!!!!!!!!!
                                        0         1
categories features subfeatures                    
cat2       feature1 subfeature1  1.856932       NaN
                    subfeature2 -1.354258 -0.740473

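I have marked every place where, judging by the score, the algorithm should include the category. However, cat3 is not included. Could the NaN values be the cause?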

1 Answer

1

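First group the data by the top two levels. Then apply a filter that keeps only the groups whose maximum difference between the subfeatures you care about exceeds the threshold (0 in this example).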

In [41]: df.groupby(level=['categories','features']).filter(lambda x: (x.xs('subfeature1',level='subfeatures')-x.xs('subfeature2',level='subfeatures')).max()>0)
Out[41]: 
                                        0         1
categories features subfeatures                    
cat1       feature1 subfeature1 -0.224487 -0.227524
                    subfeature2 -0.591399 -0.799228
           feature2 subfeature1  1.190110 -1.365895
                    subfeature2  0.720956 -1.325562
cat2       feature1 subfeature1  1.856932       NaN
                    subfeature2 -1.354258 -0.740473
           feature2 subfeature1  0.234075 -1.362235
                    subfeature2  0.013875  1.309564
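Note that the filter above uses a threshold of 0 and signed differences. A variant closer to what the question asks for (threshold 1, absolute differences, NaN pairs ignored) might look like the sketch below. The helper name exceeds_threshold is hypothetical, and the data is retyped from the table in the question rather than regenerated from the seed:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["cat1", "cat2", "cat3"], ["feature1", "feature2"], ["subfeature1", "subfeature2"]],
    names=["categories", "features", "subfeatures"],
)
values = np.array([
    [-0.224487, -0.227524], [-0.591399, -0.799228],  # cat1 feature1
    [1.190110, -1.365895], [0.720956, -1.325562],    # cat1 feature2
    [1.856932, np.nan], [-1.354258, -0.740473],      # cat2 feature1
    [0.234075, -1.362235], [0.013875, 1.309564],     # cat2 feature2
    [np.nan, np.nan], [-1.260408, 1.559721],         # cat3 feature1
    [0.419246, 0.084386], [0.969270, 1.493417],      # cat3 feature2
])
df = pd.DataFrame(values, index=index)

threshold = 1  # value taken from the question's example

def exceeds_threshold(group):
    """Return one scalar bool per (category, feature) group."""
    sub1 = group.xs("subfeature1", level="subfeatures")
    sub2 = group.xs("subfeature2", level="subfeatures")
    # DataFrame.max() skips NaN by default; an all-NaN pair yields NaN,
    # and NaN > threshold evaluates to False
    largest = (sub1 - sub2).abs().max().max()
    return bool(largest > threshold)

kept = df.groupby(level=["categories", "features"]).filter(exceeds_threshold)
categories = kept.index.get_level_values("categories").unique().tolist()
print(categories)  # ['cat2', 'cat3']
```

Returning a single bool per group keeps the filter's decision unambiguous, and the pandas reductions take care of the NaN entries without any special casing.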

Here is a useful debugging tool for this kind of thing:

def f(x):
    print(x)
    return (x.xs(......)) # e.g. the filter from above

df.groupby(.....).filter(f)
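If only the list of top-level categories is needed, the group-wise filter can be skipped altogether. A minimal sketch, assuming the question's threshold of 1 and absolute differences; the DataFrame is rebuilt from the values printed in the question, and the names sub1, sub2, pair_exceeds and hits are made up for this example:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["cat1", "cat2", "cat3"], ["feature1", "feature2"], ["subfeature1", "subfeature2"]],
    names=["categories", "features", "subfeatures"],
)
values = np.array([
    [-0.224487, -0.227524], [-0.591399, -0.799228],  # cat1 feature1
    [1.190110, -1.365895], [0.720956, -1.325562],    # cat1 feature2
    [1.856932, np.nan], [-1.354258, -0.740473],      # cat2 feature1
    [0.234075, -1.362235], [0.013875, 1.309564],     # cat2 feature2
    [np.nan, np.nan], [-1.260408, 1.559721],         # cat3 feature1
    [0.419246, 0.084386], [0.969270, 1.493417],      # cat3 feature2
])
df = pd.DataFrame(values, index=index)

threshold = 1  # value taken from the question's example

# Two frames indexed by (categories, features), one per subfeature
sub1 = df.xs("subfeature1", level="subfeatures")
sub2 = df.xs("subfeature2", level="subfeatures")

# Element-wise comparison; NaN differences compare as False
pair_exceeds = ((sub1 - sub2).abs() > threshold).any(axis=1)

# A category is a hit if any of its features has an exceeding pair
hits = pair_exceeds.groupby(level="categories").any()
result = hits[hits].index.tolist()
print(result)  # ['cat2', 'cat3']
```

This does the subtraction and comparison once over the whole frame instead of once per group, which is usually what vectorizing means in pandas.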
