Fastest way to count occurrences of values in multi-index dataframes

import pandas as pd

idx = pd.MultiIndex.from_product([[0,1],[0,1,2]], names=['index_1','index_2'])
col = ['column_1', 'column_2']

values_list_a = [[1,2],[2,2],[2,1],[-8,1],[2,0],[2,1]]
DFA = pd.DataFrame(values_list_a, idx, col)

DFA:
                 column_1  column_2
index_1 index_2                    
0       0               1         2
        1               2         2
        2               2         1
1       0              -8         1
        1               2         0
        2               2         1

values_list_b = [[2,2],[0,1],[2,2],[2,2],[1,0],[1,2]]
DFB = pd.DataFrame(values_list_b, idx, col)

DFB:
                 column_1  column_2
index_1 index_2                    
0       0               2         2
        1               0         1
        2               2         2
1       0               2         2
        1               1         0
        2               1         2

DFA:
                 column_1  column_2  counts
index_1 index_2                            
0       0               1         2       1
        1               2         2       2
        2               2         1       1
1       0              -8         1       0
        1               2         0       1
        2               2         1       1

DFB:
                 column_1  column_2  counts
index_1 index_2                            
0       0               2         2       2
        1               0         1       0
        2               2         2       2
1       0               2         2       2
        1               1         0       0
        2               1         2       1
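The counts column in these tables is consistent with "number of values greater than 1 in each row". That rule is inferred from the numbers and from the answers below, not stated in the question, so treat the following as a minimal sketch under that assumption:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]], names=["index_1", "index_2"])
col = ["column_1", "column_2"]
DFA = pd.DataFrame([[1, 2], [2, 2], [2, 1], [-8, 1], [2, 0], [2, 1]], idx, col)
DFB = pd.DataFrame([[2, 2], [0, 1], [2, 2], [2, 2], [1, 0], [1, 2]], idx, col)

# Assumption: "counts" = how many values in a row are greater than 1.
# gt(1) gives a boolean frame; summing along axis=1 counts the True cells per row.
counts_a = DFA.gt(1).sum(axis=1)
counts_b = DFB.gt(1).sum(axis=1)

print(DFA.assign(counts=counts_a))
print(DFB.assign(counts=counts_b))
```

The comparison and the row sum are both vectorized, so no Python-level loop over rows is needed.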

DFC:
                         column_1  column_2  counts
index_0 index_1 index_2                            
0       0       0               1         2       1
                2               2         1       1
        1       2               2         1       1
1       0       0               2         2       2
                2               2         2       2
        1       2               1         2       1

3 answers

User

Answer 1 · edited 2024-04-25 09:48:50

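grouped = df.groupby(['index_0', 'index_1', 'index_2'])

Now you need to use the SQL equivalent:

filtered = grouped.filter(lambda x: len(x.column_1) > 2)
filtered.count()

This is only the concept; I don't understand what you want to filter. Note that x is a group, so you need to apply an operation to it (len, set, values, etc.).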
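To make the idea concrete, here is one way the groupby-then-filter approach could be applied to the question's data. Several things here are assumptions rather than part of the answer: the frames are stacked with `pd.concat` under an outer level named `index_0`, the grouping is on `index_1` and `index_2` only (grouping on all three levels would give one-row groups), and a group is kept when every row in it has at least one value greater than 1.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]], names=["index_1", "index_2"])
col = ["column_1", "column_2"]
DFA = pd.DataFrame([[1, 2], [2, 2], [2, 1], [-8, 1], [2, 0], [2, 1]], idx, col)
DFB = pd.DataFrame([[2, 2], [0, 1], [2, 2], [2, 2], [1, 0], [1, 2]], idx, col)

# Stack both frames under a new outer level (labels 0 and 1 are illustrative).
df = pd.concat({0: DFA, 1: DFB}, names=["index_0"])
df["counts"] = df[col].gt(1).sum(axis=1)

# Each group holds the DFA row and the DFB row sharing (index_1, index_2).
# The lambda receives the whole group and must return a single boolean.
kept = df.groupby(level=["index_1", "index_2"]).filter(
    lambda g: bool(g["counts"].gt(0).all())
)

print(kept)
```

`filter` returns the surviving rows of the original frame, so the three-level index is preserved.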

User

Answer 2 · edited 2024-04-25 09:48:50

`pd.concat`, then `magic`

def f(d, thresh=1):
    c = d.gt(thresh).sum(1)
    mask = c.gt(0).groupby(level=[1, 2]).transform('all')
    return d.assign(counts=c)[mask]

pd.concat({'bar': DFA, 'foo': DFB}, names=['index_0']).pipe(f)

                         column_1  column_2  counts
index_0 index_1 index_2                            
bar     0       0               1         2       1
                2               2         1       1
        1       2               2         1       1
foo     0       0               2         2       2
                2               2         2       2
        1       2               1         2       1

With comments

def f(d, thresh=1):
    # count how many are greater than a threshold `thresh` per row
    c = d.gt(thresh).sum(1)

    # find where `counts` are > `0` for both dataframes
    # conveniently dropped into one dataframe so we can do
    # this nifty `groupby` trick
    mask = c.gt(0).groupby(level=[1, 2]).transform('all')
    #                                    \   -/
    #                         This is key to broadcasting over 
    #                         original index rather than collapsing
    #                         over the index levels we grouped by

    #     create a new column named `counts`
    #         /      \ 
    return d.assign(counts=c)[mask]
    #                         \ /
    #                    filter with boolean mask

# Use concat to smash two dataframes together into one
pd.concat({'bar': DFA, 'foo': DFB}, names=['index_0']).pipe(f)
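The comment about `transform` broadcasting rather than collapsing can be checked directly. The sketch below uses the question's data and compares a plain `.all()` aggregation with `.transform('all')`; the variable names are illustrative only.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]], names=["index_1", "index_2"])
col = ["column_1", "column_2"]
DFA = pd.DataFrame([[1, 2], [2, 2], [2, 1], [-8, 1], [2, 0], [2, 1]], idx, col)
DFB = pd.DataFrame([[2, 2], [0, 1], [2, 2], [2, 2], [1, 0], [1, 2]], idx, col)

stacked = pd.concat({"bar": DFA, "foo": DFB}, names=["index_0"])

# True where a row has at least one value greater than 1.
has_hit = stacked.gt(1).sum(axis=1).gt(0)

# Aggregating collapses to one value per (index_1, index_2) pair.
collapsed = has_hit.groupby(level=[1, 2]).all()

# transform returns one value per original row, aligned to the stacked index,
# so it can be used directly as a boolean mask on `stacked`.
broadcast = has_hit.groupby(level=[1, 2]).transform("all")

print(len(collapsed), len(broadcast))  # 6 12
print(stacked[broadcast])
```

Because `broadcast` shares the index of `stacked`, no re-alignment or join is needed before filtering.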

User

Answer 3 · edited 2024-04-25 09:48:50

Use a filter, `.any()`, and `pd.merge()`

Recreate the dataframes:

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([[0,1],[0,1,2]], names=['one', 'two'])
columns = ['columns_1', 'columns_2']

DFA = pd.DataFrame(np.random.randint(-1,3, size=[6,2]), idx, columns)
DFB = pd.DataFrame(np.random.randint(-1,3, size=[6,2]), idx, columns)

print(DFA)

             columns_1  columns_2
one two                      
0   0           -1          2
    1            2         -1
    2           -1          0
1   0            1          2
    1            0          0
    2           -1         -1



print(DFB)

             columns_1  columns_2
one two                      
0   0            2         -1
    1            1          2
    2            2          1
1   0            0          0
    1           -1          2
    2            1         -1

In this instance, filter the dataframes down to the rows that contain a value greater than 1.

DFA = DFA.loc[(DFA>1).any(bool_only=True, axis=1),:]
DFB = DFB.loc[(DFB>1).any(bool_only=True, axis=1),:]

print(DFA)

             columns_1  columns_2
one two                      
0   0           -1          2
    1            2         -1
1   0            1          2

print(DFB)

             columns_1  columns_2
one two                      
0   0            2         -1
    1            1          2
    2            2          1
1   1           -1          2

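Merge the two together. Using an outer join gets you closer. I'm not sure whether you want the source broken out into the index, but first level 0, [0, 1] is DFA.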

         columns_1_x  columns_2_x  columns_1_y  columns_2_y
one two                                                    
0   0           -1.0          2.0          2.0         -1.0
    1            2.0         -1.0          1.0          2.0
1   0            1.0          2.0          NaN          NaN
0   2            NaN          NaN          2.0          1.0
1   1            NaN          NaN         -1.0          2.0
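The merge call itself is not shown in the answer. A minimal sketch, assuming it was an outer `pd.merge` on the index with the default `_x`/`_y` suffixes, and using the printed values above in place of the random draws:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]], names=["one", "two"])
columns = ["columns_1", "columns_2"]

# Fixed values copied from the printed frames, so the example is reproducible.
DFA = pd.DataFrame([[-1, 2], [2, -1], [-1, 0], [1, 2], [0, 0], [-1, -1]], idx, columns)
DFB = pd.DataFrame([[2, -1], [1, 2], [2, 1], [0, 0], [-1, 2], [1, -1]], idx, columns)

# Keep only rows with at least one value greater than 1.
DFA = DFA.loc[(DFA > 1).any(axis=1), :]
DFB = DFB.loc[(DFB > 1).any(axis=1), :]

# Outer join on the MultiIndex: rows present in only one frame get NaN
# in the other frame's columns (which is why the values become floats).
merged = pd.merge(DFA, DFB, how="outer", left_index=True, right_index=True)

print(merged)
```

Row order in the result may differ between pandas versions. Passing `indicator=True` to `pd.merge` would add a `_merge` column recording which frame each row came from.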
