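Pandas: set the first row of a column to 0 in a MultiIndex DataFrame

5 votes

3 answers

3158 views

Asked 2025-04-18 17:01

I'm having trouble working with grouped objects in pandas. Specifically, I want to set the first row of one column to 0 within each group while leaving the other rows unchanged.

For example: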

import pandas as pd
from numpy import random as rand  # provides rand.randn and rand.rand

df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 2,
                   'B': rand.randn(6),
                   'C': rand.rand(6) > .5})

This gives me:

    A         B      C
0  foo  1.624345  False
1  bar -0.611756   True
2  baz -0.528172  False
3  foo -1.072969   True
4  bar  0.865408  False
5  baz -2.301539   True

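Then I group by A and sort each group by B:

# DataFrame.sort was removed in pandas 0.20; sort_values is its replacement
f = lambda x: x.sort_values('B', ascending=True)
sort_df = df.groupby('A', sort=False).apply(f)

Which gives this result: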

         A         B      C
    A                          
foo 3  foo -1.072969   True
    0  foo  1.624345  False
bar 1  bar -0.611756   True
    4  bar  0.865408  False
baz 5  baz -2.301539   True
    2  baz -0.528172  False

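Now that I have these groups, I want to set the first element of column B in each group to 0. How can I do that?

Something like this works, but I'd like a more efficient way: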

for group in sort_df.groupby(level=0).groups:
    sort_df.loc[(group,sort_df.loc[group].index[0]),'B']=0

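Any help is greatly appreciated!

Tags: data processing, pandas, DataFrame, MultiIndex, optimization, groupby

3 Answers

You're already using a function to do part of the work, so why not put this step in there as well?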

Instead of

f = lambda x: ...

just use:

def f(x):
    x = x.sort_values('B', ascending=True)
    x.iloc[0, 1] = 0  # row 0, column 1 ('B') of the sorted group
    return x

sort_df = df.groupby('A',sort=False).apply(f)
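
The positional `x.iloc[0, 1]` above assumes that B is the second column of each group, which depends on the column layout and on the pandas version (newer releases may not pass the grouping column to the function). A minimal sketch of the same idea that addresses B by label instead; the data values and the name `sort_and_zero_first` are made up for illustration:

```python
import pandas as pd

# Illustrative values, not the data from the question.
df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 2,
                   'B': [1.5, -0.6, -0.5, -1.0, 0.9, -2.3],
                   'C': [False, True, False, True, False, True]})

def sort_and_zero_first(x):
    # Sort the group by B, then zero B in the first (smallest) row.
    x = x.sort_values('B', ascending=True)
    x.loc[x.index[0], 'B'] = 0  # label-based, so column order does not matter
    return x

sort_df = df.groupby('A', sort=False, group_keys=True).apply(sort_and_zero_first)
print(sort_df)
```

Because `sort_values` returns a new object, the assignment changes only the sorted copy that is returned, not the original `df`.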

Answered 2025-04-18 by Python大师


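Is this what you're looking for?

sort_df.iloc[::2, sort_df.columns.get_loc('B')] = 0  # every 2nd row of B, without chained assignment

For example: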

sort_df

          A         B      C
A                          
foo 0  foo  0.192347   True
    3  foo  0.295985   True
bar 1  bar  0.012400  False
    4  bar  0.628488   True
baz 5  baz  0.180934   True
    2  baz  0.328735   True


sort_df.iloc[::2, sort_df.columns.get_loc('B')] = 0

sort_df
         A         B      C
A                          
foo 0  foo  0.000000   True
    3  foo  0.295985   True
bar 1  bar  0.000000  False
    4  bar  0.628488   True
baz 5  baz  0.000000   True
    2  baz  0.328735   True

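This only works when every group has the same number of rows, i.e. when all(df.A.value_counts() == df.A.value_counts().iloc[0]) is true, and the slice step (2 here) equals that group size.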
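
When the groups differ in size, a fixed-step slice no longer lines up with the group boundaries. One alternative is to build a boolean mask with `cumcount()`, which numbers the rows inside each group. The sketch below is an assumption-laden illustration: it uses a flat frame with made-up values rather than the MultiIndex result above (for that frame, `groupby(level=0)` would play the same role).

```python
import pandas as pd

# Illustrative data with unequal group sizes (not from the question).
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'baz', 'baz'],
                   'B': [0.3, 0.1, 0.2, 0.5, 0.9, 0.4]})

# Sort so that rows of the same group are adjacent and ordered by B.
sort_df = df.sort_values(['A', 'B'])

# cumcount() numbers rows within each group 0, 1, 2, ...;
# the rows numbered 0 are the first row of each group.
is_first = sort_df.groupby('A').cumcount() == 0
sort_df.loc[is_first, 'B'] = 0
print(sort_df)
```

The mask is indexed like `sort_df`, so the assignment touches exactly one row per group regardless of how many rows each group has.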

Answered 2025-04-18 by Python大师


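Here is a vectorized way to do this, which is much faster.

In [26]: pd.set_option('display.max_rows', 10)

First create a sample DataFrame (here with plain columns A, B and C rather than an actual MultiIndex) and sort it by A (an arbitrary choice of ordering).

In [27]: N = 100000  # matches the 100000 rows shown below
    ...: df = pd.DataFrame(dict(A=np.random.randint(0, 100, size=N), B=np.random.randint(0, 100, size=N), C=np.random.randn(N))).sort_values('A')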

In [28]: df
Out[28]: 
        A   B         C
61474   0  40 -0.731163
82386   0  18 -1.316136
63372   0  28  0.112819
49666   0  13 -0.649491
31631   0  89 -0.835208
...    ..  ..       ...
42178  99  28 -0.029800
59529  99  31 -0.733588
13503  99  60  0.672754
20961  99  18  0.252714
31882  99  22  0.083340

[100000 rows x 3 columns]

Next, reset the index so that the original index values are captured in a column, then take the first row of each group of B.

In [29]: grouped = df.reset_index().groupby('B').first()

In [30]: grouped
Out[30]: 
    index  A         C
B                     
0   26576  0  1.123605
1   38311  0  0.128966
2   45135  0 -0.039886
3   38434  0 -1.284028
4   82088  0 -0.747440
..    ... ..       ...
95  82620  0 -1.197625
96  63278  0 -0.625400
97  23226  0 -0.497609
98  82520  0 -0.828773
99  35902  0 -0.199752

[100 rows x 3 columns]

This gives you a set of index labels that you can use to select rows of the DataFrame.

In [31]: df.loc[grouped['index']] = 0

In [32]: df
Out[32]: 
        A   B         C
61474   0   0  0.000000
82386   0   0  0.000000
63372   0   0  0.000000
49666   0   0  0.000000
31631   0   0  0.000000
...    ..  ..       ...
42178  99  28 -0.029800
59529  99  31 -0.733588
13503  99  60  0.672754
20961  99  18  0.252714
31882  99  22  0.083340

[100000 rows x 3 columns]

If you want, you can sort by the index to get back the original row order:

In [33]: df.sort_index()
Out[33]: 
        A   B         C
0      40  56 -1.223941
1      24  77 -0.039775
2       7  83  0.741013
3      48  38 -1.795053
4      62  15 -2.734968
...    ..  ..       ...
99995  20  25 -0.286300
99996  27  21 -0.120430
99997   0   4  0.607524
99998  38  31  0.717069
99999  33  63 -0.226888

[100000 rows x 3 columns]

Timing for this approach:

In [34]: %timeit df.loc[grouped['index']] = 0
100 loops, best of 3: 7.33 ms per loop

Timing for your original approach:

In [37]: %timeit df.groupby('A',sort=False).apply(f)
10 loops, best of 3: 109 ms per loop

The performance gap becomes more pronounced as the number of groups grows.
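
The transcript above groups by B and zeroes whole rows. Adapted to the original question (group by A, zero only B in the row that would come first after an ascending sort), the same vectorized idea might look like the sketch below. The seed, frame size and variable names are assumptions made for illustration; `idxmin()` is used because the first row after sorting a group by B ascending is the row with the smallest B.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
N = 1000
df = pd.DataFrame({'A': rng.integers(0, 10, size=N),
                   'B': rng.standard_normal(N)})

# For each group of A, the index label of the row with the smallest B.
first_labels = df.groupby('A')['B'].idxmin()

# Zero only column B in those rows; all other rows stay unchanged.
out = df.copy()
out.loc[first_labels.to_numpy(), 'B'] = 0
```

No per-group Python function is called here, which is the property the timings above attribute the speedup to.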

Answered 2025-04-18 by Python大师

