Pandas equivalent of resample for an integer index

3 Answers

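User

#1 · edited 2024-05-12 19:19:12

@piSquared's solution is really good, but I don't like hand-picking the index when reindexing.

This version also works for any kind of downsampling (float indexes included) and automatically uses the mean of the index values within each range as the new label: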

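import numpy as np
import pandas as pd

# 20 rows with a random float index in [0, 30)
df = pd.DataFrame(index = np.random.rand(20)*30, data=np.random.rand(20, 2), columns=['A', 'B'])
df.index.name = 'crazy_index'

# group label: which bin of width 10 each index value falls into
s = (df.index.to_series() / 10).astype(int)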
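
To see what the grouping key looks like, here is a minimal sketch with a small hand-picked float index. The frame `demo` and its values are made up for illustration and are not part of the answer. Dividing by the bin width 10 and truncating maps each index value to its bin number:

```python
import pandas as pd

# Hypothetical, hand-picked float index so the result is easy to check by eye.
demo = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]}, index=[2.5, 7.5, 12.0, 28.0])

# Same construction as in the answer: bin width 10, truncate to an integer label.
labels = (demo.index.to_series() / 10).astype(int)
print(labels.tolist())  # [0, 0, 1, 2]
```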

Now you are free to choose which function to compute in each subgroup:

# calculate std() in each group
df.groupby(s).std().set_index( s.groupby(s).apply(lambda x: np.mean(x.index)) )

                    A         B
crazy_index
3.667539     0.276986  0.317642
14.275074    0.248700  0.372551
25.054042    0.254860  0.297586

# calculate median() in each group
df.groupby(s).median().set_index( s.groupby(s).apply(lambda x: np.mean(x.index)) )
                    A         B
crazy_index
3.667539     0.454654  0.521649
14.275074    0.451265  0.490125
25.054042    0.489326  0.622781
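
The pattern can be wrapped in a small helper. This is only a sketch: `downsample_by_width` and its parameters are names invented here, and it uses `np.floor` rather than `astype(int)` to compute the bin number, which is my substitution. Like the answer, it labels each group with the mean of the original index values in that group.

```python
import numpy as np
import pandas as pd

def downsample_by_width(df, width, func="mean"):
    """Group rows into bins of `width` along a numeric index and aggregate.

    Hypothetical helper: each output row is labelled with the mean of the
    original index values that fell into its bin.
    """
    # Bin number for every row; np.floor keeps bins contiguous below zero too.
    key = np.floor(df.index.to_numpy() / width).astype(int)
    out = df.groupby(key).agg(func)
    # Mean index value per bin, in the same sorted bin order as `out`.
    positions = pd.Series(df.index.to_numpy(), dtype=float).groupby(key).mean()
    out.index = positions.to_numpy()
    out.index.name = df.index.name
    return out

# Made-up data: two rows land in bin 0, one each in bins 1 and 2.
demo = pd.DataFrame({"A": [1.0, 3.0, 5.0, 7.0]}, index=[2.0, 8.0, 12.0, 28.0])
result = downsample_by_width(demo, 10, "mean")
print(result)
```

Passing `"std"` or `"median"` as `func` gives the other two variants shown above.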

Edit: there were some bugs in the indexing; it works correctly now.

User

#2 · edited 2024-05-12 19:19:12

Alternatively, here is something you can do:

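import pandas as pd

def resample(df, rule, how=None, **kwargs):
    # Default to the mean when no aggregation is given.
    if how is None:
        how = "mean"

    if isinstance(df.index, pd.DatetimeIndex) and isinstance(rule, str):
        # resample() no longer accepts `how`; aggregate on the resampler instead.
        return df.resample(rule, **kwargs).agg(how)
    else:
        # Integer index, assumed sorted. The last edge must lie beyond the
        # last index value, otherwise trailing rows fall outside every bin.
        edges = range(df.index[0], df.index[-1] + rule + 1, rule)
        idx, bins = pd.cut(df.index, edges, right=False, retbins=True)
        # observed=False keeps empty bins so the result lines up with bins[:-1].
        aux = df.groupby(idx, observed=False).agg(how)
        # Label each row with the left edge of its bin.
        aux = aux.set_index(bins[:-1])
        return aux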
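
For the integer branch, the following sketch shows what `pd.cut` with `right=False` does on a small integer-indexed frame. The frame `demo`, the explicit edges and the bin width of 5 are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({"A": range(10)})  # integer index 0..9

# Left-closed bins [0, 5) and [5, 10); the last edge must exceed the last index.
cats, edges = pd.cut(demo.index, [0, 5, 10], right=False, retbins=True)

# observed=False keeps every bin, even one that happens to be empty.
means = demo.groupby(cats, observed=False)["A"].mean()
means.index = edges[:-1]  # label each bin by its left edge
print(means)
```

Each row of the result is labelled by the start of its bin (0 and 5 here), which is the opposite convention from the end-of-bin labels used in the next answer.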

User

#3 · edited 2024-05-12 19:19:12

Setup

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(20, 2), columns=['A', 'B'])

You need to create the labels to group by yourself. I'd use:

(df.index.to_series() / 5).astype(int)

to get a series of values like [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ...], and then pass that to groupby.

You also need to specify the index for the new dataframe. I'd use:

df.index[4::5]

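This starts at the 5th position (hence the 4) and takes every 5th position of the current index from there. It looks like [4, 9, 14, 19]. I could have used df.index[::5] to get the starting positions instead, but I chose the ending positions.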
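
The two slicing choices can be compared side by side. This is a small sketch on a bare `RangeIndex` of length 20, which stands in for the default index of the frame built in the setup:

```python
import pandas as pd

idx = pd.RangeIndex(20)

end_labels = idx[4::5]    # last position of each block of 5
start_labels = idx[::5]   # first position of each block of 5

print(list(end_labels))    # [4, 9, 14, 19]
print(list(start_labels))  # [0, 5, 10, 15]
```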

Solution

# assign as variable because I'm going to use it more than once.
s = (df.index.to_series() / 5).astype(int)

df.groupby(s).std().set_index(s.index[4::5])

It looks like:

           A         B
4   0.198019  0.320451
9   0.329750  0.408232
14  0.293297  0.223991
19  0.095633  0.376390

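Other considerations

This is the equivalent of downsampling. We have not yet addressed upsampling.

To go back up to a more frequent index on the dataframe, you can use reindex like so: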

# assign what we've done above to df_down
df_down = df.groupby(s).std().set_index(s.index[4::5])

df_up = df_down.reindex(range(20)).bfill()

It looks like:

           A         B
0   0.198019  0.320451
1   0.198019  0.320451
2   0.198019  0.320451
3   0.198019  0.320451
4   0.198019  0.320451
5   0.329750  0.408232
6   0.329750  0.408232
7   0.329750  0.408232
8   0.329750  0.408232
9   0.329750  0.408232
10  0.293297  0.223991
11  0.293297  0.223991
12  0.293297  0.223991
13  0.293297  0.223991
14  0.293297  0.223991
15  0.095633  0.376390
16  0.095633  0.376390
17  0.095633  0.376390
18  0.095633  0.376390
19  0.095633  0.376390
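
The reindex-then-backfill step is easier to follow on a tiny frame. In this sketch the frames and their values are invented. Backfill suits end-of-block labels, while forward fill is the natural match if the blocks are labelled by their starting positions:

```python
import pandas as pd

# Hypothetical downsampled frame labelled by the END of each block of 3.
down_end = pd.DataFrame({"A": [10.0, 20.0]}, index=[2, 5])
# Reindexing to the full range creates NaN rows; bfill copies each
# block's value backwards from its end label.
up_from_end = down_end.reindex(range(6)).bfill()

# The same data labelled by the START of each block needs ffill instead.
down_start = pd.DataFrame({"A": [10.0, 20.0]}, index=[0, 3])
up_from_start = down_start.reindex(range(6)).ffill()

print(up_from_end["A"].tolist())    # [10.0, 10.0, 10.0, 20.0, 20.0, 20.0]
print(up_from_start["A"].tolist())  # [10.0, 10.0, 10.0, 20.0, 20.0, 20.0]
```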

We can also reindex with other things, such as range(0, 20, 2), to upsample onto the even integer indices.
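
One thing to watch with a target such as range(0, 20, 2): reindex only keeps labels that appear in the new index, so the odd labels 9 and 19 are dropped before any fill happens. The sketch below uses made-up values and shows one possible workaround, which is my suggestion rather than part of the answer: fill on the full range first, then select the even labels.

```python
import pandas as pd

# Made-up downsampled frame with end-of-block labels, as in the answer.
down = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]}, index=[4, 9, 14, 19])

# Direct reindex: only labels 4 and 14 survive, 9 and 19 are lost.
even_direct = down.reindex(range(0, 20, 2))
print(even_direct["A"].dropna().tolist())  # [1.0, 3.0]

# Workaround: backfill on every integer first, then keep the even labels.
even_filled = down.reindex(range(20)).bfill().loc[list(range(0, 20, 2))]
print(even_filled["A"].tolist())
# [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0]
```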
