<p>Update with additional information</p>
<p>Data:</p>
<pre><code>import pandas as pd
import numpy as np
df = pd.DataFrame({'date':['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01']*4,
'country_id':[1]*8+[2]*8,
'company_id':[1]*4+[2]*4+[1]*4+[2]*4,
'value':[1, 0, 2, np.nan, 1, 2, np.nan, np.nan, 3, 0, 2, np.nan, 1, 2, np.nan, np.nan]})
</code></pre>
<p>Create a rolling sum within each <code>country_id</code>:</p>
<pre><code>df['rolling_sum'] = df.groupby('country_id').apply(lambda x: x.value.rolling(window=2, min_periods=1).sum()).reset_index(drop=True)
</code></pre>
<p>Create a rolling count of non-null values within each <code>country_id</code>:</p>
<pre><code>df['sum_records'] = df.groupby('country_id').apply(lambda x: x.value.rolling(window=2, min_periods=1).count()).reset_index(drop=True)
</code></pre>
<p>Now group by <code>country_id</code> and <code>date</code>, sum the rolling sums, and divide by the sum of the rolling counts:</p>
<pre><code>summarized_df = df.groupby(['country_id', 'date']).apply(lambda x: x.rolling_sum.sum()/x.sum_records.sum()).reset_index()
country_id date
1 2018-01-01 1.000000
2018-02-01 1.000000
2018-03-01 1.333333
2018-04-01 2.000000
2 2018-01-01 2.000000
2018-02-01 1.500000
2018-03-01 1.333333
2018-04-01 2.000000
</code></pre>
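<p>The three steps above can also be sketched with <code>groupby(...).transform</code>, which keeps the original row index and so needs no <code>reset_index(drop=True)</code>. This is only an illustrative variant, not the approach used above: it counts non-null values with <code>notna()</code> instead of <code>rolling(...).count()</code>, and the names <code>by_country</code>, <code>totals</code>, <code>summarized</code> and <code>rolling_mean</code> are made up for the example.</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01'] * 4,
    'country_id': [1] * 8 + [2] * 8,
    'company_id': [1] * 4 + [2] * 4 + [1] * 4 + [2] * 4,
    'value': [1, 0, 2, np.nan, 1, 2, np.nan, np.nan,
              3, 0, 2, np.nan, 1, 2, np.nan, np.nan],
})

by_country = df.groupby('country_id')['value']

# Rolling sum of the observed values, restarted for each country.
df['rolling_sum'] = by_country.transform(
    lambda s: s.rolling(window=2, min_periods=1).sum()
)

# Number of non-null values in each window (0 when the window is all NaN).
df['sum_records'] = by_country.transform(
    lambda s: s.notna().astype(float).rolling(window=2, min_periods=1).sum()
)

# Pool sums and counts per (country_id, date); sum() skips NaN.
totals = df.groupby(['country_id', 'date'])[['rolling_sum', 'sum_records']].sum()
summarized = (
    (totals['rolling_sum'] / totals['sum_records'])
    .rename('rolling_mean')
    .reset_index()
)

print(summarized)
```

<p>On the sample data this should give the same eight values as the table above, one row per <code>country_id</code> and <code>date</code>.</p>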
<p>Let's look at this in more detail. Because we group by country id, we can subset to a single country id and walk through the method on it.</p>
<p>Take just one chunk, say <code>country_id == 1</code>:</p>
<pre><code>df2 = df[df['country_id'] == 1]
date country_id company_id value
0 2018-01-01 1 1 1.0
1 2018-02-01 1 1 0.0
2 2018-03-01 1 1 2.0
3 2018-04-01 1 1 NaN
4 2018-01-01 1 2 1.0
5 2018-02-01 1 2 2.0
6 2018-03-01 1 2 NaN
7 2018-04-01 1 2 NaN
</code></pre>
<p>If we want the rolling mean of this subset, we can do:</p>
<pre><code>df2.value.rolling(window=2, min_periods=1).mean()
0 1.0
1 0.5
2 1.0
3 2.0
4 1.0
5 1.5
6 2.0
7 NaN
</code></pre>
<p>Here are the values of the <code>country_id == 1</code> subset and how each one relates to its rolling mean:</p>
<pre><code>0 1.0 = (1)/1 = 1
1 0.0 = (0 + 1)/2 = 0.5
2 2.0 = (2 + 0)/2 = 1
3 NaN = (NaN + 2)/1 = 2
4 1.0 = (1 + NaN)/1 = 1
5 2.0 = (2 + 1)/2 = 1.5
6 NaN = (NaN + 2)/1 = 2
7 NaN = (NaN + NaN)/0 = NaN
</code></pre>
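<p>The arithmetic above amounts to "sum of the non-null values in the window, divided by how many there are". A minimal sketch that rebuilds the rolling mean that way and compares it with <code>rolling(...).mean()</code>. The variable names are illustrative, and the series simply repeats the <code>country_id == 1</code> values:</p>

```python
import numpy as np
import pandas as pd

# The eight `value` entries for country_id == 1, in row order.
values = pd.Series([1, 0, 2, np.nan, 1, 2, np.nan, np.nan])

# Built-in rolling mean: NaN entries are ignored inside each window.
rolling_mean = values.rolling(window=2, min_periods=1).mean()

# Rebuild it by hand from a window sum and a window count.
window_sum = values.fillna(0).rolling(window=2, min_periods=1).sum()
window_count = values.notna().astype(float).rolling(window=2, min_periods=1).sum()
manual_mean = window_sum / window_count  # 0 / 0 gives NaN for the all-NaN window

print(pd.DataFrame({
    'value': values,
    'sum': window_sum,
    'count': window_count,
    'manual_mean': manual_mean,
    'rolling_mean': rolling_mean,
}))
```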
<p>That is how we get the rolling mean for a single <code>country_id</code> group.</p>
<p><em>If</em> we instead grouped by country id first and then by date, a single group would look like this:</p>
<pre><code>df3 = df[(df['country_id'] == 1) & (df['date'] == '2018-03-01')]
df3.value
2 2.0
6 NaN
df3.value.rolling(window=2, min_periods=1).mean()
2 2.0
6 2.0
df3.value
2 2.0 = (2)/1 = 2
6 NaN = (NaN + 2)/1 = 2
</code></pre>
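<p>The problem here is that you want the rolling mean computed <em>first</em> by <code>country_id</code>, not grouped by <code>date</code>. <em>Then</em>, once you have the rolling values per country, you need to combine them for each date. If we take the rolling means and then simply average those means, the result is wrong, because the windows do not all contain the same number of non-null values.</p>
<p>So let's go back to the original rolling mean we created for <code>country_id == 1</code> and look at the dates:</p>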
<pre><code>2018-01-01 1.0 = (1)/1 = 1
2018-02-01 0.0 = (0 + 1)/2 = 0.5
2018-03-01 2.0 = (2 + 0)/2 = 1
2018-04-01 NaN = (NaN + 2)/1 = 2
2018-01-01 1.0 = (1 + NaN)/1 = 1
2018-02-01 2.0 = (2 + 1)/2 = 1.5
2018-03-01 NaN = (NaN + 2)/1 = 2
2018-04-01 NaN = (NaN + NaN)/0 = NaN
</code></pre>
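<p>The tricky part is that at this point we cannot simply average the rolling means together. For example, the two rolling means for 2018-03-01 are 1 and 2, which add up to 3; dividing by 2 gives 1.5. But those two windows hold three non-null values in total (2, 0 and 2), so the correct result is 4/3 ≈ 1.333.</p>
<p>We have to sum the rolling sums first, and then divide by the number of records.</p>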
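<p>To make the difference concrete, here is a small sketch for <code>country_id == 1</code> on 2018-03-01 that computes both the mean of the rolling means and the pooled sum divided by the pooled count. The frame <code>df1</code> and the names <code>mean_of_means</code> and <code>pooled_mean</code> are made up for this illustration:</p>

```python
import numpy as np
import pandas as pd

# country_id == 1 only: four dates for company 1, then four for company 2.
df1 = pd.DataFrame({
    'date': ['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01'] * 2,
    'value': [1, 0, 2, np.nan, 1, 2, np.nan, np.nan],
})

df1['rolling_mean'] = df1['value'].rolling(window=2, min_periods=1).mean()
df1['rolling_sum'] = df1['value'].fillna(0).rolling(window=2, min_periods=1).sum()
df1['sum_records'] = (
    df1['value'].notna().astype(float).rolling(window=2, min_periods=1).sum()
)

march = df1[df1['date'] == '2018-03-01']

# Averaging the two rolling means weights both windows equally: (1 + 2) / 2.
mean_of_means = march['rolling_mean'].mean()

# Pooling first weights every non-null value equally: (2 + 2) / (2 + 1).
pooled_mean = march['rolling_sum'].sum() / march['sum_records'].sum()

print('mean of rolling means:', mean_of_means)
print('pooled sum / count:   ', pooled_mean)
```

<p>The pooled figure is the one that matches the 2018-03-01 row for <code>country_id == 1</code> in the summarized result.</p>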