<h3>Dataset</h3>
<p>Starting from the dataset you provided:</p>
<pre><code>import io
from scipy import stats
import pandas as pd
s = """id|usage_day|dow|tow|daily_avg
c96|01/09/2020|Tuesday|week|393.07
c96|02/09/2020|Wednesday|week|10.38
c96|03/09/2020|Thursday|week|429.35
c96|04/09/2020|Friday|week|156.20
c96|05/09/2020|Saturday|weekend|346.22
c96|06/09/2020|Sunday|weekend|106.53
c96|08/09/2020|Tuesday|week|194.74
c96|10/09/2020|Thursday|week|66.30
c96|17/09/2020|Thursday|week|163.84
c96|18/09/2020|Friday|week|261.81
c96|19/09/2020|Saturday|weekend|410.30
c96|20/09/2020|Sunday|weekend|266.28
c96|23/09/2020|Wednesday|week|346.18
c96|24/09/2020|Thursday|week|20.67
c96|25/09/2020|Friday|week|222.23
c96|26/09/2020|Saturday|weekend|449.84
c96|27/09/2020|Sunday|weekend|438.47
c96|28/09/2020|Monday|week|10.44
c96|29/09/2020|Tuesday|week|293.59
c96|30/09/2020|Wednesday|week|194.49"""
df = pd.read_csv(io.StringIO(s), sep='|')
</code></pre>
<p>To make the <code>groupby</code> step easier to follow, I added a second <code>id</code> that holds a copy of the same data:</p>
<pre><code>df2 = df.copy()
df2['id'] = 'c97'
df = pd.concat([df, df2])
</code></pre>
<h3>MCVE</h3>
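<p>You don't need to resort to any explicit loop. Instead, <strong>use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html" rel="nofollow noreferrer"><code>apply</code></a> method, which operates on a frame and also works together with <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html" rel="nofollow noreferrer"><code>groupby</code></a></strong>.</p>
<p>To do so, we define a function that performs the desired test on a dataframe (<code>groupby</code> will call it once for each sub-dataframe corresponding to a combination of grouping keys):</p>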
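<pre><code>def ttest(x):
    # Collect the daily_avg values of each 'tow' bucket into a list
    g = x.groupby('tow').agg({'daily_avg': list})
    # Welch's t-test (equal_var=False): week vs. weekend
    r = stats.ttest_ind(g.loc['week', 'daily_avg'], g.loc['weekend', 'daily_avg'], equal_var=False)
    # Turn the named-tuple result into a Series so apply builds one column per field
    s = {k: getattr(r, k) for k in r._fields}
    return pd.Series(s)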
</code></pre>
<p>Then it is enough to chain <code>apply</code> after the <code>groupby</code> call:</p>
<pre><code>T = df.groupby('id').apply(ttest)
</code></pre>
<p>The result is:</p>
<pre><code>     statistic    pvalue
id
c96  -2.128753  0.059126
c97  -2.128753  0.059126
</code></pre>
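<p>As a sanity check, the numbers for a single <code>id</code> can be reproduced without any grouping. This is only a minimal sketch: the two lists are the <code>daily_avg</code> values of <code>c96</code> copied by hand from the dataset above, and the name <code>check</code> is introduced here for illustration.</p>
<pre><code>import pandas as pd
from scipy import stats

# daily_avg values of id c96, split by the 'tow' column (copied from the dataset)
week = [393.07, 10.38, 429.35, 156.20, 194.74, 66.30, 163.84,
        261.81, 346.18, 20.67, 222.23, 10.44, 293.59, 194.49]
weekend = [346.22, 106.53, 410.30, 266.28, 449.84, 438.47]

# Same call as inside ttest: Welch's t-test, unequal variances allowed
r = stats.ttest_ind(week, weekend, equal_var=False)
check = pd.Series({'statistic': r.statistic, 'pvalue': r.pvalue})
print(check)
</code></pre>
<p>The printed values should agree with the <code>c96</code> row of the table.</p>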
<h3>Refactoring</h3>
<p>Once you see how powerful this approach is, you can refactor the code above into reusable functions, for example:</p>
<pre><code>def ttest(x, y):
    # Welch's t-test (does not assume equal variances)
    return stats.ttest_ind(x, y, equal_var=False)

def apply_test(x, subgroup='tow', value='daily_avg', key1='week', key2='weekend', test=ttest):
    g = x.groupby(subgroup).agg({value: list})
    r = test(g.loc[key1, value], g.loc[key2, value])
    return pd.Series({k: getattr(r, k) for k in r._fields})

# The column and key names below are placeholders for another dataset
T = df.groupby('id').apply(apply_test, subgroup='anotherbucket', key1='experience', key2='reference', value='threshold')
</code></pre>
<p>This lets you swap in a different statistical test and different dataframe columns as needed.</p>
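<p>For instance, here is a hedged sketch of plugging another test into the <code>test</code> parameter. The Mann-Whitney U test is my own choice for illustration, not something the question asked for, and <code>mwu</code> is a hypothetical helper name. The frame is rebuilt from the values above so the snippet runs on its own, and the columns are selected explicitly before <code>apply</code> so the grouping column is not passed to the function.</p>
<pre><code>import pandas as pd
from scipy import stats

# Rebuild a two-id frame from the daily_avg values of the dataset
week = [393.07, 10.38, 429.35, 156.20, 194.74, 66.30, 163.84,
        261.81, 346.18, 20.67, 222.23, 10.44, 293.59, 194.49]
weekend = [346.22, 106.53, 410.30, 266.28, 449.84, 438.47]
one = pd.DataFrame({
    'tow': ['week'] * len(week) + ['weekend'] * len(weekend),
    'daily_avg': week + weekend,
})
df = pd.concat([one.assign(id='c96'), one.assign(id='c97')], ignore_index=True)

def mwu(x, y):
    # Hypothetical alternative test: two-sided Mann-Whitney U
    return stats.mannwhitneyu(x, y, alternative='two-sided')

def apply_test(x, subgroup='tow', value='daily_avg', key1='week', key2='weekend', test=mwu):
    # One list of values per subgroup, then run the chosen test on the two lists
    g = x.groupby(subgroup).agg({value: list})
    r = test(g.loc[key1, value], g.loc[key2, value])
    return pd.Series({'statistic': r.statistic, 'pvalue': r.pvalue})

# Extra keyword arguments given to apply are forwarded to apply_test
U = df.groupby('id')[['tow', 'daily_avg']].apply(apply_test, test=mwu)
print(U)
</code></pre>
<p>Here the result fields are read by name rather than through <code>_fields</code>, which assumes the chosen test returns an object exposing <code>statistic</code> and <code>pvalue</code>.</p>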