How to compute the "t-test" statistic with numpy
I'm looking to generate some statistics about a model I created in Python. I'd like to run a t-test on it, but I'm wondering if there is an easy way to do this with numpy or scipy. Are there any good explanations around?
For example, I have three related datasets that look like this:
[55.0, 55.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0]
Now, I would like to run a Student's t-test on them.
3 Answers
Once you get your t-value, you may wonder how to interpret it as a probability (I did). Here is a function I wrote to help with that.
It is based on information I gleaned from http://www.vassarstats.net/rsig.html and http://en.wikipedia.org/wiki/Student%27s_t_distribution.
from scipy import stats

# Given (possibly random) variables, X and Y, and a correlation direction,
# returns:
#  (r, p),
# where r is the Pearson correlation coefficient, and p is the probability
# of getting the observed values if there is actually no correlation in the given
# direction.
#
# direction:
#  if positive, p is the probability of getting the observed result when there is no
#     positive correlation in the normally distributed full populations sampled by X
#     and Y
#  if negative, p is the probability of getting the observed result when there is no
#     negative correlation
#  if 0, p is the probability of getting your result, if your hypothesis is true that
#     there is no correlation in either direction
def probabilityOfResult(X, Y, direction=0):
    x = len(X)
    if x != len(Y):
        raise ValueError("variables not same len: " + str(x) + ", and " +
                         str(len(Y)))
    if x < 6:
        raise ValueError("must have at least 6 samples, but have " + str(x))
    (corr, prb_2_tail) = stats.pearsonr(X, Y)
    if not direction:
        return (corr, prb_2_tail)
    prb_1_tail = prb_2_tail / 2
    if corr * direction > 0:
        return (corr, prb_1_tail)
    return (corr, 1 - prb_1_tail)
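
For example, with two equally long samples (the numbers here are reused from the question and the other answer purely for illustration), a call could look like this:

import numpy as np

X = np.array([55.0, 55.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0])
Y = np.array([54.0, 56.0, 48.0, 46.0, 56.0, 56.0, 55.0, 62.0])

# two-tailed: probability of seeing this correlation if there is actually none
r, p_two_tail = probabilityOfResult(X, Y)
# one-tailed: probability of seeing this result if there is no positive correlation
r, p_one_tail = probabilityOfResult(X, Y, direction=1)
print(r, p_two_tail, p_one_tail)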
van's answer using scipy is correct, and the scipy.stats.ttest_* functions are very convenient.
However, I came to this page looking for a pure numpy solution, as the title suggests, to avoid the scipy dependency. To this end, let me point to the example given here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.standard_t.html
The main problem is that numpy does not provide cumulative distribution functions, hence my conclusion is that you should really use scipy. Still, using only numpy is possible:
From the original question I am guessing that you want to compare your datasets and judge with a t-test whether there is a significant deviation, and that the samples are paired (see https://en.wikipedia.org/wiki/Student%27s_t-test#Unpaired_and_paired_two-sample_t-tests).
In that case, you can calculate the t- and p-value like so:
import numpy as np

sample1 = np.array([55.0, 55.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0])
sample2 = np.array([54.0, 56.0, 48.0, 46.0, 56.0, 56.0, 55.0, 62.0])

# paired samples -> under the null hypothesis the difference has mean 0
difference = sample1 - sample2
# the t-value is easily computed with numpy
t = np.mean(difference) / (difference.std(ddof=1) / np.sqrt(len(difference)))
# unfortunately, numpy does not have a built-in CDF,
# so here is a ridiculous work-around: estimate it by sampling
# (a paired test with n differences has n - 1 degrees of freedom)
s = np.random.standard_t(len(difference) - 1, size=100000)
p = np.sum(s < t) / float(len(s))
# using a two-sided test
print("There is a {} % probability that the paired samples stem from distributions with the same means.".format(2 * min(p, 1 - p) * 100))
This prints something like: There is a 73.028 % probability that the paired samples stem from distributions with the same means. (The exact number varies from run to run because of the random sampling.)
Since this probability is far above any sensible significance level (say, 5 %), you should not conclude anything about the concrete case.
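
If pulling in scipy is acceptable after all, the same paired test can be cross-checked with scipy.stats.ttest_rel, which uses the exact t-distribution CDF instead of the sampling work-around (a minimal sketch with the same two samples):

import numpy as np
from scipy import stats

sample1 = np.array([55.0, 55.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0])
sample2 = np.array([54.0, 56.0, 48.0, 46.0, 56.0, 56.0, 55.0, 62.0])

# paired (related-samples) t-test: is the mean of the differences zero?
t, p = stats.ttest_rel(sample1, sample2)
print("t = {:.3f}, two-sided p = {:.3f}".format(t, p))

The t-value should match the numpy computation above, and the p-value should be close to the sampled estimate.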
In the scipy.stats package there are several functions whose names begin with ttest_... You can find examples here:
>>> print('t-statistic = %6.3f pvalue = %6.4f' % stats.ttest_1samp(x, m))
t-statistic = 0.391 pvalue = 0.6955
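
Applied to the data from the question, a one-sample test against a hypothesized mean could look like this (the reference mean of 50.0 is made up for illustration, it is not given in the question):

import numpy as np
from scipy import stats

x = np.array([55.0, 55.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0])
# one-sample t-test: does the sample mean differ from the hypothesized mean?
t_stat, p_value = stats.ttest_1samp(x, 50.0)
print('t-statistic = %6.3f pvalue = %6.4f' % (t_stat, p_value))

For comparing two of the related datasets directly, stats.ttest_ind (independent samples) and stats.ttest_rel (paired samples) follow the same calling pattern.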