Calculating the number of previous chances for use in a chi-square test



I have a script that, for each row, counts how many times the person (column 7) appeared in the list before the date given in that row with a 1 in column 6, and also how many times that person appeared in the list before that date in total. The rows are sorted chronologically, and column references are zero-based.
<h3>Sample dataset</h3>
<pre><code>02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith
02/01/2005,Data,Class xpv,4,11yo+,6,4,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,2,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,5,George Smith
03/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,1,Emma Lilly
03/01/2005,Data,Class tn2,4,10yo+,6,6,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,6,Ted James
04/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips
04/01/2005,Data,Class tn2,4,10yo+,6,2,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,4,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,1,George Smith
04/01/2005,Data,Class tn2,4,10yo+,6,5,Tom Phillips
05/01/2005,Data,Class 22zn,2,10yo+,5,3,Emma Lilly
05/01/2005,Data,Class 22zn,2,10yo+,5,1,Ted James
05/01/2005,Data,Class 22zn,2,10yo+,5,2,George Smith
05/01/2005,Data,Class 22zn,2,10yo+,5,4,Emma Lilly
05/01/2005,Data,Class 22zn,2,10yo+,5,5,Tom Phillips
</code></pre>
<h3>The code I use</h3>
^{pr2}$
<h3>It returns:</h3>
<pre><code>02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly,0,0
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith,0,0
02/01/2005,Data,Class xpv,4,11yo+,6,4,Tom Phillips,0,0
03/01/2005,Data,Class tn2,4,10yo+,6,2,Tom Phillips,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,5,George Smith,1,2
03/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,1,Emma Lilly,0,1
03/01/2005,Data,Class tn2,4,10yo+,6,6,George Smith,1,2
04/01/2005,Data,Class tn2,4,10yo+,6,6,Ted James,0,1
04/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,0,3
04/01/2005,Data,Class tn2,4,10yo+,6,2,George Smith,1,4
04/01/2005,Data,Class tn2,4,10yo+,6,4,George Smith,1,4
04/01/2005,Data,Class tn2,4,10yo+,6,1,George Smith,1,4
04/01/2005,Data,Class tn2,4,10yo+,6,5,Tom Phillips,0,3
05/01/2005,Data,Class 22zn,2,10yo+,5,3,Emma Lilly,1,2
05/01/2005,Data,Class 22zn,2,10yo+,5,1,Ted James,0,2
05/01/2005,Data,Class 22zn,2,10yo+,5,2,George Smith,2,7
05/01/2005,Data,Class 22zn,2,10yo+,5,4,Emma Lilly,1,2
05/01/2005,Data,Class 22zn,2,10yo+,5,5,Tom Phillips,0,5
</code></pre>
Ultimately I want to run a chi-square test on the percentage data I generate. What I want to achieve right now, though, is to calculate and sum the fractional chance for any one person within a unique class (column 2) and add it to the CSV as a new column. I am not sure whether the code I am using can be edited to do this. Any constructive suggestions or comments on how best to do this would be much appreciated.
<h3>My desired output is:</h3>
<pre><code>02/01/2005,Data,Class xpv,4,11yo+,5,1,George Smith,0,0,0
02/01/2005,Data,Class xpv,4,11yo+,5,2,Ted James,0,0,0
02/01/2005,Data,Class xpv,4,11yo+,5,3,Emma Lilly,0,0,0
02/01/2005,Data,Class xpv,4,11yo+,5,5,George Smith,0,0,0
02/01/2005,Data,Class xpv,4,11yo+,5,4,Tom Phillips,0,0,0
03/01/2005,Data,Class tn2,4,10yo+,5,2,Tom Phillips,0,1,0.2, He gets 0.2 because there was a 1 in 5 chance for previous occurrences on dates prior to today. 1/5
03/01/2005,Data,Class tn2,4,10yo+,5,5,George Smith,1,2,0.4, He gets 0.4 because there was a 2 in 5 chance for previous occurrences on dates prior to today. 2/5
03/01/2005,Data,Class tn2,4,10yo+,5,3,Tom Phillips,0,1,0.2
03/01/2005,Data,Class tn2,4,10yo+,5,1,Emma Lilly,0,1,0.2
03/01/2005,Data,Class tn2,4,10yo+,5,6,George Smith,1,2,0.4
04/01/2005,Data,Class tn2,4,10yo+,6,6,Ted James,0,1,0.2
04/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips,0,3,0.6
04/01/2005,Data,Class tn2,4,10yo+,6,2,George Smith,1,4,0.8
04/01/2005,Data,Class tn2,4,10yo+,6,4,George Smith,1,4,0.8
04/01/2005,Data,Class tn2,4,10yo+,6,1,George Smith,1,4,0.8
04/01/2005,Data,Class tn2,4,10yo+,6,5,Tom Phillips,0,3,0.4
05/01/2005,Data,Class 22zn,2,10yo+,5,3,Emma Lilly,1,2,0.4
05/01/2005,Data,Class 22zn,2,10yo+,5,1,Ted James,0,2,0.366666667
05/01/2005,Data,Class 22zn,2,10yo+,5,2,George Smith,2,7,1.3
05/01/2005,Data,Class 22zn,2,10yo+,5,4,Emma Lilly,1,2,0.4
05/01/2005,Data,Class 22zn,2,10yo+,5,5,Tom Phillips,0,5,0.733333333
</code></pre>
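Since the asker's script is not shown here, the following is only a minimal sketch of how the two existing count columns could be produced with the standard library. It assumes the rows are already sorted by date, as the question states; the function name `add_prior_counts` and the variable names are made up for this illustration, and only the first two dates of the sample data are used to keep it short.

```python
import csv
import io
from collections import Counter
from itertools import groupby

# First two dates of the sample dataset (an excerpt, to keep the example short).
SAMPLE = """\
02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith
02/01/2005,Data,Class xpv,4,11yo+,6,4,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,2,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,5,George Smith
03/01/2005,Data,Class tn2,4,10yo+,6,3,Tom Phillips
03/01/2005,Data,Class tn2,4,10yo+,6,1,Emma Lilly
03/01/2005,Data,Class tn2,4,10yo+,6,6,George Smith
"""


def add_prior_counts(rows):
    """Append two columns to each row: prior 1s in column 6, prior appearances.

    Assumes `rows` is sorted chronologically by column 0 (zero-based).
    """
    seen = Counter()  # appearances on earlier dates, per person
    ones = Counter()  # appearances with a 1 in column 6 on earlier dates
    out = []
    for _, day_rows in groupby(rows, key=lambda r: r[0]):
        day_rows = list(day_rows)
        # Emit the whole day first, so same-day rows do not count as "prior".
        for r in day_rows:
            out.append(r + [str(ones[r[7]]), str(seen[r[7]])])
        # Only then fold the day's rows into the running totals.
        for r in day_rows:
            seen[r[7]] += 1
            if r[6] == "1":
                ones[r[7]] += 1
    return out


rows = list(csv.reader(io.StringIO(SAMPLE)))
result = add_prior_counts(rows)
for row in result:
    print(",".join(row))
```

Updating the counters only after a whole day has been emitted is what keeps appearances on the same date from being counted as previous ones.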


1 Answer

Anonymous, 1 day ago


This is not meant to be a complete answer to your question (it is somewhat ambiguous what you are trying to do), but rather to show how naturally <a href="http://pandas.pydata.org/" rel="nofollow">pandas</a> fits this kind of calculation; you can also refer to columns by name rather than by index.

Suppose you have a <code>test.csv</code> file like this:
<pre><code>date,x0,cls,x1,x2,x3,tag,name
02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith
...
</code></pre>
I have given every column a name. You can read this file into a pandas DataFrame with <code>pandas.read_csv</code>; <code>df</code> then looks like this:
<pre><code>         date    x0        cls  x1     x2  x3  tag          name
0  02/01/2005  Data  Class xpv   4  11yo+   4    1  George Smith
1  02/01/2005  Data  Class xpv   4  11yo+   4    2     Ted James
2  02/01/2005  Data  Class xpv   4  11yo+   4    3    Emma Lilly
3  02/01/2005  Data  Class xpv   4  11yo+   4    5  George Smith
...
</code></pre>
I drop the columns you do not use (this is only for the demonstration; you do not have to drop them):
<pre><code>df.drop( labels=['x0', 'x1', 'x2', 'x3'], axis=1, inplace=True )
</code></pre>
Now <code>df</code> looks like this:
<pre><code>         date        cls  tag          name
0  02/01/2005  Class xpv    1  George Smith
1  02/01/2005  Class xpv    2     Ted James
2  02/01/2005  Class xpv    3    Emma Lilly
3  02/01/2005  Class xpv    5  George Smith
...
</code></pre>
Suppose you want to find, for each person and each date, the cumulative number of appearances on previous dates:
<pre><code>pv = df.pivot_table( columns='name', index='date', values='tag', aggfunc=len ).shift( 1 ).fillna( 0 ).cumsum( )
</code></pre>
The API documentation (see <a href="http://pandas.pydata.org/pandas-docs/stable/api.html" rel="nofollow">here</a>) describes each method in detail. The pivot table <code>pv</code> now looks like this:
<pre><code>date        Emma Lilly  George Smith  Ted James  Tom Phillips
02/01/2005           0             0          0             0
03/01/2005           1             2          1             1
04/01/2005           2             4          1             3
05/01/2005           2             7          2             5
</code></pre>
Alternatively, you can use <code>groupby</code>:
<pre><code>df.groupby(['date', 'name'])['name'].aggregate(len).unstack( ).shift( 1 ).fillna( 0 ).cumsum( )
</code></pre>
To do the same calculation, but only for rows where <code>tag == 1</code>, you can do this:
<pre><code>idx = df.tag == 1
pv1 = df[ idx ].pivot_table( columns='name', index='date', values='tag', aggfunc=len ).shift( 1 ).fillna( 0 ).cumsum( )
</code></pre>
or, with the <code>groupby</code> syntax:
<pre><code>df[ df.tag == 1 ].groupby(['date', 'name'])['name'].aggregate(len).unstack( ).shift( 1 ).fillna( 0 ).cumsum( )
</code></pre>
which gives:
<pre><code>date        Emma Lilly  George Smith  Ted James
02/01/2005           0             0          0
03/01/2005           0             1          0
04/01/2005           1             1          0
05/01/2005           1             2          0
</code></pre>
To fill in the two new columns, we write a helper function that falls back to 0 when a value is missing (Tom Phillips, for example, never has tag 1 and so has no column in <code>pv1</code>):
<pre><code>def lookup( pivot_table, col, idx, fall_back=0 ):
    # Return the pivot-table entry for (col, idx), or fall_back if it is absent.
    try:
        return pivot_table[ col ][ idx ]
    except KeyError:
        return fall_back

df[ 'cnt1' ] = [ lookup( pv1, row[ 'name' ], row[ 'date' ] ) for idx, row in df.iterrows( ) ]
df[ 'cnt' ] = [ lookup( pv, row[ 'name' ], row[ 'date' ] ) for idx, row in df.iterrows( ) ]
</code></pre>
We get:
<pre><code>          date         cls  tag          name  cnt1  cnt
0   02/01/2005   Class xpv    1  George Smith     0    0
1   02/01/2005   Class xpv    2     Ted James     0    0
2   02/01/2005   Class xpv    3    Emma Lilly     0    0
3   02/01/2005   Class xpv    5  George Smith     0    0
4   02/01/2005   Class tn2    4  Tom Phillips     0    0
5   03/01/2005   Class tn2    2  Tom Phillips     0    1
6   03/01/2005   Class tn2    5  George Smith     1    2
7   03/01/2005   Class tn2    3  Tom Phillips     0    1
8   03/01/2005   Class tn2    1    Emma Lilly     0    1
9   03/01/2005   Class tn2    6  George Smith     1    2
10  04/01/2005   Class tn2    6     Ted James     0    1
11  04/01/2005   Class tn2    3  Tom Phillips     0    3
12  04/01/2005   Class tn2    2  George Smith     1    4
13  04/01/2005   Class tn2    4  George Smith     1    4
14  04/01/2005   Class tn2    1  George Smith     1    4
15  04/01/2005   Class tn2    5  Tom Phillips     0    3
16  05/01/2005  Class 22zn    3    Emma Lilly     1    2
17  05/01/2005  Class 22zn    1     Ted James     0    2
18  05/01/2005  Class 22zn    2  George Smith     2    7
19  05/01/2005  Class 22zn    4    Emma Lilly     1    2
20  05/01/2005  Class 22zn    5  Tom Phillips     0    5
</code></pre>
If I knew how you calculate the last column, I could continue. For example, why does "Tom Phillips" get 0.2 on the sixth row?!

Edit: OK, let's continue. We need to find how many times each person appears on each date; this is another pivot table:
<pre><code>appr = df.pivot_table( columns='name', index='date', values='tag', aggfunc=len ).fillna( 0 )
</code></pre>
or
<pre><code>df.groupby( ['date', 'name'] )['name'].aggregate(len).unstack( ).fillna( 0 )
</code></pre>
Output:
<pre><code>date        Emma Lilly  George Smith  Ted James  Tom Phillips
02/01/2005           1             2          1             1
03/01/2005           1             2          0             2
04/01/2005           0             3          1             2
05/01/2005           2             1          1             1
</code></pre>
The total number of appearances on each date:
<pre><code>total_appr = appr.sum( axis=1 )
</code></pre>
Output:
<pre><code>date
02/01/2005    5
03/01/2005    5
04/01/2005    6
05/01/2005    5
</code></pre>
To calculate the cumulative fraction, you simply divide each row by its total, shift by 1 (because we are looking at previous dates), and then take the cumulative sum:
<pre><code>frac = appr.apply( lambda x: x / total_appr ).shift( 1 ).fillna( 0 ).cumsum( )
df[ 'frac' ] = [ frac[ row[ 'name' ] ][ row[ 'date' ] ] for idx, row in df.iterrows( ) ]
</code></pre>
Now <code>df</code> looks like this:
<pre><code>          date         cls  tag          name  cnt1  cnt      frac
0   02/01/2005   Class xpv    1  George Smith     0    0  0.000000
1   02/01/2005   Class xpv    2     Ted James     0    0  0.000000
2   02/01/2005   Class xpv    3    Emma Lilly     0    0  0.000000
3   02/01/2005   Class xpv    5  George Smith     0    0  0.000000
4   02/01/2005   Class tn2    4  Tom Phillips     0    0  0.000000
5   03/01/2005   Class tn2    2  Tom Phillips     0    1  0.200000
6   03/01/2005   Class tn2    5  George Smith     1    2  0.400000
7   03/01/2005   Class tn2    3  Tom Phillips     0    1  0.200000
8   03/01/2005   Class tn2    1    Emma Lilly     0    1  0.200000
9   03/01/2005   Class tn2    6  George Smith     1    2  0.400000
10  04/01/2005   Class tn2    6     Ted James     0    1  0.200000
11  04/01/2005   Class tn2    3  Tom Phillips     0    3  0.600000
12  04/01/2005   Class tn2    2  George Smith     1    4  0.800000
13  04/01/2005   Class tn2    4  George Smith     1    4  0.800000
14  04/01/2005   Class tn2    1  George Smith     1    4  0.800000
15  04/01/2005   Class tn2    5  Tom Phillips     0    3  0.600000
16  05/01/2005  Class 22zn    3    Emma Lilly     1    2  0.400000
17  05/01/2005  Class 22zn    1     Ted James     0    2  0.366667
18  05/01/2005  Class 22zn    2  George Smith     2    7  1.300000
19  05/01/2005  Class 22zn    4    Emma Lilly     1    2  0.400000
20  05/01/2005  Class 22zn    5  Tom Phillips     0    5  0.933333
</code></pre>
In two rows of the last column (rows 15 and 20, both Tom Phillips), my numbers differ from yours. So either I have misunderstood your calculation, or you have miscalculated those two values.
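The two differing values can be checked by hand without pandas. The sketch below is a small pure-Python recomputation under the answer's definition (the sum, over strictly earlier dates, of a person's appearances divided by all appearances on that date). The counts are copied from the <code>appr</code> table in the answer; the names `appearances` and `prior_fraction` are made up for this illustration.

```python
from fractions import Fraction

# Appearances per date and person, copied from the `appr` table in the answer.
dates = ["02/01/2005", "03/01/2005", "04/01/2005", "05/01/2005"]
appearances = {
    "Emma Lilly":   [1, 1, 0, 2],
    "George Smith": [2, 2, 3, 1],
    "Ted James":    [1, 0, 1, 1],
    "Tom Phillips": [1, 2, 2, 1],
}

# Total appearances on each date, matching `total_appr` (5, 5, 6, 5).
totals = [
    sum(counts[i] for counts in appearances.values())
    for i in range(len(dates))
]


def prior_fraction(name, date):
    """Sum of (person's appearances / all appearances) over dates before `date`."""
    upto = dates.index(date)
    return sum(
        (Fraction(appearances[name][i], totals[i]) for i in range(upto)),
        Fraction(0),
    )


# Tom Phillips on 04/01/2005: 1/5 + 2/5
print(prior_fraction("Tom Phillips", "04/01/2005"))
# Tom Phillips on 05/01/2005: 1/5 + 2/5 + 2/6
print(prior_fraction("Tom Phillips", "05/01/2005"))
```

Under this definition the two rows come out as 3/5 and 14/15, which agrees with the 0.600000 and 0.933333 in the answer's table rather than the 0.4 and 0.733333333 in the question's desired output. If the question intended a different rule for those rows, the sketch would need to change accordingly.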
