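<p>This is not meant to be a complete answer to your question (it is somewhat ambiguous what you are trying to do), but rather to show how <a href="http://pandas.pydata.org/" rel="nofollow">pandas</a> fits this kind of computation naturally; it also lets you refer to columns by name rather than by index.</p>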
<p>Suppose you have a <code>test.csv</code> file like this:</p>
<pre><code>date,x0,cls,x1,x2,x3,tag,name
02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith
...
</code></pre>
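<p>I have given each column a name. You can read this file into a pandas data frame with</p>
<pre><code>import pandas as pd

df = pd.read_csv( 'test.csv' )
</code></pre>
<p><code>df</code> looks like this:</p>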
<pre><code>         date    x0        cls  x1     x2  x3  tag          name
0  02/01/2005  Data  Class xpv   4  11yo+   4    1  George Smith
1  02/01/2005  Data  Class xpv   4  11yo+   4    2     Ted James
2  02/01/2005  Data  Class xpv   4  11yo+   4    3    Emma Lilly
3  02/01/2005  Data  Class xpv   4  11yo+   4    5  George Smith
...
</code></pre>
<p>I drop the columns you do not use (this is only for demonstration; you do not have to drop them):</p>
<pre><code>df.drop( labels=['x0', 'x1', 'x2', 'x3'], axis=1, inplace=True )
</code></pre>
<p>Now <code>df</code> looks like this:</p>
<pre><code>         date        cls  tag          name
0  02/01/2005  Class xpv    1  George Smith
1  02/01/2005  Class xpv    2     Ted James
2  02/01/2005  Class xpv    3    Emma Lilly
3  02/01/2005  Class xpv    5  George Smith
...
</code></pre>
<p>Suppose that, for each date, you want the cumulative number of times each person appeared on <em>previous</em> dates:</p>
<pre><code># older pandas releases spelled these keywords `cols=` and `rows=`
pv = df.pivot_table( columns='name',
                     index='date',
                     values='tag',
                     aggfunc=len ).shift( 1 ).fillna( 0 ).cumsum( )
</code></pre>
<p>The API documentation (see <a href="http://pandas.pydata.org/pandas-docs/stable/api.html" rel="nofollow">here</a>) describes each of these methods in detail. You now have the pivot table <code>pv</code>, which looks like this:</p>
<pre><code>date        Emma Lilly  George Smith  Ted James  Tom Phillips
02/01/2005           0             0          0             0
03/01/2005           1             2          1             1
04/01/2005           2             4          1             3
05/01/2005           2             7          2             5
</code></pre>
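<p>To make the chain easier to follow, here is a minimal sketch that splits it into its four steps. The frame <code>toy</code> and its values are made up for illustration and are not your data. Note that the pivot table sorts its index, so with dates stored as strings the order is lexicographic; that is fine for the sample above, but converting with <code>pd.to_datetime</code> first is probably safer in general.</p>

```python
import pandas as pd

# Hypothetical toy data: one row per appearance.
toy = pd.DataFrame({
    "date": ["d1", "d1", "d1", "d2", "d2", "d3"],
    "name": ["A", "A", "B", "A", "B", "B"],
    "tag":  [1, 2, 1, 1, 3, 1],
})

# Step 1: number of appearances per (date, name); missing pairs become NaN.
counts = toy.pivot_table(index="date", columns="name", values="tag", aggfunc=len)

# Step 2: move every row down by one, so each date only sees the date before it.
shifted = counts.shift(1)

# Step 3: the first date has no history (and missing pairs are NaN), so use 0.
filled = shifted.fillna(0)

# Step 4: running total over all previous dates.
prev_total = filled.cumsum()

print(prev_total)
```

<p>On the last date, <code>A</code> has appeared three times before and <code>B</code> twice, which is what the final row of <code>prev_total</code> holds.</p>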
<p>Alternatively, you can use <code>groupby</code>:</p>
<pre><code>df.groupby(['date', 'name'])['name'].aggregate(len).unstack( ).shift( 1 ).fillna( 0 ).cumsum( )
</code></pre>
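<p>As a quick sanity check that the two spellings count the same thing, the sketch below compares them on made-up data. It uses <code>.size()</code> as a shorter stand-in for <code>['name'].aggregate(len)</code>; the frame <code>toy</code> is hypothetical.</p>

```python
import pandas as pd

# Hypothetical toy data: one row per appearance.
toy = pd.DataFrame({
    "date": ["d1", "d1", "d1", "d2", "d2", "d3"],
    "name": ["A", "A", "B", "A", "B", "B"],
    "tag":  [1, 2, 1, 1, 3, 1],
})

# Counts per (date, name) via a pivot table ...
via_pivot = toy.pivot_table(index="date", columns="name", values="tag", aggfunc=len)

# ... and via groupby: size() counts rows per group, unstack() turns names into columns.
via_groupby = toy.groupby(["date", "name"]).size().unstack()

# Compare after filling the missing (date, name) pairs with 0.
same = via_pivot.fillna(0).astype(int).equals(via_groupby.fillna(0).astype(int))
print(same)
```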
<p>To do the same calculation, but only for rows with <code>tag == 1</code>, you can do:</p>
<pre><code>idx = df.tag == 1
pv1 = df[ idx ].pivot_table( columns='name',
                             index='date',
                             values='tag',
                             aggfunc=len ).shift( 1 ).fillna( 0 ).cumsum( )
</code></pre>
<p>or, using the <code>groupby</code> syntax:</p>
<pre><code>df[ df.tag == 1 ].groupby(['date', 'name'])['name'].aggregate(len).unstack( ).shift( 1 ).fillna( 0 ).cumsum( )
</code></pre>
<p>which gives:</p>
<pre><code>date        Emma Lilly  George Smith  Ted James
02/01/2005           0             0          0
03/01/2005           0             1          0
04/01/2005           1             1          0
05/01/2005           1             2          0
</code></pre>
<p>To fill in the two new columns, we write a helper function that falls back to 0 when a value is missing:</p>
<pre><code>def lookup( pivot_table, col, idx, fall_back=0 ):
    # value at row `idx` of column `col`, or `fall_back` if either label is missing
    try:
        return pivot_table[ col ][ idx ]
    except KeyError:
        return fall_back

df[ 'cnt1' ] = [ lookup( pv1, row[ 'name' ], row[ 'date' ] ) for idx, row in df.iterrows( ) ]
df[ 'cnt' ] = [ lookup( pv, row[ 'name' ], row[ 'date' ] ) for idx, row in df.iterrows( ) ]
</code></pre>
<p>We get:</p>
<pre><code>          date         cls  tag          name  cnt1  cnt
 0  02/01/2005   Class xpv    1  George Smith     0    0
 1  02/01/2005   Class xpv    2     Ted James     0    0
 2  02/01/2005   Class xpv    3    Emma Lilly     0    0
 3  02/01/2005   Class xpv    5  George Smith     0    0
 4  02/01/2005   Class tn2    4  Tom Phillips     0    0
 5  03/01/2005   Class tn2    2  Tom Phillips     0    1
 6  03/01/2005   Class tn2    5  George Smith     1    2
 7  03/01/2005   Class tn2    3  Tom Phillips     0    1
 8  03/01/2005   Class tn2    1    Emma Lilly     0    1
 9  03/01/2005   Class tn2    6  George Smith     1    2
10  04/01/2005   Class tn2    6     Ted James     0    1
11  04/01/2005   Class tn2    3  Tom Phillips     0    3
12  04/01/2005   Class tn2    2  George Smith     1    4
13  04/01/2005   Class tn2    4  George Smith     1    4
14  04/01/2005   Class tn2    1  George Smith     1    4
15  04/01/2005   Class tn2    5  Tom Phillips     0    3
16  05/01/2005  Class 22zn    3    Emma Lilly     1    2
17  05/01/2005  Class 22zn    1     Ted James     0    2
18  05/01/2005  Class 22zn    2  George Smith     2    7
19  05/01/2005  Class 22zn    4    Emma Lilly     1    2
20  05/01/2005  Class 22zn    5  Tom Phillips     0    5
</code></pre>
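<p>The row-by-row lookup above is easy to read but can be slow on large frames. One possible alternative, sketched here on made-up data, is to reshape the pivot table into long form with <code>melt</code> and left-merge it back on <code>date</code> and <code>name</code>; the <code>fillna( 0 )</code> after the merge plays the role of <code>fall_back</code>. The names <code>toy</code>, <code>long_form</code> and <code>out</code> are hypothetical.</p>

```python
import pandas as pd

# Hypothetical toy data; B and C never have tag == 1.
toy = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2", "d3"],
    "name": ["A", "B", "A", "C", "A"],
    "tag":  [1, 2, 1, 2, 1],
})

# Cumulative number of previous tag == 1 appearances, as in pv1 above.
pv1 = (toy[toy.tag == 1]
       .pivot_table(index="date", columns="name", values="tag", aggfunc=len)
       .shift(1).fillna(0).cumsum())

# Wide -> long: one row per (date, name) with the value in column "cnt1".
long_form = pv1.reset_index().melt(id_vars="date", var_name="name", value_name="cnt1")

# A left merge keeps every row of toy; pairs absent from pv1 come back as NaN.
out = toy.merge(long_form, on=["date", "name"], how="left")
out["cnt1"] = out["cnt1"].fillna(0)

print(out)
```

<p>Rows for <code>B</code> and <code>C</code> end up with 0 because those names never appear in <code>pv1</code>, which matches what the <code>KeyError</code> fallback in <code>lookup</code> does.</p>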
<p>I could continue if I knew how you calculate the last column of your table. For example, why does "Tom Phillips" get 0.2 in the sixth row?!</p>
<p><strong>Edit</strong>: OK, let's continue. We need to find how many times each person appeared on each date; this is another pivot table:</p>
<pre><code>appr = df.pivot_table( columns='name',
                       index='date',
                       values='tag',
                       aggfunc=len ).fillna( 0 )
</code></pre>
<p>or</p>
<pre><code>df.groupby( ['date', 'name'] )['name'].aggregate(len).unstack( ).fillna( 0 )
</code></pre>
<p>Output:</p>
<pre><code>date        Emma Lilly  George Smith  Ted James  Tom Phillips
02/01/2005           1             2          1             1
03/01/2005           1             2          0             2
04/01/2005           0             3          1             2
05/01/2005           2             1          1             1
</code></pre>
<p>The total number of appearances on each date:</p>
<pre><code>total_appr = appr.sum( axis=1 )
</code></pre>
<p>Output:</p>
<pre><code>date
02/01/2005 5
03/01/2005 5
04/01/2005 6
05/01/2005 5
</code></pre>
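<p>To calculate the cumulative fractions, you can simply divide each row by its total, shift by 1 (because we look at previous dates), and then take the cumulative sum:</p>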
<pre><code>frac = appr.apply( lambda x: x / total_appr ).shift( 1 ).fillna( 0 ).cumsum( )
df[ 'frac' ] = [ frac[ row[ 'name' ] ][ row[ 'date' ] ] for idx, row in df.iterrows( ) ]
</code></pre>
<p>Now <code>df</code> looks like this:</p>
<pre><code>          date         cls  tag          name  cnt1  cnt      frac
 0  02/01/2005   Class xpv    1  George Smith     0    0  0.000000
 1  02/01/2005   Class xpv    2     Ted James     0    0  0.000000
 2  02/01/2005   Class xpv    3    Emma Lilly     0    0  0.000000
 3  02/01/2005   Class xpv    5  George Smith     0    0  0.000000
 4  02/01/2005   Class tn2    4  Tom Phillips     0    0  0.000000
 5  03/01/2005   Class tn2    2  Tom Phillips     0    1  0.200000
 6  03/01/2005   Class tn2    5  George Smith     1    2  0.400000
 7  03/01/2005   Class tn2    3  Tom Phillips     0    1  0.200000
 8  03/01/2005   Class tn2    1    Emma Lilly     0    1  0.200000
 9  03/01/2005   Class tn2    6  George Smith     1    2  0.400000
10  04/01/2005   Class tn2    6     Ted James     0    1  0.200000
11  04/01/2005   Class tn2    3  Tom Phillips     0    3  0.600000
12  04/01/2005   Class tn2    2  George Smith     1    4  0.800000
13  04/01/2005   Class tn2    4  George Smith     1    4  0.800000
14  04/01/2005   Class tn2    1  George Smith     1    4  0.800000
15  04/01/2005   Class tn2    5  Tom Phillips     0    3  0.600000
16  05/01/2005  Class 22zn    3    Emma Lilly     1    2  0.400000
17  05/01/2005  Class 22zn    1     Ted James     0    2  0.366667
18  05/01/2005  Class 22zn    2  George Smith     2    7  1.300000
19  05/01/2005  Class 22zn    4    Emma Lilly     1    2  0.400000
20  05/01/2005  Class 22zn    5  Tom Phillips     0    5  0.933333
</code></pre>
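<p>For reference, the fraction step can be checked by hand on a small made-up table. The sketch below uses <code>DataFrame.div</code> with <code>axis=0</code>, which should be equivalent to the <code>apply</code> with a lambda used above; the table <code>appr</code> here is hypothetical, not the one computed from your data.</p>

```python
import pandas as pd

# Hypothetical appearance counts: rows are dates, columns are names.
appr = pd.DataFrame(
    {"A": [1, 2, 0], "B": [1, 2, 4]},
    index=pd.Index(["d1", "d2", "d3"], name="date"),
)

# Total appearances per date (row sums).
total_appr = appr.sum(axis=1)

# Share of each date's appearances per person; axis=0 aligns the totals with the rows.
share = appr.div(total_appr, axis=0)

# Only previous dates count: shift down one row, start from 0, accumulate.
frac = share.shift(1).fillna(0).cumsum()

# The apply/lambda form divides column by column and gives the same shares.
share_via_apply = appr.apply(lambda col: col / total_appr)

print(frac)
```

<p>Here both people get 0.5 of the appearances on <code>d1</code> and on <code>d2</code>, so on <code>d3</code> the cumulative fraction over previous dates is 1.0 for each of them.</p>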
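<p>In two rows of the <code>frac</code> column, my numbers differ from yours. So either I have misunderstood your calculation, or you miscalculated those two numbers.</p>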