pandas:快速计算具有特定值的列的和

2024-05-12 18:49:32 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个数据帧,我需要计算一列值的总和,这些值落在某个窗口内。举个例子,如果我有一个500的窗口,我的初始值是1000,我想把499和999之间,以及1001和1501之间的所有值相加。你知道吗

用一些数据更容易解释:

    chrom   pos end AFR EUR pi
0   1   10177   10177   0.4909  0.4056  0.495988
1   1   10352   10352   0.4788  0.4264  0.496369
2   1   10617   10617   0.9894  0.9940  0.017083
3   1   11008   11008   0.1346  0.0885  0.203142
4   1   11012   11012   0.1346  0.0885  0.203142
5   1   13110   13110   0.0053  0.0567  0.053532
6   1   13116   13116   0.0295  0.1869  0.176091
7   1   13118   13118   0.0295  0.1869  0.176091
8   1   13273   13273   0.0204  0.1471  0.139066
9   1   13550   13550   0.0008  0.0080  0.007795
10  1   14464   14464   0.0144  0.1859  0.161422
11  1   14599   14599   0.1210  0.1610  0.238427
12  1   14604   14604   0.1210  0.1610  0.238427
13  1   14930   14930   0.4811  0.5209  0.500209
14  1   14933   14933   0.0015  0.0507  0.044505
15  1   15211   15211   0.5371  0.7316  0.470848
16  1   15585   15585   0.0008  0.0020  0.002635
17  1   15644   15644   0.0008  0.0080  0.007795
18  1   15777   15777   0.0159  0.0149  0.030470
19  1   15820   15820   0.4849  0.2714  0.477153
20  1   15903   15903   0.0431  0.4652  0.349452
21  1   16071   16071   0.0091  0.0010  0.011142
22  1   16142   16142   0.0053  0.0020  0.007721
23  1   16949   16949   0.0227  0.0159  0.038759
24  1   18643   18643   0.0023  0.0080  0.009485
25  1   18849   18849   0.8411  0.9911  0.170532
26  2   30923   30923   0.6687  0.9364  0.338400
27  2   20286   46286   0.0053  0.0010  0.006863
28  2   21698   46698   0.0015  0.0010  0.002566
29  2   42159   47159   0.0083  0.0696  0.067187

所以我需要根据前两列进行子集划分。例如,如果my window=500、my chrom=1和my pos=15500,则需要将df子集为仅包含chrom=1和15000>;pos<;16000的行。你知道吗

然后我想对这个数据子集的AFR列求和。你知道吗

以下是我所做的功能:

#vdf is my main dataframe, 
#polyChrom is the chromosome to subset by,     
#polyPos is the position to subset by. 
#Distance is how far the window should be from the polyPos. 
#windowSize is the size of the window itself
#E.g. if distance=20000 and windowSize= 500, we are looking at a window
#that is (polyPos-20000)-500 to (polyPos-20000) and a window that is 
#(polyPos+20000) to (polyPos+20000)+500.
def mafWindow(vdf, polyChrom, polyPos, distance, windowSize):

#If start position becomes less than 0, set it to 0
    if(polyPos - distance < 0):
        start1 = 0
        end1 = windowSize
    else:
        start1 = polyPos - distance
        end1 = start1 + windowSize

    end2 = polyPos + distance
    start2 = end2 - windowSize

#subset df        
    df = vdf.loc[(vdf['chrom'] == polyChrom) & ((vdf['pos'] <= end1) & (vdf['pos'] >= start1))|
                 ((vdf['pos'] <= end2) & (vdf['pos'] >= start2))].copy()

    return(df.AFR.sum())

整个方法都用于对数据帧进行子集设置,当我的数据帧包含~55k行时,速度非常慢。有没有更快更有效的方法?你知道吗


Tags: theto数据posdfismywindow
1条回答
网友
1楼 · 发布于 2024-05-12 18:49:32

诀窍是下降到numpy数组。熊猫索引和切片速度很慢。你知道吗

import pandas as pd

df = pd.DataFrame([[1, 10177, 0.5], [1, 10178, 0.2], [1, 20178, 0.1],
                   [2, 10180, 0.3], [1, 10180, 0.4]], columns=['chrom', 'pos', 'AFR'])

chrom = df['chrom'].values
pos = df['pos'].values
afr = df['AFR'].values

def filter_sum(chrom_arr, pos_arr, afr_arr, chrom_val, pos_start, pos_end):
    return sum(k for i, j, k in zip(chrom_arr, pos_arr, afr_arr) \
               if pos_start < j < pos_end and i == chrom_val)

filter_sum(chrom, pos, afr, 1, 10150, 10200)

# 1.1

相关问题 更多 >