2-dimensional bins from a pandas DataFrame based on 3 columns

Posted 2024-04-29 09:39:01


I am trying to create 2-dimensional bins from a pandas DataFrame based on 3 columns. Here is a snippet from my DataFrame:

      Scatters  N   z           Dist_first
---------------------------------------
0     0         0   0.096144    2.761508
1     1         0   -8.229910   17.403039
2     2         0   0.038125    21.466233
3     3         0   -2.050480   29.239867
4     4         0   -1.620470   NaN
5     5         0   -1.975930   NaN
6     6         0   -11.672200  NaN
7     7         0   -16.629000  26.554049
8     8         0   0.096002    NaN
9     9         0   0.176049    NaN
10    10        0   0.176005    NaN
11    11        0   0.215408    NaN
12    12        0   0.255889    NaN
13    13        0   0.301834    27.700308
14    14        0   -29.593600  9.155065
15    15        1   -2.582290   NaN
16    16        1   0.016441    2.220946
17    17        1   -17.329100  NaN
18    18        1   -5.442320   34.520919
19    19        1   0.001741    39.579189
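
To make the snippet easier to experiment with, the rows above can be typed into a DataFrame directly. This is only a convenience sketch: the values are copied from the table, and the frame is named N because that is the name the code further down uses.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Values copied from the table above; the frame is called N to match
# the name used by the code in the question.
N = pd.DataFrame({
    "Scatters": list(range(20)),
    "N": [0] * 15 + [1] * 5,
    "z": [0.096144, -8.229910, 0.038125, -2.050480, -1.620470,
          -1.975930, -11.672200, -16.629000, 0.096002, 0.176049,
          0.176005, 0.215408, 0.255889, 0.301834, -29.593600,
          -2.582290, 0.016441, -17.329100, -5.442320, 0.001741],
    "Dist_first": [2.761508, 17.403039, 21.466233, 29.239867, nan,
                   nan, nan, 26.554049, nan, nan,
                   nan, nan, nan, 27.700308, 9.155065,
                   nan, 2.220946, nan, 34.520919, 39.579189],
})

# Number of non-NaN distances per group N
print(N.groupby("N")["Dist_first"].count())
```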

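For my result, each "Dist_first" should be binned with all "z <= 0" values from rows of the same group "N" that sit at a lower index than the distance itself. "Scatters" is a copy of the index left over from an earlier operation in my code and is not relevant here. Nonetheless, I ended up using it instead of the index in the example below. The bins for the distances and the z values are in steps of 10 m and 0.1 m, respectively, and I can obtain a result by looping through the groups of the DataFrame: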

import numpy as np
import pandas as pd

# N is the DataFrame shown above; bins_v, bins_h, N_v, N_h and V are not shown in this snippet

# maximal number of distances in any group N, and the matching column labels 1..n_dist
n_dist = N.groupby('N')['Dist_first'].count().max()
dist_cols = list(np.arange(n_dist) + 1)

# create new column for maximal possible distances per group N
for j in range(n_dist):
    N[j+1] = N.loc[N[N['Dist_first'].notna()].groupby('N')['Scatters'].nlargest(j+1).groupby('N').min()]['Dist_first']
    # fill nans with zeros to allow the subtraction below
    N[j+1] = N[j+1].fillna(0)
    # make sure no value is repeated
    if j+1 > 1:
        N[j+1] = N[j+1] - N[dist_cols[:j]].sum(axis=1)

# and set all values <= 0 to NaN
N[N[dist_cols] <= 0] = np.nan

# backwards fill to make sure every distance gets all necessary depths
N[dist_cols] = N.set_index('N').groupby('N').bfill().set_index('Scatters')[dist_cols]

# bin the result(s), using only rows with z >= 0
pos = N[N['z'] >= 0]
for j in range(n_dist):
    # observed=False keeps empty bins so the shape matches N_v and N_h
    binned = pos.groupby([pd.cut(pos['z'], bins_v, include_lowest=True), pd.cut(pos[j+1], bins_h, include_lowest=True)], observed=False)
    binned = binned.size().unstack()
    ## rename
    binned.index = N_v.index
    binned.columns = N_h.index
    ## and sum up with earlier chunks
    V = V + binned

This code works fine, and for the small snippet of data I shared the result looks like this:

Distance [m]    0.0     10.0    20.0    30.0    40.0
Depth [m]
----------------------------------------------------
0.0             1       1       1       4       2
0.1             1       2       2       4       0
0.2             0       3       0       3       0
0.3             0       2       0       2       0
0.4             0       0       0       0       0
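
The loop relies on bins_v, bins_h, N_v, N_h and V, none of which appear in the question. The following is one possible setup, given purely as an assumption: the bin edges follow the stated 0.1 m and 10 m steps, the five bins per axis and the lower-edge labels are guessed from the result table, and N_v and N_h are hypothetical Series whose only role is to carry those labels.

```python
import numpy as np
import pandas as pd

# Assumed edges: 0.1 m depth steps and 10 m distance steps, five bins each,
# chosen only to match the labels of the result table above.
bins_v = np.round(np.arange(6) * 0.1, 1)   # 0.0, 0.1, ..., 0.5
bins_h = np.arange(6) * 10.0               # 0, 10, ..., 50

# Hypothetical helper Series; only their index (lower bin edges) is used
# to label the rows and columns of the binned counts.
N_v = pd.Series(0, index=pd.Index(bins_v[:-1], name="Depth [m]"))
N_h = pd.Series(0, index=pd.Index(bins_h[:-1], name="Distance [m]"))

# Accumulator for the summed counts, one cell per (depth, distance) bin.
V = pd.DataFrame(0, index=N_v.index, columns=N_h.index)
print(V)
```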

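However, the full datasets are very large (more than 300 million rows each), so looping over all rows is not an option. I am therefore looking for a vectorized solution.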
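
One way this could be vectorized is sketched below. It is not a tested drop-in replacement; it assumes the pairing rule that the loop appears to implement: every row with z >= 0 is counted once for each non-NaN Dist_first found at the same or a later row of its group N. The function name bin_2d, the toy frame and the lower-edge labels are illustrative assumptions. The idea is to one-hot encode the distance bin of each row, take a reversed cumulative sum per group, and then add up those rows per depth bin.

```python
import numpy as np
import pandas as pd


def bin_2d(df, bins_v, bins_h):
    """Count (z, Dist_first) pairs per 2D bin without looping over rows.

    Assumed pairing rule (read off the loop in the question): a row with
    z >= 0 is paired with every non-NaN Dist_first located at the same or
    a later row of its group N.
    """
    n_v, n_h = len(bins_v) - 1, len(bins_h) - 1

    # Distance bin of each row; NaN or out-of-range values get code -1.
    d_code = pd.cut(df["Dist_first"], bins_h, include_lowest=True).cat.codes.to_numpy()
    valid = d_code >= 0

    # onehot[i, b] == 1 if row i holds a distance that falls in bin b.
    onehot = np.zeros((len(df), n_h), dtype=np.int64)
    onehot[np.flatnonzero(valid), d_code[valid]] = 1

    # Reversed cumulative sum per group: for each row, how many distances
    # per bin sit at this row or later within the same group.
    groups_reversed = df["N"].to_numpy()[::-1]
    later = (pd.DataFrame(onehot).iloc[::-1]
             .groupby(groups_reversed).cumsum()
             .iloc[::-1].to_numpy())

    # Depth bin of each row, keeping only z >= 0 as the loop does.
    z_code = pd.cut(df["z"], bins_v, include_lowest=True).cat.codes.to_numpy()
    keep = (df["z"].to_numpy() >= 0) & (z_code >= 0)

    # Add each kept row's per-bin counts to the row of its depth bin.
    counts = np.zeros((n_v, n_h), dtype=np.int64)
    np.add.at(counts, z_code[keep], later[keep])

    # Label rows and columns by the lower bin edges (an assumption).
    return pd.DataFrame(counts, index=bins_v[:-1], columns=bins_h[:-1])


# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "N": [0, 0, 0, 1, 1],
    "z": [0.05, -1.0, 0.15, 0.05, 0.25],
    "Dist_first": [5.0, 15.0, np.nan, np.nan, 25.0],
})
toy_bins_v = np.round(np.arange(4) * 0.1, 1)   # 0.0, 0.1, 0.2, 0.3
toy_bins_h = np.arange(4) * 10.0               # 0, 10, 20, 30

result = bin_2d(toy, toy_bins_v, toy_bins_h)
print(result)
```

The intermediate array holds one integer per row and distance bin, so for hundreds of millions of rows it may be necessary to process the frame in chunks that each contain whole groups and to add up the returned tables.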

