How can I aggregate statistics in one DataFrame by filtering on values from another DataFrame?

2 answers

User

Answer 1 · edited 2024-06-10 10:28:49

Sample data and imports

import pandas as pd
import numpy as np
import random

# dfr
rdata = {'ID': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
         'Top': [2010, 4300, 4550, 7100, 3200, 4120, 4300, 5500, 2300, 3200],
         'Bottom': [3000, 4500, 5000, 7700, 4100, 4180, 5300, 5520, 2380, 4500]}

dfr = pd.DataFrame(rdata)

# display(dfr.head())
   ID   Top  Bottom
0   1  2010    3000
1   1  4300    4500
2   1  4550    5000
3   1  7100    7700
4   2  3200    4100

# df
np.random.seed(365)
random.seed(365)
rows = 10000
data = {'id': [random.choice([1, 2, 3]) for _ in range(rows)],
        'depth': [np.random.randint(2000, 8000) for _ in range(rows)],
        'v1': [np.random.randint(40, 50) for _ in range(rows)],
        'v2': np.random.rand(rows),
        'v3': [np.random.randint(20, 1000) for _ in range(rows)]}

df = pd.DataFrame(data)
df.sort_values(['id', 'depth'], inplace=True)
df.reset_index(drop=True, inplace=True)

# display(df.head())
   id  depth  v1        v2   v3
0   1   2004  48  0.517014  292
1   1   2004  41  0.997347  859
2   1   2006  42  0.278217  851
3   1   2006  49  0.570363   32
4   1   2009  43  0.462985  409

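Use each row of `dfr` to filter `df` and extract statistics

There are many existing answers on how to handle "TypeError: Invalid comparison between dtype=float64 and str". Any value in the numeric columns that cannot be converted to a numeric type has to be cleaned out first.
This code covers filtering one DataFrame with another and returning metrics.
For each row in dfr:
- Filter df
- Aggregate mean and count for v1, v2, and v3
- .T transposes the mean and count rows into columns
- Convert to a numpy array
- Slice the array for the 3 means and append that array to v_mean
- Slice the array for the max count and append the value to counts
  - If the data contains no NaNs, the three counts are all the same
Convert the list of arrays, v_mean, to a DataFrame and join it to dfr_new
Add the counts as a column of dfr_new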
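
The cleaning step mentioned above could look something like the following. This is only a minimal sketch: the frame `raw` and its bad entry `'n/a'` are hypothetical stand-ins, not data from the question, and dropping the unparseable rows is one possible choice among several.

```python
import pandas as pd

# Hypothetical frame: 'depth' was read as strings and contains one bad entry
raw = pd.DataFrame({'id': [1, 1, 2],
                    'depth': ['2004', 'n/a', '3100'],
                    'v1': [48, 41, 42]})

# errors='coerce' turns anything that cannot be parsed as a number into NaN
raw['depth'] = pd.to_numeric(raw['depth'], errors='coerce')

# Drop the rows whose depth could not be parsed (an assumption; you may prefer to fix them)
clean = raw.dropna(subset=['depth']).reset_index(drop=True)
```

After this, comparisons such as `clean.depth >= t` operate on a float column and should no longer raise the dtype comparison error.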

v_mean = list()
counts = list()

for idx, (i, t, b) in dfr.iterrows():  # iterate through each row of dfr

    data = df[['v1', 'v2', 'v3']][(df.id == i) & (df.depth >= t) & (df.depth <= b)].agg(['mean', 'count']).T.to_numpy()  # apply filters and get stats
    v_mean.append(data[:, 0])  # get the 3 means
    counts.append(data[:, 1].max())  # get the max of the 3 counts; each column has its own count, and the counts can differ if there are NaNs in the data


# copy dfr to dfr_new
dfr_new = dfr.copy()

# add stats values (column names match the output shown below)
dfr_new = dfr_new.join(pd.DataFrame(v_mean, columns=['v1_mean', 'v2_mean', 'v3_mean']))
dfr_new['count'] = counts

# display(dfr_new)
   ID   Top  Bottom    v1_mean   v2_mean     v3_mean  count
0   1  2010    3000  44.577491  0.496768  502.068266  542.0
1   1  4300    4500  44.555556  0.518066  530.968254  126.0
2   1  4550    5000  44.446281  0.538855  482.818182  242.0
3   1  7100    7700  44.348083  0.489983  506.681416  339.0
4   2  3200    4100  44.804040  0.487011  528.707071  495.0
5   2  4120    4180  45.096774  0.526687  520.967742   31.0
6   2  4300    5300  44.476980  0.529476  523.095764  543.0
7   2  5500    5520  46.000000  0.608876  430.500000   12.0
8   3  2300    2380  44.512195  0.456632  443.195122   41.0
9   3  3200    4500  44.554755  0.516616  501.841499  694.0
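
If the row-by-row loop becomes slow, one possible alternative is to merge on the ID and filter afterwards. The sketch below is not the method used in this answer; it uses small hypothetical frames `ranges` and `meas` in place of `dfr` and `df`, and it assumes the intermediate merge (one row per range and measurement sharing an ID) fits in memory.

```python
import pandas as pd

# Small hypothetical stand-ins for dfr (ranges) and df (measurements)
ranges = pd.DataFrame({'ID': [1, 1, 2],
                       'Top': [2000, 4000, 3000],
                       'Bottom': [3000, 5000, 3500]})
meas = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                     'depth': [2100, 2900, 4500, 3200, 9000],
                     'v1': [40, 44, 50, 42, 99],
                     'v2': [0.2, 0.4, 0.6, 0.8, 1.0]})

# Pair every range with every measurement that has the same ID;
# reset_index() keeps the original range position in a column named 'index'
pairs = ranges.reset_index().merge(meas, left_on='ID', right_on='id')

# Keep only the measurements whose depth lies inside the range (inclusive on both ends)
inside = pairs[pairs['depth'].between(pairs['Top'], pairs['Bottom'])]

# One row of statistics per range
stats = inside.groupby('index').agg(v1_mean=('v1', 'mean'),
                                    v2_mean=('v2', 'mean'),
                                    count=('depth', 'count'))

# Join back on the range position; ranges without any match get NaN
result = ranges.join(stats)
```

The trade-off is memory for speed: the merge materializes every same-ID pair before filtering, so for very large inputs the loop may still be the safer choice.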

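User

Answer 2 · edited 2024-06-10 10:28:49

Edit #2A:

Note: the example DataFrames below are not exactly the same as the ones posted in the question.

Posting new code here that uses Top and Bottom from rdf to check DEPTH in ldf, then computes the .mean() of each group using a for-loop. Assuming the DataFrame rdf has no duplicates, a range_key that is unique to each row is created in rdf.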

# Import libraries
import numpy as np
import pandas as pd

# Create DataFrame
rdf = pd.DataFrame({
    'ID': [1,1,1,1,2,2,2,2,3,3],
    'Top': [2000,4300,4500,7100,3200,4120,4300,5500,2300,3200],
    'Bottom':[2500,4500,5000,7700,4100,4180,5300,5520,2380,4500]
})
ldf = pd.DataFrame({
    'ID': [1,1,1,1,1,1,1,1,2,2,3],
    'DEPTH': [2000,2000.5,2001,2001.5,4002,4002.5,5003,5003.5,2000,2000,2000],
    'Value1':[45,43,40,28,40,23,34,54,40,38,43],
    'Value2':[.32,.33,.12,.10,.10,.11,.08,.04,.28,.25,.30],
    'Value3':[423,500,643,20,34,60,900,1002,560,654,343]
})

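# Create a key for merge later
ldf['range_key'] = None  # object column; filled with string keys below
rdf['range_key'] = np.arange(1, len(rdf) + 1).astype(str)


# Flag each row with the first range that contains its depth
# Note: ID is not compared here; the merge on ['ID', 'range_key'] below drops cross-ID matches
for i in range(ldf.shape[0]):
    d = ldf.loc[i, 'DEPTH']
    for j in range(rdf.shape[0]):
        if rdf.loc[j, 'Top'] <= d <= rdf.loc[j, 'Bottom']:
            ldf.loc[i, 'range_key'] = rdf.loc[j, 'range_key']  # .loc avoids chained assignment
            break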
ldf['range_key'] = ldf['range_key'].astype(int).astype(str) # Convert to string


# Calculate mean for groups
ldf_mean = ldf.groupby(['ID','range_key']).mean().reset_index()
ldf_mean = ldf_mean.drop(['DEPTH'], axis=1)


# Merge into 'rdf'
new_rdf = rdf.merge(ldf_mean, on=['ID','range_key'], how='left')
new_rdf = new_rdf.drop(['range_key'], axis=1)
new_rdf

Output:

   ID   Top  Bottom  Value1  Value2  Value3
0   1  2000    2500    39.0  0.2175   396.5
1   1  4300    4500     NaN     NaN     NaN
2   1  4500    5000     NaN     NaN     NaN
3   1  7100    7700     NaN     NaN     NaN
4   2  3200    4100     NaN     NaN     NaN
5   2  4120    4180     NaN     NaN     NaN
6   2  4300    5300     NaN     NaN     NaN
7   2  5500    5520     NaN     NaN     NaN
8   3  2300    2380     NaN     NaN     NaN
9   3  3200    4500     NaN     NaN     NaN

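Edit #1:

The code below seems to work. It is the code from the question with an if-statement added around the return. Not sure if this is what you want. It computes .sum(). The first range in rdf was changed to a lower one so that it matches the data in ldf.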

# Import libraries
import pandas as pd

# Create DataFrame
rdf = pd.DataFrame({
    'ID': [1,1,1,1,2,2,2,2,3,3],
    'Top': [2000,4300,4500,7100,3200,4120,4300,5500,2300,3200],
    'Bottom':[2500,4500,5000,7700,4100,4180,5300,5520,2380,4500]
})
ldf = pd.DataFrame({
    'ID': [1,1,1,1,1,1,1,1,2,2,3],
    'DEPTH': [2000,2000.5,2001,2001.5,2002,2002.5,2003,2003.5,2000,2000,2000],
    'Value1':[45,43,40,28,40,23,34,54,40,38,43],
    'Value2':[.32,.33,.12,.10,.10,.11,.08,.04,.28,.25,.30],
    'Value3':[423,500,643,20,34,60,900,1002,560,654,343]
})

##### Code from the question  (copy-pasted here)
Top = rdf['Top']
Bottom = rdf['Bottom']
Depths = ldf['DEPTH']

def get_depths(x):
    n = ldf[(ldf['DEPTH']>x['top']) & (ldf['DEPTH']<x['bottom'])]
    if (n.shape[0]>0):
        return n['ID'].values[0],n['DEPTH'].sum()
    

test = pd.DataFrame({'top':Top, 'bottom':Bottom})
test[['ID','Value1']] = test.apply(lambda x : get_depths(x),1).apply(pd.Series)

Output:

test

    top  bottom   ID   Value1
0  2000    2500  1.0  14014.0
1  4300    4500  NaN      NaN
2  4500    5000  NaN      NaN
3  7100    7700  NaN      NaN
4  3200    4100  NaN      NaN
5  4120    4180  NaN      NaN
6  4300    5300  NaN      NaN
7  5500    5520  NaN      NaN
8  2300    2380  NaN      NaN
9  3200    4500  NaN      NaN
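
Note that the function above does not compare IDs, and the column it labels Value1 is actually the sum of DEPTH. If what you want is a per-range sum of Value1 restricted to the matching ID, a variant could look like the sketch below. The helper `range_stats`, the column names `n_rows` and `Value1_sum`, and the tiny frames are all hypothetical; the strict inequalities are kept from the code above.

```python
import pandas as pd

# Tiny hypothetical frames with the same column names as rdf and ldf above
rdf = pd.DataFrame({'ID': [1, 2], 'Top': [2000, 3200], 'Bottom': [2500, 4100]})
ldf = pd.DataFrame({'ID': [1, 1, 2, 2],
                    'DEPTH': [2000.5, 2001.0, 2001.0, 5000.0],
                    'Value1': [45, 43, 40, 38]})

def range_stats(row):
    """Return the number of matching rows and the Value1 sum for one range of rdf."""
    hit = ldf[(ldf['ID'] == row['ID'])
              & (ldf['DEPTH'] > row['Top'])
              & (ldf['DEPTH'] < row['Bottom'])]
    # Always return a Series so that empty ranges still yield a well-formed row
    return pd.Series({'n_rows': len(hit), 'Value1_sum': hit['Value1'].sum()})

# apply(axis=1) calls range_stats once per range; join adds the new columns to rdf
out = rdf.join(rdf.apply(range_stats, axis=1))
```

Because the sum of an empty selection is 0 rather than NaN, the `n_rows` column is what tells you a range had no matching rows.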
