How do I aggregate statistics in one DataFrame by filtering on values from another DataFrame?

Posted 2024-06-10 10:28:49


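I have two DataFrames. rdf is the reference DataFrame: it defines the intervals (Top and Bottom) over which an average should be computed, across all depths that fall inside the interval. The calculation itself has to run on ldf, because that is where the values are. rdf gives a Top and Bottom for each ID, and the average should be computed for each of them. Each ID has multiple intervals.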

rdf looks like this:

ID   Top    Bottom 
1    2010   3000
1    4300   4500
1    4550   5000
1    7100   7700
2    3200   4100
2    4120   4180
2    4300   5300
2    5500   5520
3    2300   2380
3    3200   4500

ldf looks like this:

ID   Depth(ft)    Value1   Value2   Value3 
1    2000         45       .32      423
1    2000.5       43       .33      500
1    2001         40       .12      643
1    2001.5       28       .10      20
1    2002         40       .10      34
1    2002.5       23       .11      60
1    2003         34       .08      900
1    2003.5       54       .04      1002
2    2000         40       .28      560
2    2000         38       .25      654
...
3    2000         43       .30      343

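I want to use rdf to define the top and bottom of each interval and compute the mean of Value1, Value2, and Value3 within it. I also want to record a count (not every depth inside an interval necessarily has a value, so the count may be smaller than Bottom - Top). rdf would then be modified to produce a new file:

The new rdf would look like this: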

ID   Top    Bottom   avgValue1   avgValue2   avgValue3  ThicknessCount(ft)
1    2010   3000     54          .14         456        74
1    4300   4500     23          .18         632        124    
1    4550   5000     34          .24         780        111
1    7100   7700     54          .19         932        322
2    3200   4100     52          .32         134        532
2    4120   4180     16          .11         111        32
2    4300   5300     63          .29         872        873
2    5500   5520     33          .27         1111       9
3    2300   2380     63          .13         1442       32
3    3200   4500     37          .14         1839       87

I have been looking for the best way to do this. I tried to mimic this time-series example: Sum set of values from pandas dataframe within certain time frame

But it does not seem to translate:

import pandas as pd

Top = rdf['Top']
Bottom = rdf['Bottom']
Depths = ldf['DEPTH']

def get_depths(x):
    n = ldf[(ldf['DEPTH']>x['top']) & (ldf['DEPTH']<x['bottom'])]
    return n['ID'].values[0],n['DEPTH'].sum()

test = pd.DataFrame({'top':Top, 'bottom':Bottom})
test[['ID','Value1']] = test.apply(lambda x : get_depths(x),1).apply(pd.Series)

I get "TypeError: Invalid comparison between dtype=float64 and str". If I use the sample data from that post it works, but it does not work with my data. I am also hoping there is a simpler way to do this.


2 answers

Sample data and imports

import pandas as pd
import numpy as np
import random

# dfr
rdata = {'ID': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
         'Top': [2010, 4300, 4550, 7100, 3200, 4120, 4300, 5500, 2300, 3200],
         'Bottom': [3000, 4500, 5000, 7700, 4100, 4180, 5300, 5520, 2380, 4500]}

dfr = pd.DataFrame(rdata)

# display(dfr.head())
   ID   Top  Bottom
0   1  2010    3000
1   1  4300    4500
2   1  4550    5000
3   1  7100    7700
4   2  3200    4100

# df
np.random.seed(365)
random.seed(365)
rows = 10000
data = {'id': [random.choice([1, 2, 3]) for _ in range(rows)],
        'depth': [np.random.randint(2000, 8000) for _ in range(rows)],
        'v1': [np.random.randint(40, 50) for _ in range(rows)],
        'v2': np.random.rand(rows),
        'v3': [np.random.randint(20, 1000) for _ in range(rows)]}

df = pd.DataFrame(data)
df.sort_values(['id', 'depth'], inplace=True)
df.reset_index(drop=True, inplace=True)

# display(df.head())
   id  depth  v1        v2   v3
0   1   2004  48  0.517014  292
1   1   2004  41  0.997347  859
2   1   2006  42  0.278217  851
3   1   2006  49  0.570363   32
4   1   2009  43  0.462985  409

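Use each row of dfr to filter df and extract statistics

  • There are many existing answers on how to deal with "TypeError: Invalid comparison between dtype=float64 and str". The numeric columns need to be cleaned of any value that cannot be converted to a numeric type
  • This code deals with using one DataFrame to filter another DataFrame and return metrics from it
  • For each row in dfr:
    • Filter df
    • Aggregate mean and count for v1, v2, and v3
    • Use .T to transpose the mean and count rows into columns
    • Convert to a numpy array
    • Slice the array for the 3 means and append that array to v_mean
    • Slice the array for the max count and append the value to counts
      • The counts could all be the same if there are no NaNs in the data
  • Convert the list of arrays, v_mean, to a DataFrame and join it to dfr_new
  • Add counts as a column of dfr_new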
v_mean = list()
counts = list()

for idx, (i, t, b) in dfr.iterrows():  # iterate through each row of dfr

    data = df[['v1', 'v2', 'v3']][(df.id == i) & (df.depth >= t) & (df.depth <= b)].agg(['mean', 'count']).T.to_numpy()  # apply filters and get stats
    v_mean.append(data[:, 0])  # get the 3 means
    counts.append(data[:, 1].max())  # get the max of the 3 counts; each column has its own count, which could differ if there are NaNs in the data


# copy dfr to dfr_new
dfr_new = dfr.copy()

# add stats values
dfr_new = dfr_new.join(pd.DataFrame(v_mean, columns=['v1_mean', 'v2_mean', 'v3_mean']))
dfr_new['count'] = counts

# display(dfr_new)
   ID   Top  Bottom    v1_mean   v2_mean     v3_mean  count
0   1  2010    3000  44.577491  0.496768  502.068266  542.0
1   1  4300    4500  44.555556  0.518066  530.968254  126.0
2   1  4550    5000  44.446281  0.538855  482.818182  242.0
3   1  7100    7700  44.348083  0.489983  506.681416  339.0
4   2  3200    4100  44.804040  0.487011  528.707071  495.0
5   2  4120    4180  45.096774  0.526687  520.967742   31.0
6   2  4300    5300  44.476980  0.529476  523.095764  543.0
7   2  5500    5520  46.000000  0.608876  430.500000   12.0
8   3  2300    2380  44.512195  0.456632  443.195122   41.0
9   3  3200    4500  44.554755  0.516616  501.841499  694.0
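
The first bullet above mentions cleaning the numeric columns. A minimal sketch of that step, assuming the strings are in rdf's Top and Bottom columns (the question does not show which column actually holds them) and that rows with unparseable bounds can simply be dropped. The sample values are made up for illustration:

```python
import pandas as pd

# Hypothetical data: Top/Bottom were read as strings, one entry is unparseable
rdf = pd.DataFrame({
    'ID': [1, 1, 2],
    'Top': ['2010', '4300', 'unknown'],
    'Bottom': ['3000', '4500', '4100'],
})

# Coerce each bound to a numeric dtype; values that cannot be parsed become NaN
for col in ['Top', 'Bottom']:
    rdf[col] = pd.to_numeric(rdf[col], errors='coerce')

# Drop intervals with a missing bound (an assumption; you may prefer to inspect them first)
rdf_clean = rdf.dropna(subset=['Top', 'Bottom']).reset_index(drop=True)

print(rdf_clean.dtypes)
```

Running `rdf.dtypes` and `ldf.dtypes` before filtering is a quick way to see which column came in as `object` instead of a numeric type.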

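Edit #2A:

Note: the sample DataFrames below are not exactly the same as the ones posted in the question.

Posting new code here that uses Top and Bottom from rdf to check DEPTH in ldf, and a for-loop to compute .mean() for each group. Assuming rdf has no duplicate rows, a range_key that is unique to each row is created in rdf.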

# Import libraries
import numpy as np
import pandas as pd

# Create DataFrame
rdf = pd.DataFrame({
    'ID': [1,1,1,1,2,2,2,2,3,3],
    'Top': [2000,4300,4500,7100,3200,4120,4300,5500,2300,3200],
    'Bottom':[2500,4500,5000,7700,4100,4180,5300,5520,2380,4500]
})
ldf = pd.DataFrame({
    'ID': [1,1,1,1,1,1,1,1,2,2,3],
    'DEPTH': [2000,2000.5,2001,2001.5,4002,4002.5,5003,5003.5,2000,2000,2000],
    'Value1':[45,43,40,28,40,23,34,54,40,38,43],
    'Value2':[.32,.33,.12,.10,.10,.11,.08,.04,.28,.25,.30],
    'Value3':[423,500,643,20,34,60,900,1002,560,654,343]
})

# Create a key for merge later
ldf['range_key'] = None  # object column, filled with string keys below
rdf['range_key'] = np.arange(1, rdf.shape[0] + 1).astype(str)


# Flag each row for a range
# Note: only DEPTH is compared here; ID is not used until the groupby/merge below
for i in range(ldf.shape[0]):
    for j in range(rdf.shape[0]):
        d = ldf.loc[i, 'DEPTH']
        if (d >= rdf.loc[j, 'Top']) and (d <= rdf.loc[j, 'Bottom']):
            # .loc avoids chained assignment (SettingWithCopyWarning)
            ldf.loc[i, 'range_key'] = rdf.loc[j, 'range_key']
            break
ldf['range_key'] = ldf['range_key'].astype(int).astype(str) # Convert to string


# Calculate mean for groups
ldf_mean = ldf.groupby(['ID','range_key']).mean().reset_index()
ldf_mean = ldf_mean.drop(['DEPTH'], axis=1)


# Merge into 'rdf'
new_rdf = rdf.merge(ldf_mean, on=['ID','range_key'], how='left')
new_rdf = new_rdf.drop(['range_key'], axis=1)
new_rdf

Output:

   ID   Top  Bottom  Value1  Value2  Value3
0   1  2000    2500    39.0  0.2175   396.5
1   1  4300    4500     NaN     NaN     NaN
2   1  4500    5000     NaN     NaN     NaN
3   1  7100    7700     NaN     NaN     NaN
4   2  3200    4100     NaN     NaN     NaN
5   2  4120    4180     NaN     NaN     NaN
6   2  4300    5300     NaN     NaN     NaN
7   2  5500    5520     NaN     NaN     NaN
8   3  2300    2380     NaN     NaN     NaN
9   3  3200    4500     NaN     NaN     NaN
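
The same per-interval statistics can also be computed without explicit loops by merging on ID first, which also guarantees that a reading is only matched against intervals of its own ID. The following is a rough sketch, not the code from either answer. It assumes each (ID, Top, Bottom) combination is unique in rdf, uses a reduced made-up sample, and the output column names avgValue1, avgValue2, and count are chosen here for illustration:

```python
import pandas as pd

# Reduced, made-up sample in the same shape as rdf / ldf
rdf = pd.DataFrame({
    'ID': [1, 1, 2, 3],
    'Top': [2000, 4300, 3200, 2300],
    'Bottom': [2500, 4500, 4100, 2380],
})
ldf = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2],
    'DEPTH': [2000, 2000.5, 4400, 3500, 5000],
    'Value1': [45, 43, 40, 38, 50],
    'Value2': [0.32, 0.33, 0.12, 0.25, 0.30],
})

# Pair every interval with every reading that shares its ID
pairs = rdf.merge(ldf, on='ID', how='left')

# Keep only readings whose depth lies inside the interval (bounds inclusive)
inside = pairs[pairs['DEPTH'].between(pairs['Top'], pairs['Bottom'])]

# Mean of each value column plus the number of readings, per interval
stats = (inside.groupby(['ID', 'Top', 'Bottom'])
               .agg(avgValue1=('Value1', 'mean'),
                    avgValue2=('Value2', 'mean'),
                    count=('DEPTH', 'count'))
               .reset_index())

# Left-merge back so that intervals without any readings are kept (as NaN)
new_rdf = rdf.merge(stats, on=['ID', 'Top', 'Bottom'], how='left')
print(new_rdf)
```

The intermediate `pairs` frame holds one row per interval and reading of the same ID, so for very large inputs it may be worth processing one ID at a time to limit memory use.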

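Edit #1:

The code below appears to work. I added an if-statement in front of the return in the code posted above. Not sure if this is what you are after. It computes .sum(). I changed the first interval in rdf to a lower range so that it matches the data in ldf.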

# Import libraries
import pandas as pd

# Create DataFrame
rdf = pd.DataFrame({
    'ID': [1,1,1,1,2,2,2,2,3,3],
    'Top': [2000,4300,4500,7100,3200,4120,4300,5500,2300,3200],
    'Bottom':[2500,4500,5000,7700,4100,4180,5300,5520,2380,4500]
})
ldf = pd.DataFrame({
    'ID': [1,1,1,1,1,1,1,1,2,2,3],
    'DEPTH': [2000,2000.5,2001,2001.5,2002,2002.5,2003,2003.5,2000,2000,2000],
    'Value1':[45,43,40,28,40,23,34,54,40,38,43],
    'Value2':[.32,.33,.12,.10,.10,.11,.08,.04,.28,.25,.30],
    'Value3':[423,500,643,20,34,60,900,1002,560,654,343]
})

##### Code from the question  (copy-pasted here)
Top = rdf['Top']
Bottom = rdf['Bottom']
Depths = ldf['DEPTH']

def get_depths(x):
    n = ldf[(ldf['DEPTH']>x['top']) & (ldf['DEPTH']<x['bottom'])]
    if (n.shape[0]>0):
        return n['ID'].values[0],n['DEPTH'].sum()
    

test = pd.DataFrame({'top':Top, 'bottom':Bottom})
test[['ID','Value1']] = test.apply(lambda x : get_depths(x),1).apply(pd.Series)

Output:

test

    top  bottom   ID   Value1
0  2000    2500  1.0  14014.0
1  4300    4500  NaN      NaN
2  4500    5000  NaN      NaN
3  7100    7700  NaN      NaN
4  3200    4100  NaN      NaN
5  4120    4180  NaN      NaN
6  4300    5300  NaN      NaN
7  5500    5520  NaN      NaN
8  2300    2380  NaN      NaN
9  3200    4500  NaN      NaN
