按数据帧分组并选择序列中的最大值

2024-06-16 13:22:54 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个pandas数据框,表示几个目标涡轮机每10度点之间的高程差。我选择了遵循标准的高程差,并添加了一列,表示它们是否连续(metDegDiff=10表示连续点)

如何通过targTurb在3个或更多连续10度点中选择elevDif的最大值

ridgeDF2 = pd.DataFrame(data = {
    'MetID':['A06_40','A06_50','A06_60','A06_70','A06_80','A06_100','A06_110','A06_140','A07_110','A07_130','A07_140','A08_100','A08_110','A08_120','A08_130','A08_220'],
    'targTurb':['A06','A06','A06','A06','A06','A06','A06','A06','A07','A07','A07','A08','A08','A08','A08','A08'],
    'metDeg':[30,50,60,70,80,100,110,140,110,130,140,100,110,120,130,220],
    'elevDif':[1.433234, 1.602997,3.227997,2.002991,2.414001,2.96402,1.513,1.793976,1.612,2.429993,1.639008,1.500977,3.048004,2.174011,1.813995,1.527008],
    'metDegDiff':[20,10,10,10,10,20,10,30,-30,20,10,-40,10,10,10,30]})
[Dbg]>>> ridgeDF2
      MetID targTurb  metDeg   elevDif  metDegDiff
0    A06_40      A06      30  1.433234          20
1    A06_50      A06      50  1.602997          10
2    A06_60      A06      60  3.227997          10
3    A06_70      A06      70  2.002991          10
4    A06_80      A06      80  2.414001          10
5   A06_100      A06     100  2.964020          20
6   A06_110      A06     110  1.513000          10
7   A06_140      A06     140  1.793976          30
8   A07_110      A07     110  1.612000         -30
9   A07_130      A07     130  2.429993          20
10  A07_140      A07     140  1.639008          10
11  A08_100      A08     100  1.500977         -40
12  A08_110      A08     110  3.048004          10
13  A08_120      A08     120  2.174011          10
14  A08_130      A08     130  1.813995          10
15  A08_220      A08     220  1.527008          30

在该示例中,对于A06,有4行具有连续的10个metDeg值(行1、2、3和4),对于A8,有3行(行12、13和14)。请注意,这两个系列的长度为3或更多

因此,输出将是这两个选定系列内的最大elevDif。像这样:

MetID  targTurb  metDeg   elevDif  metDegDiff
A06_60      A06      60  3.227997          10
A08_110     A08     110  3.048004          10

Tags: 数据目标pandas标准高程涡轮机a06a08
3条回答

链式解决方案

   ridgeDF2.loc[ridgeDF2[((ridgeDF2.assign(group=(ridgeDF2.metDegDiff!=10).cumsum())).groupby('group')['metDegDiff'].transform(lambda x: (x==10)& (x.count()>=3)))].groupby('targTurb')['elevDif'].idxmax()]

逐步解决方案

.cumsum()metDegDiff创建第一个元素不是10的组

ridgeDF2=ridgeDF2.assign(group=(ridgeDF2.metDegDiff!=10).cumsum())

应用多个筛选器以除去上面生成的组中不等于10的metDegDiff,并保留连续值计数=10等于或大于3的组。我通过链groupby().transform()和布尔选择来实现这一点

g=ridgeDF2[ridgeDF2.groupby('group')['metDegDiff'].transform(lambda x: (x==10)& (x.count()>=3))]

从上面剩下的内容中,选择具有最大值的索引

g.loc[g.groupby('targTurb')['elevDif'].idxmax()]

结果

  MetID targTurb  metDeg   elevDif  metDegDiff
2    A06_60      A06      60  3.227997          10
12  A08_110      A08     110  3.048004          10

时机

%timeit ridgeDF2.loc[ridgeDF2[((ridgeDF2.assign(group=(ridgeDF2.metDegDiff!=10).cumsum())).groupby('group')['metDegDiff'].transform(lambda x: (x==10)& (x.count()>=3)))].groupby('targTurb')['elevDif'].idxmax()]
9.01 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

您可以使用shiftcumsum为metDegDiff中相同的连续值和相同的targetrb创建一个组列。然后,您可以使用此组列来选择组中有更多或相等(ge)3个值的位置,这些值是由map组号和组号的value_counts获得的,并且metDegDiff中的值等于(eq)到10。现在您只有感兴趣的组,您可以在elevDif上sort_values和在group列上drop_duplicates保留每个组的最大值。如有必要,以drop列gr和sort_valuesper targTurb结束

ridgeDF2['metDegDiff'] = ridgeDF2['metDeg'].diff() #I assume calculated this way

#create a group number with same consecutive values and same targTurb
ridgeDF2['gr'] = (ridgeDF2['metDegDiff'].ne(ridgeDF2['metDegDiff'].shift())
                  |(ridgeDF2['targTurb'].ne(ridgeDF2['targTurb'].shift()))
                 ).cumsum()

#get the result dataframe
res_ = (ridgeDF2.loc[ridgeDF2['metDegDiff'].eq(10) #row with 10 in metDegDiff
                     &ridgeDF2['gr'].map(ridgeDF2['gr'].value_counts()).ge(3)] #and row with group of greater equal 3 values
                .sort_values(by='elevDif') # ascending sort of the elevDif
                .drop_duplicates('gr', keep='last') #keep the last row pergroup having higher number
                .drop('gr', axis=1) #remove the extra group column
                .sort_values('targTurb') #if you need
       )

你可以得到你想要的行

print (res_)
      MetID targTurb  metDeg   elevDif  metDegDiff
2    A06_60      A06      60  3.227997        10.0
12  A08_110      A08     110  3.048004        10.0

下面的代码应该可以工作。您可以分别运行每一行来查看发生了什么

ridgeDF2['t/f'] = ridgeDF2['metDegDiff'] != 10
ridgeDF2['t/f'] = ridgeDF2['t/f'].shift().fillna(0).cumsum()
ridgeDF2['count'] = ridgeDF2.groupby('t/f')['t/f'].transform(len)-1
ridgeDF2['count'] = np.where(ridgeDF2['count'] >= 3,True,False)
ridgeDF2.loc[ridgeDF2['metDegDiff'] != 10,'count'] = False
highest = ridgeDF2.loc[ridgeDF2['count'] == True]
highest = highest.loc[highest.groupby(['targTurb','metDegDiff','t/f'])['elevDif'].idxmax()]
highest.drop(columns = ['t/f','count'])

相关问题 更多 >