Vectorizing over two DataFrames instead of looping

        date  och  cch  och1  och2  och3  cch1
0  5/30/2012 -0.7 -0.7     3    -1     1    56
1  9/16/2013  0.9 -1.0     6     4     3     7
2  9/26/2013  2.5  5.4     2     3     2     4
3  8/26/2016  0.1 -0.7     4     3     5    10

for i in dffinal.index:
    df3 = df2.copy()
    df3 = df3[df3['och1'] > dffinal['och1'].iloc[i]]
    df3 = df3[df3['och2'] > dffinal['och2'].iloc[i]]
    df3 = df3[df3['och3'] > dffinal['och3'].iloc[i]]
    df3 = df3[df3['cch1'] > dffinal['cch1'].iloc[i]]
    dffinal['LCH'][i] = df3["och"].mean()
    dffinal['L#'][i] = len(df3.index)

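3 answers

User

#1 · Edited 2024-05-13 12:03:36

The only loop-free pandas approach I can think of is a cross join after resetting the index, followed by a column-wise comparison reduced with df.all(1)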

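cols = ['och1','och2','och3','cch1']
# Cross join via a constant key: every df2 row is paired with every dffinal row
u = df2.reset_index().assign(k=1).merge(
    dffinal.reset_index().assign(k=1),on='k',suffixes=('','_y'))
# pandas 1.2 and later can do this directly with merge(..., how='cross')

# Keep pairs where df2 is greater in all four columns, then average och per dffinal row
dffinal['NewLCH'] = (u[u[cols].gt(u[[f"{i}_y" for i in cols]].to_numpy()).all(axis=1)]
                     .groupby("index_y")['och'].mean())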

print(dffinal)

        date  id  och  och1  och2  och3  cch1  LCH  L#    NewLCH
0  3/27/2020   1 -2.1     3     3     1     5  NaN NaN  0.900000
1   4/9/2020   2  2.0     1     2     1     3  NaN NaN  1.166667
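
For reference, here is a minimal sketch of the same idea using how='cross' (available in pandas 1.2 and later) instead of the helper key k. The sample frames are rebuilt from the tables shown in this thread, and filling both LCH and L# in one aggregation is my own addition rather than part of the answer above.

```python
import pandas as pd

# Sample frames rebuilt from the question and the output above.
df2 = pd.DataFrame({
    "date": ["5/30/2012", "9/16/2013", "9/26/2013", "8/26/2016"],
    "och":  [-0.7, 0.9, 2.5, 0.1],
    "cch":  [-0.7, -1.0, 5.4, -0.7],
    "och1": [3, 6, 2, 4],
    "och2": [-1, 4, 3, 3],
    "och3": [1, 3, 2, 5],
    "cch1": [56, 7, 4, 10],
})
dffinal = pd.DataFrame({
    "date": ["3/27/2020", "4/9/2020"],
    "id":   [1, 2],
    "och":  [-2.1, 2.0],
    "och1": [3, 1],
    "och2": [3, 2],
    "och3": [1, 1],
    "cch1": [5, 3],
})

cols = ["och1", "och2", "och3", "cch1"]

# how="cross" pairs every df2 row with every dffinal row, no helper key needed.
u = df2.reset_index().merge(dffinal.reset_index(), how="cross", suffixes=("", "_y"))

# Keep pairs where the df2 value exceeds the dffinal value in all four columns.
keep = u[cols].gt(u[[f"{c}_y" for c in cols]].to_numpy()).all(axis=1)

# One aggregation gives both the mean of och and the number of matching rows.
stats = u[keep].groupby("index_y")["och"].agg(["mean", "size"])

dffinal["LCH"] = stats["mean"]
# dffinal rows with no match are absent from stats, so give them a count of 0.
dffinal["L#"] = stats["size"].reindex(dffinal.index, fill_value=0)
print(dffinal)
```

Keep in mind that the cross join materializes len(df2) * len(dffinal) rows, so this only pays off while that product fits comfortably in memory.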

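User

#2 · Edited 2024-05-13 12:03:36

Given how the subset of df2 rows is selected for each dffinal row, it is probably hard to avoid iteration altogether, but the logic below should speed up the iterative approach (hopefully by a lot)

(Note: if you repeatedly access rows of the dataframe you are iterating over, use .iterrows so you can grab the values more simply and quickly)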

for i, row in dffinal.iterrows():
    # Rows of df2 that exceed this dffinal row in all four columns
    mask = (
        (df2['och1'] > row['och1']) &
        (df2['och2'] > row['och2']) &
        (df2['och3'] > row['och3']) &
        (df2['cch1'] > row['cch1'])
    )
    och_array = df2.loc[mask, 'och'].values
    dffinal.at[i, 'LCH'] = och_array.mean()
    dffinal.at[i, 'L#'] = len(och_array)

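This avoids repeated lookups in dffinal and avoids creating a new copy of the dataframe several times. I can't test it without a data sample, but I think it will work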
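
If the per-row loop is still too slow, one possible alternative (not part of this answer, just a sketch) is to push the comparison down into NumPy broadcasting. The sample values are taken from the question; the names a, b and mask are hypothetical helpers of mine.

```python
import numpy as np
import pandas as pd

# Sample data taken from the question (only the columns used here).
df2 = pd.DataFrame({
    "och":  [-0.7, 0.9, 2.5, 0.1],
    "och1": [3, 6, 2, 4],
    "och2": [-1, 4, 3, 3],
    "och3": [1, 3, 2, 5],
    "cch1": [56, 7, 4, 10],
})
dffinal = pd.DataFrame({
    "och1": [3, 1],
    "och2": [3, 2],
    "och3": [1, 1],
    "cch1": [5, 3],
})

cols = ["och1", "och2", "och3", "cch1"]
a = df2[cols].to_numpy()      # shape (len(df2), 4)
b = dffinal[cols].to_numpy()  # shape (len(dffinal), 4)

# mask[i, j] is True when df2 row j is greater than dffinal row i in every column.
mask = (a[None, :, :] > b[:, None, :]).all(axis=2)

counts = mask.sum(axis=1)
sums = mask.astype(float) @ df2["och"].to_numpy()

# Leave NaN where no df2 row matched, like the mean of an empty selection.
means = np.full(len(dffinal), np.nan)
np.divide(sums, counts, out=means, where=counts > 0)

dffinal["LCH"] = means
dffinal["L#"] = counts
print(dffinal)
```

The intermediate boolean array has len(dffinal) * len(df2) * 4 elements, so for very large frames it may need to be processed in chunks of dffinal rows.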

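User

#3 · Edited 2024-05-13 12:03:36

This answer is based on https://stackoverflow.com/a/68197271/2954547, except that it uses itertuples instead of iterrows. itertuples is generally safer than iterrows because it preserves data types correctly. See the "Notes" section of the pandas iterrows documentation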
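
To make the dtype point concrete, here is a small sketch on a made-up two-column frame (the frame and its column names n and x are purely illustrative). iterrows hands back each row as a Series, so an integer column sitting next to a float column comes out as a float; itertuples keeps each field as it was.

```python
import pandas as pd

# Illustrative frame: one integer column and one float column.
df = pd.DataFrame({"n": [1, 2], "x": [0.5, 1.5]})

# iterrows yields (index, Series); the Series has one dtype, so n is upcast to float.
_, first_series = next(df.iterrows())
print(type(first_series["n"]))

# itertuples yields namedtuples; each field keeps the type of its own column.
first_tuple = next(df.itertuples())
print(type(first_tuple.n))
```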

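It is also self-contained: it can be executed from top to bottom without copy/pasting data, etc.

Note that I iterate over df1.itertuples rather than df_final.itertuples. Never mutate an object you are iterating over, and never iterate over an object you are mutating. Modifying a dataframe in place is a form of mutation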

import io

import pandas as pd


data1_txt = """
     date  id  och  och1  och2  och3  cch1  LCH  L#
3/27/2020   1 -2.1     3     3     1     5  NaN NaN
4/9/2020   2  2.0     1     2     1     3  NaN NaN
"""

data2_txt = """
     date  och  cch  och1  och2  och3  cch1
5/30/2012 -0.7 -0.7     3    -1     1    56
9/16/2013  0.9 -1.0     6     4     3     7
9/26/2013  2.5  5.4     2     3     2     4
8/26/2016  0.1 -0.7     4     3     5    10
"""

df1 = pd.read_fwf(io.StringIO(data1_txt), index_col='id')
df2 = pd.read_fwf(io.StringIO(data2_txt))

df_final = df1.copy()

for row in df1.itertuples():
    row_mask = (
        (df2['och1'] > row.och1) &
        (df2['och2'] > row.och2) &
        (df2['och3'] > row.och3) &
        (df2['cch1'] > row.cch1)
    )
    och_vals = df2.loc[row_mask, 'och']
    i = row.Index
    df_final.at[i, 'LCH'] = och_vals.mean()
    df_final.at[i, 'L#'] = len(och_vals)

print(df_final)

The output is

         date  och  och1  och2  och3  cch1  LCH  L#       LCH   L#
id                                                                
1   3/27/2020 -2.1     3     3     1     5  NaN NaN  0.900000  1.0
2    4/9/2020  2.0     1     2     1     3  NaN NaN  1.166667  3.0
