What is the fastest way to do this calculation in Pandas?

  Name  Start  End  Start_Diff_0  End_Diff_0  Start_Diff_1  End_Diff_1  Start_Diff_2  End_Diff_2
0    A     10   20             5          10            -5           0           -15         -10
1    B     20   30            15          20             5          10            -5           0
2    C     30   40            25          30            15          20             5          10

# Import required modules
import numpy as np
import pandas as pd
import timeit

# Original
def method_1():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Store data for new columns in a dictionary
    new_columns = {}
    for index1, row1 in df1.iterrows():
        for index2, row2 in df2.iterrows():
            key_start = 'Start_Diff_' + str(index2)
            key_end = 'End_Diff_' + str(index2)
            if (key_start in new_columns):
                new_columns[key_start].append(row1[1]-row2[0])
            else:
                new_columns[key_start] = [row1[1]-row2[0]]
            if (key_end in new_columns):
                new_columns[key_end].append(row1[2]-row2[1])
            else:
                new_columns[key_end] = [row1[2]-row2[1]]
    # Add dictionary data as new columns
    for key, value in new_columns.items():
        df1[key] = value

# jezrael - https://stackoverflow.com/a/60843750/452587
def method_2():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Convert selected columns to 2d numpy array
    a = df1[['Start', 'End']].to_numpy()
    b = df2[[0, 1]].to_numpy()
    # Output is 3d array; convert it to 2d array
    c = (a - b[:, None]).swapaxes(0, 1).reshape(a.shape[0], -1)
    # Generate columns names and with DataFrame.join; add to original
    cols = [item for x in range(b.shape[0]) for item in (f'Start_Diff_{x}', f'End_Diff_{x}')]
    df1 = df1.join(pd.DataFrame(c, columns=cols, index=df1.index))

# sammywemmy - https://stackoverflow.com/a/60844078/452587
def method_3():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Create numpy arrays of df1 and df2
    df1_start = df1.loc[:, 'Start'].to_numpy()
    df1_end = df1.loc[:, 'End'].to_numpy()
    df2_start = df2[0].to_numpy()
    df2_end = df2[1].to_numpy()
    # Use np tile to create shapes that allow elementwise subtraction
    tiled_start = np.tile(df1_start, (len(df2), 1)).T
    tiled_end = np.tile(df1_end, (len(df2), 1)).T
    # Subtract df2 from df1
    start = np.subtract(tiled_start, df2_start)
    end = np.subtract(tiled_end, df2_end)
    # Create columns for start and end
    start_columns = [f'Start_Diff_{num}' for num in range(len(df2))]
    end_columns = [f'End_Diff_{num}' for num in range(len(df2))]
    # Create dataframes of start and end
    start_df = pd.DataFrame(start, columns=start_columns)
    end_df = pd.DataFrame(end, columns=end_columns)
    # Lump start and end into one dataframe
    lump = pd.concat([start_df, end_df], axis=1)
    # Sort the columns by the digits at the end
    filtered = lump.columns[lump.columns.str.contains('\d')]
    cols = sorted(filtered, key=lambda x: x[-1])
    lump = lump.reindex(cols, axis='columns')
    # Hook lump back to df1
    df1 = pd.concat([df1,lump],axis=1)

print('Method 1:', timeit.timeit(method_1, number=3))
print('Method 2:', timeit.timeit(method_2, number=3))
print('Method 3:', timeit.timeit(method_3, number=3))

3 Answers

User

Answer 1 · edited 2024-04-24 11:25:35

I suggest using numpy here. As a first step, convert the selected columns to 2d numpy arrays:

a = df1[['Start','End']].to_numpy()
b = df2[[0,1]].to_numpy()

The subtraction produces a 3d array, so convert it to a 2d array:

c = (a - b[:, None]).swapaxes(0,1).reshape(a.shape[0],-1)
print (c)
[[  5  10  -5   0 -15 -10]
 [ 15  20   5  10  -5   0]
 [ 25  30  15  20   5  10]]
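The one-liner above packs three steps together. As a rough sketch of what each step does to the array shape, the snippet below rebuilds `a` and `b` directly as NumPy arrays (an assumption made only so it runs without `df1` and `df2`) and splits the expression apart; the intermediate names `diff` and `swapped` are illustrative and not part of the original answer.

```python
import numpy as np

# Same toy values as in the question, written as plain arrays (assumption:
# these stand in for df1[['Start', 'End']] and df2[[0, 1]]).
a = np.array([[10, 20], [20, 30], [30, 40]])   # shape (3, 2)
b = np.array([[5, 10], [15, 20], [25, 30]])    # shape (3, 2)

# b[:, None] has shape (3, 1, 2). Broadcasting it against a (3, 2) gives
# a (3, 3, 2) array indexed as [df2 row, df1 row, Start/End].
diff = a - b[:, None]
print(diff.shape)

# swapaxes moves the df1 row to the front: [df1 row, df2 row, Start/End].
swapped = diff.swapaxes(0, 1)

# reshape flattens the last two axes, so every df1 row becomes
# Start_Diff_0, End_Diff_0, Start_Diff_1, End_Diff_1, ...
c = swapped.reshape(a.shape[0], -1)
print(c)
```

Because the Start/End axis stays last before the reshape, the flattened columns come out already interleaved, which is why the column names can be generated pair by pair.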

Finally, generate the column names and add the result to the original with DataFrame.join:

cols = [item for x in range(b.shape[0]) for item in (f'Start_Diff_{x}', f'End_Diff_{x}')]
df = df1.join(pd.DataFrame(c, columns=cols, index=df1.index))
print (df)
  Name  Start  End  Start_Diff_0  End_Diff_0  Start_Diff_1  End_Diff_1  \
0    A     10   20             5          10            -5           0   
1    B     20   30            15          20             5          10   
2    C     30   40            25          30            15          20   

   Start_Diff_2  End_Diff_2  
0           -15         -10  
1            -5           0  
2             5          10

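User

Answer 2 · edited 2024-04-24 11:25:35

Don't use iterrows(). If you are simply subtracting values, use vectorization with NumPy (Pandas also offers vectorization, but NumPy is faster).

For example: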

df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)

col_names = "Start_Diff_1 End_Diff_1".split()
df3 = pd.DataFrame(df2.to_numpy() - 10, columns=col_names)

Here df3 is equal to:

    Start_Diff_1    End_Diff_1
0           -5              0
1           5               10
2           15              20

You can also change the column names by doing:

df2.columns = "Start_Diff_0 End_Diff_0".split()

You can use f-strings to change the column names inside a loop, i.e. f"Start_Diff_{i}", where i is a number in the loop.
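As a minimal sketch of how the f-string naming and the vectorized subtraction might fit together, the loop below runs over the rows of `df2` only, while each subtraction covers the whole `Start` or `End` column of `df1` at once. The names `pieces` and `result` are illustrative and do not come from the answer; the data is the small example from the question.

```python
import pandas as pd

df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]],
                   columns=['Name', 'Start', 'End'])
df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)

start = df1['Start'].to_numpy()
end = df1['End'].to_numpy()

pieces = [df1]
# Loop over the few rows of df2; each subtraction is vectorized over df1.
for i, (s, e) in enumerate(df2.to_numpy()):
    pieces.append(pd.DataFrame({f'Start_Diff_{i}': start - s,
                                f'End_Diff_{i}': end - e},
                               index=df1.index))

# Combine the original frame and the new column pairs side by side.
result = pd.concat(pieces, axis=1)
print(result)
```

This keeps a Python-level loop, so it is likely to be slower than a fully broadcast solution when `df2` has many rows, but it stays far away from the nested iterrows() pattern.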

You can also combine multiple DataFrames with:

df = pd.concat([df1, df2],axis=1)

User

Answer 3 · edited 2024-04-24 11:25:35

Here is one way to do it:

 #create numpy arrays of df1 and 2

df1_start = df1.loc[:,'Start'].to_numpy()
df1_end = df1.loc[:,'End'].to_numpy()

df2_start = df2[0].to_numpy()
df2_end = df2[1].to_numpy()

#use np tile to create shapes
#that allow element wise subtraction
tiled_start = np.tile(df1_start,(len(df2),1)).T
tiled_end = np.tile(df1_end,(len(df2),1)).T

#subtract df2 from df1
start = np.subtract(tiled_start,df2_start)
end = np.subtract(tiled_end, df2_end)

#create columns for start and end
start_columns = [f'Start_Diff_{num}' for num in range(len(df2))]
end_columns = [f'End_Diff_{num}' for num in range(len(df2))]

#create dataframes of start and end
start_df = pd.DataFrame(start,columns=start_columns)
end_df = pd.DataFrame(end, columns = end_columns)

#lump start and end into one dataframe
lump = pd.concat([start_df,end_df],axis=1)

#sort the columns by the digits at the end
filtered = lump.columns[lump.columns.str.contains(r'\d')]

cols = sorted(filtered, key = lambda x: x[-1])

lump = lump.reindex(cols,axis='columns')

#hook lump back to df1
final = pd.concat([df1,lump],axis=1)
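One caveat with sorting on `x[-1]`: the key is only the last character of the name, so once `df2` has more than ten rows, `Start_Diff_10` sorts next to `Start_Diff_0`. The sketch below is a possible variation, not part of the original answer: it keeps the np.tile approach but interleaves the two name lists with `zip` instead of sorting. The 12-row `df2` is hypothetical data chosen only to produce two-digit suffixes.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]],
                   columns=['Name', 'Start', 'End'])
# Hypothetical df2 with 12 rows, so the suffixes run from 0 to 11.
df2 = pd.DataFrame({0: np.arange(12), 1: np.arange(12) + 5})

# Tile df1's columns so each has shape (len(df1), len(df2)), then subtract.
tiled_start = np.tile(df1['Start'].to_numpy(), (len(df2), 1)).T
tiled_end = np.tile(df1['End'].to_numpy(), (len(df2), 1)).T
start = np.subtract(tiled_start, df2[0].to_numpy())
end = np.subtract(tiled_end, df2[1].to_numpy())

start_columns = [f'Start_Diff_{num}' for num in range(len(df2))]
end_columns = [f'End_Diff_{num}' for num in range(len(df2))]
lump = pd.concat([pd.DataFrame(start, columns=start_columns),
                  pd.DataFrame(end, columns=end_columns)], axis=1)

# Sorting by the last character groups _0 with _10 and _1 with _11.
by_last_char = sorted(lump.columns, key=lambda x: x[-1])

# Interleave the name lists explicitly: Start_Diff_0, End_Diff_0, Start_Diff_1, ...
cols = [name for pair in zip(start_columns, end_columns) for name in pair]
lump = lump.reindex(cols, axis='columns')

final = pd.concat([df1, lump], axis=1)
print(final.columns.tolist())
```

For the three-row `df2` in the question the two orderings agree, so this only matters for larger inputs.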
