Applying a function to sets of columns in pandas, "looping" over the whole DataFrame column-wise

         Time       A_x       A_y       A_z       B_x       B_y       B_z
1   -0.075509 -0.123527 -0.547239 -0.453707 -0.969796  0.248761  1.369613
2   -0.206369 -0.112098 -1.122609  0.218538 -0.878985  0.566872 -1.048862
3   -0.194552  0.818276 -1.563931  0.097377  1.641384 -0.766217 -1.482096
4    0.502731  0.766515 -0.650482 -0.087203 -0.089075  0.443969  0.354747
5    1.411380 -2.419204 -0.882383  0.005204 -0.204358 -0.999242 -0.395236
6    1.036695  1.115630  0.081825 -1.038442  0.515798 -0.060016  2.669702
7    0.392943  0.226386  0.039879  0.732611 -0.073447  1.164285  1.034357
8   -1.253264  0.389148  0.158289  0.440282 -1.195860  0.872064  0.906377
9   -0.133580 -0.308314 -0.839347 -0.517989  0.652120  0.477232 -0.391767
10   0.623841  0.473552  0.059428  0.726088 -0.593291 -3.186297 -0.846863

    Velocity_A  Velocity_B
1    -0.975633   -2.669544
2     0.766405   -0.264904
3     0.425481   -0.429894
4    -0.437316    0.954006
5     1.073352   -1.475964
6    -0.647534    0.937035
7     0.082517    0.438112
8    -0.387111   -1.417930
9    -0.111011    1.068530
10    0.451979   -0.053333

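3 answers

User

Answer 1 · edited 2024-05-13 21:15:30

I would at least loop over the identifier labels. Don't worry, though: it is a very fast loop that only builds the filter pattern needed to pick the right columns: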

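import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 7), index=range(1, 11), columns='Time A_x A_y A_z B_x B_y B_z'.split())

col_ids = ['A', 'B']  # I guess you can create that one easily

results = pd.DataFrame(index=df.index)  # the result container

for col_id in col_ids:  # col_id rather than id, to avoid shadowing the builtin
    # '^' anchors the pattern so that only columns starting with e.g. 'A_' match
    results['Velocity_' + col_id] = np.sqrt((df.filter(regex='^' + col_id + '_')**2).sum(axis=1)) / df['Time']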
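
The list of identifiers does not have to be typed by hand. As a small sketch, assuming every vector column is named `<id>_<component>` and `Time` is the only column without an underscore, the prefixes can be collected from the column names. The helper name `vector_ids` is made up for this illustration and is not part of pandas:

```python
import numpy as np
import pandas as pd


def vector_ids(columns):
    """Return the distinct prefixes before the first '_' in order of appearance."""
    seen = []
    for name in columns:
        if '_' in name:
            prefix = name.split('_', 1)[0]
            if prefix not in seen:
                seen.append(prefix)
    return seen


df = pd.DataFrame(np.random.randn(10, 7), index=range(1, 11),
                  columns='Time A_x A_y A_z B_x B_y B_z'.split())
col_ids = vector_ids(df.columns)
print(col_ids)  # ['A', 'B']
```

If some unrelated column also contains an underscore, this sketch would pick it up as well, so the naming assumption matters.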

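User

Answer 2 · edited 2024-05-13 21:15:30

Your calculation is better suited to NumPy than to pandas idioms. By that I mean: if you treat your DataFrame as nothing more than one big array, the calculation can be expressed concisely, whereas the solution (at least the one I came up with) gets more complicated once you try to wrangle the DataFrame with melt, groupby, and so on.

The whole calculation can essentially be expressed in one line: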

np.sqrt((arr**2).reshape(arr.shape[0],-1,3).sum(axis=-1))/times[:,None]
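
To see why the reshape works, here is a tiny worked example with made-up numbers (two rows, two vectors of three components each). It assumes, as in the question, that the three component columns of each vector are adjacent and that the `Time` column has already been split off:

```python
import numpy as np

# Two rows; columns are A_x, A_y, A_z, B_x, B_y, B_z (illustrative values).
arr = np.array([[3.0, 4.0, 0.0, 1.0, 2.0, 2.0],
                [0.0, 0.0, 5.0, 6.0, 8.0, 0.0]])
times = np.array([1.0, 2.0])

# (2, 6) -> (2, 2, 3): the last axis now holds one vector's three components.
grouped = (arr**2).reshape(arr.shape[0], -1, 3)

# Summing over the last axis gives the squared norm of each vector.
norms = np.sqrt(grouped.sum(axis=-1))

# times[:, None] has shape (2, 1) and broadcasts across both vectors.
result = norms / times[:, None]
print(result)
```

The first row gives norms 5 and 3 divided by 1, and the second row gives norms 5 and 10 divided by 2.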

So here is a NumPy-based way:

import numpy as np
import pandas as pd
import io
content = '''
Time       A_x       A_y       A_z       B_x       B_y       B_z
-0.075509 -0.123527 -0.547239 -0.453707 -0.969796  0.248761  1.369613
-0.206369 -0.112098 -1.122609  0.218538 -0.878985  0.566872 -1.048862
-0.194552  0.818276 -1.563931  0.097377  1.641384 -0.766217 -1.482096
 0.502731  0.766515 -0.650482 -0.087203 -0.089075  0.443969  0.354747
 1.411380 -2.419204 -0.882383  0.005204 -0.204358 -0.999242 -0.395236
 1.036695  1.115630  0.081825 -1.038442  0.515798 -0.060016  2.669702
 0.392943  0.226386  0.039879  0.732611 -0.073447  1.164285  1.034357
-1.253264  0.389148  0.158289  0.440282 -1.195860  0.872064  0.906377
-0.133580 -0.308314 -0.839347 -0.517989  0.652120  0.477232 -0.391767
 0.623841  0.473552  0.059428  0.726088 -0.593291 -3.186297 -0.846863'''

# content is a str, so wrap it in StringIO; header=0 uses the first row as column names
df = pd.read_table(io.StringIO(content), sep=r'\s+', header=0)

arr = df.values
times = arr[:,0]
arr = arr[:,1:]
result = np.sqrt((arr**2).reshape(arr.shape[0],-1,3).sum(axis=-1))/times[:,None]
result = pd.DataFrame(result, columns=['Velocity_%s'%(x,) for x in list('AB')])
print(result)

which yields

   Velocity_A  Velocity_B
0   -9.555311  -22.467965
1   -5.568487   -7.177625
2   -9.086257  -12.030091
3    2.007230    1.144208
4    1.824531    0.775006
5    1.472305    2.623467
6    1.954044    3.967796
7   -0.485576   -1.384815
8   -7.736036   -6.722931
9    1.392823    5.369757

Since your actual DataFrame has shape (50000, 36), choosing a fast method may matter. Here is a benchmark:

import numpy as np
import pandas as pd
import string

N = 12
col_ids = string.ascii_letters[:N]  # string.letters exists only in Python 2
df = pd.DataFrame(
    np.random.randn(50000, 3*N+1), 
    columns=['Time']+['{}_{}'.format(letter, coord) for letter in col_ids
                      for coord in list('xyz')])


def using_numpy(df):
    arr = df.values
    times = arr[:,0]
    arr = arr[:,1:]
    result = np.sqrt((arr**2).reshape(arr.shape[0],-1,3).sum(axis=-1))/times[:,None]
    result = pd.DataFrame(result, columns=['Velocity_%s'%(x,) for x in col_ids])
    return result

def using_loop(df):
    results = pd.DataFrame(index=df.index) # the result container
    for id in col_ids:
        results['Velocity_'+id] = np.sqrt((df.filter(regex=id+'_')**2).sum(axis=1))/df.Time
    return results

Using IPython:

In [43]: %timeit using_numpy(df)
10 loops, best of 3: 34.7 ms per loop

In [44]: %timeit using_loop(df)
10 loops, best of 3: 82 ms per loop
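
Outside IPython, the same comparison can be sketched with the standard `timeit` module. The version below is an illustration under a few assumptions: it uses a smaller frame (5000 rows) and a seeded random generator so it runs quickly, and it first checks that both functions return the same numbers. Timings depend on the machine and library versions, so none are quoted here:

```python
import string
import timeit

import numpy as np
import pandas as pd

N = 12
col_ids = string.ascii_lowercase[:N]
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
df = pd.DataFrame(
    rng.standard_normal((5000, 3 * N + 1)),
    columns=['Time'] + ['{}_{}'.format(letter, coord)
                        for letter in col_ids for coord in 'xyz'])


def using_numpy(df):
    arr = df.to_numpy()
    times = arr[:, 0]
    arr = arr[:, 1:]
    result = np.sqrt((arr**2).reshape(arr.shape[0], -1, 3).sum(axis=-1)) / times[:, None]
    return pd.DataFrame(result, columns=['Velocity_%s' % x for x in col_ids])


def using_loop(df):
    results = pd.DataFrame(index=df.index)
    for col_id in col_ids:
        block = df.filter(regex='^' + col_id + '_')
        results['Velocity_' + col_id] = np.sqrt((block**2).sum(axis=1)) / df['Time']
    return results


# Both approaches should agree up to floating-point rounding.
same = np.allclose(using_numpy(df).to_numpy(), using_loop(df).to_numpy())
print('results agree:', same)

# Average wall-clock time over a few calls of each function.
for func in (using_numpy, using_loop):
    seconds = timeit.timeit(lambda: func(df), number=5) / 5
    print('{}: {:.2f} ms per call'.format(func.__name__, seconds * 1e3))
```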

User

Answer 3 · edited 2024-05-13 21:15:30

A possible start.

Filter out the column names that correspond to a particular vector. For example:

In [20]: [c for c in df.columns if c.startswith("A_")]
Out[20]: ['A_x', 'A_y', 'A_z']

Select those columns from the DataFrame:

In [22]: df[[c for c in df.columns if c.startswith("A_")]]
Out[22]: 
         A_x       A_y       A_z
1  -0.123527 -0.547239 -0.453707
2  -0.112098 -1.122609  0.218538
3   0.818276 -1.563931  0.097377
4   0.766515 -0.650482 -0.087203
5  -2.419204 -0.882383  0.005204
6   1.115630  0.081825 -1.038442
7   0.226386  0.039879  0.732611
8   0.389148  0.158289  0.440282
9  -0.308314 -0.839347 -0.517989
10  0.473552  0.059428  0.726088

So with this technique you can get the data in blocks of 3 columns. For example:

column_initials = ["A","B"]
for column_initial in column_initials:
    # a list comprehension works on Python 3, where filter() returns an iterator
    cols = [c for c in df.columns if c.startswith(column_initial + "_")]
    df["Velocity_" + column_initial] = df[cols].apply(lambda x: np.sqrt(x.dot(x)), axis=1) / df.Time


In [32]: df[['Velocity_A','Velocity_B']]
Out[32]: 
    Velocity_A  Velocity_B
1    -9.555311  -22.467965
2    -5.568487   -7.177625
3    -9.086257  -12.030091
4     2.007230    1.144208
5     1.824531    0.775006
6     1.472305    2.623467
7     1.954044    3.967796
8    -0.485576   -1.384815
9    -7.736036   -6.722931
10    1.392823    5.369757
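
The row-wise `apply` calls a Python function once per row, which can get slow on large frames. As a hedged alternative, `np.linalg.norm` with `axis=1` computes the same Euclidean norm for all rows at once. The small frame below holds illustrative values, not the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Time': [1.0, 2.0],
                   'A_x': [3.0, 0.0], 'A_y': [4.0, 0.0], 'A_z': [0.0, 5.0],
                   'B_x': [1.0, 6.0], 'B_y': [2.0, 8.0], 'B_z': [2.0, 0.0]})

for column_initial in ['A', 'B']:
    cols = [c for c in df.columns if c.startswith(column_initial + '_')]
    # Euclidean norm of each row of the 3-column block, then divide by Time.
    df['Velocity_' + column_initial] = (
        np.linalg.norm(df[cols].to_numpy(), axis=1) / df['Time'])

print(df[['Velocity_A', 'Velocity_B']])
```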

The answers I get differ from yours. However, I borrowed your df.apply(lambda x: np.sqrt(x.dot(x)), axis=1) and assumed it is correct.

Hope this helps.
