Apply a function to sets of columns in pandas, "looping" column-wise over the entire DataFrame


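Here is a test example that shows what I am trying to achieve. This is a toy DataFrame:
<pre><code>headers = ['Time', 'A_x', 'A_y', 'A_z', 'B_x', 'B_y', 'B_z']
df = pd.DataFrame(np.random.randn(10,7),index=range(1,11),columns=headers)
</code></pre>
which gives
<pre><code>        Time       A_x       A_y       A_z       B_x       B_y       B_z
1  -0.075509 -0.123527 -0.547239 -0.453707 -0.969796  0.248761  1.369613
2  -0.206369 -0.112098 -1.122609  0.218538 -0.878985  0.566872 -1.048862
3  -0.194552  0.818276 -1.563931  0.097377  1.641384 -0.766217 -1.482096
4   0.502731  0.766515 -0.650482 -0.087203 -0.089075  0.443969  0.354747
5   1.411380 -2.419204 -0.882383  0.005204 -0.204358 -0.999242 -0.395236
6   1.036695  1.115630  0.081825 -1.038442  0.515798 -0.060016  2.669702
7   0.392943  0.226386  0.039879  0.732611 -0.073447  1.164285  1.034357
8  -1.253264  0.389148  0.158289  0.440282 -1.195860  0.872064  0.906377
9  -0.133580 -0.308314 -0.839347 -0.517989  0.652120  0.477232 -0.391767
10  0.623841  0.473552  0.059428  0.726088 -0.593291 -3.186297 -0.846863
</code></pre>
All I want to do is compute the length of the vector for each header (A and B in this case), for each index, and divide it by the <code>Time</code> column. So the function needs to be <code>np.sqrt(A_x**2 + A_y**2 + A_z**2)</code>, and of course the same for B. In other words, I want to compute a velocity for each row, where three columns contribute to one velocity result.

I have tried using <code>df.groupby</code> and <code>df.filter</code> to loop over the columns, but I cannot really get it to work, because I am not at all sure how to apply the same function efficiently to chunks of the DataFrame in one go (obviously, one wants to avoid looping over the rows). I tried
<pre><code>df = df.apply(lambda x: np.sqrt(x.dot(x)), axis=1)
</code></pre>
This works, of course, but only if the input DataFrame has the right number of columns (3). If it is wider, the dot product is computed over the whole row rather than over the chunks of three columns I want (the columns correspond to marker coordinates, which are three-dimensional).

So this is what I would eventually like to get for the example above (the array below is just filled with random numbers, not the actual velocities I am trying to compute; it only shows the shape I want to end up with):
<pre><code>    Velocity_A  Velocity_B
1    -0.975633   -2.669544
2     0.766405   -0.264904
3     0.425481   -0.429894
4    -0.437316    0.954006
5     1.073352   -1.475964
6    -0.647534    0.937035
7     0.082517    0.438112
8    -0.387111   -1.417930
9    -0.111011    1.068530
10    0.451979   -0.053333
</code></pre>
My actual data is 50000 x 36 (so 12 markers with x, y, z coordinates each), and I would like to compute all the velocities at once to avoid iterating, if possible. There is also a Time column of the same length (50000 x 1).

How would you do this?

Thanks, Astrid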


1 answer


Your calculation is more NumPy-style than pandas-style. By that I mean that if you treat your DataFrame as just one big array, the calculation can be expressed concisely, whereas the solution gets more complicated (at least the one I came up with) when you try to wrangle the DataFrame with melt, groupby, and so on.

The whole calculation can be expressed in essentially one line:
<pre><code>np.sqrt((arr**2).reshape(arr.shape[0],-1,3).sum(axis=-1))/times[:,None]
</code></pre>
So here is the NumPy way:
<pre><code>import io

import numpy as np
import pandas as pd

content = '''\
Time A_x A_y A_z B_x B_y B_z
-0.075509 -0.123527 -0.547239 -0.453707 -0.969796 0.248761 1.369613
-0.206369 -0.112098 -1.122609 0.218538 -0.878985 0.566872 -1.048862
-0.194552 0.818276 -1.563931 0.097377 1.641384 -0.766217 -1.482096
0.502731 0.766515 -0.650482 -0.087203 -0.089075 0.443969 0.354747
1.411380 -2.419204 -0.882383 0.005204 -0.204358 -0.999242 -0.395236
1.036695 1.115630 0.081825 -1.038442 0.515798 -0.060016 2.669702
0.392943 0.226386 0.039879 0.732611 -0.073447 1.164285 1.034357
-1.253264 0.389148 0.158289 0.440282 -1.195860 0.872064 0.906377
-0.133580 -0.308314 -0.839347 -0.517989 0.652120 0.477232 -0.391767
0.623841 0.473552 0.059428 0.726088 -0.593291 -3.186297 -0.846863'''

# content is a str, so wrap it in StringIO; header=0 takes the first row as column names
df = pd.read_table(io.StringIO(content), sep=r'\s+', header=0)
arr = df.values
times = arr[:,0]   # the Time column
arr = arr[:,1:]    # the coordinate columns, in x, y, z triples
# reshape to (rows, markers, 3) and sum the squares over each triple
result = np.sqrt((arr**2).reshape(arr.shape[0],-1,3).sum(axis=-1))/times[:,None]
result = pd.DataFrame(result, columns=['Velocity_%s'%(x,) for x in list('AB')])
print(result)
</code></pre>
which yields
<pre><code>   Velocity_A  Velocity_B
0   -9.555311  -22.467965
1   -5.568487   -7.177625
2   -9.086257  -12.030091
3    2.007230    1.144208
4    1.824531    0.775006
5    1.472305    2.623467
6    1.954044    3.967796
7   -0.485576   -1.384815
8   -7.736036   -6.722931
9    1.392823    5.369757
</code></pre>
<hr/>
Since your actual DataFrame has shape (50000, 36), choosing a fast method may matter. Here is a benchmark:
<pre><code>import string

import numpy as np
import pandas as pd

N = 12
col_ids = string.ascii_letters[:N]   # string.letters exists only in Python 2
df = pd.DataFrame(
    np.random.randn(50000, 3*N+1),
    columns=['Time']+['{}_{}'.format(letter, coord) for letter in col_ids
                      for coord in list('xyz')])

def using_numpy(df):
    arr = df.values
    times = arr[:,0]
    arr = arr[:,1:]
    result = np.sqrt((arr**2).reshape(arr.shape[0],-1,3).sum(axis=-1))/times[:,None]
    result = pd.DataFrame(result, columns=['Velocity_%s'%(x,) for x in col_ids])
    return result

def using_loop(df):
    results = pd.DataFrame(index=df.index)   # the result container
    for col_id in col_ids:
        # select this marker's three columns by name prefix
        results['Velocity_'+col_id] = np.sqrt((df.filter(regex=col_id+'_')**2).sum(axis=1))/df.Time
    return results
</code></pre>
Using <a href="http://ipython.org" rel="nofollow">IPython</a>:
<pre><code>In [43]: %timeit using_numpy(df)
10 loops, best of 3: 34.7 ms per loop

In [44]: %timeit using_loop(df)
10 loops, best of 3: 82 ms per loop
</code></pre>
In this run the NumPy version is roughly 2.4 times faster than the loop over markers.
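To make the reshape step easier to check by hand, here is a minimal sketch on a tiny input whose vector lengths are known in advance. The helper name `marker_speeds`, its `time_col` parameter, and the `toy` frame are hypothetical and not part of the original answer. The sketch also assumes that the columns are named `<marker>_<axis>` and that each marker's x, y, z columns sit next to each other in that order, which is what the reshape relies on.

```python
import numpy as np
import pandas as pd


def marker_speeds(df, time_col="Time"):
    """Per-row length of each (x, y, z) triple, divided by the time column.

    Hypothetical helper for illustration only. Assumes the coordinate
    columns come in contiguous x, y, z triples named "<marker>_<axis>".
    """
    coords = df.drop(columns=time_col)
    # Take the marker name from the first column of every triple.
    markers = [name.rsplit("_", 1)[0] for name in coords.columns[::3]]
    arr = coords.to_numpy(dtype=float)
    # (rows, 3*k) -> (rows, k, 3); summing the last axis adds x^2 + y^2 + z^2.
    lengths = np.sqrt((arr ** 2).reshape(len(arr), -1, 3).sum(axis=-1))
    times = df[time_col].to_numpy(dtype=float)
    # times[:, None] has shape (rows, 1), so it broadcasts across markers.
    return pd.DataFrame(
        lengths / times[:, None],
        index=df.index,
        columns=["Velocity_" + m for m in markers],
    )


# Hand-checkable input: |(3, 4, 0)| = 5 and |(0, 0, 2)| = 2 at Time = 1;
# the second row doubles both the coordinates and the time.
toy = pd.DataFrame(
    {
        "Time": [1.0, 2.0],
        "A_x": [3.0, 6.0], "A_y": [4.0, 8.0], "A_z": [0.0, 0.0],
        "B_x": [0.0, 0.0], "B_y": [0.0, 0.0], "B_z": [2.0, 4.0],
    },
    index=[1, 2],
)
speeds = marker_speeds(toy)
print(speeds)
```

Unlike the snippet in the answer, this sketch keeps the original row index and derives the output column names from the input headers. If the columns might not be ordered in triples, sorting or reindexing them first would be needed before the reshape is valid. Note also that a negative value in the time column produces a negative result, as in the toy output shown in the answer.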
