Answers to the question: Creating a list of points within each group of a DataFrame

Creating a list of points within each group of a DataFrame


I have a dataset structured like <code>example_df</code> below:

<pre><code>import numpy as np
import pandas as pd

example_df = pd.DataFrame({
    'measurement_id': np.concatenate([[0] * 300, [1] * 300]),
    'min': np.concatenate([np.repeat(range(0, 30), 10), np.repeat(range(0, 30), 10)]),
    'grp': list(np.repeat(['A', 'B'], 5)) * 60,
    'grp2': list(np.random.choice([0, 1, 2], 10)) * 60,
    'obj': np.array(list(range(0, 10)) * 60),
    'x': np.random.normal(0.0, 10.0, 600),
    'y': np.random.normal(50.0, 40.0, 600),
})
</code></pre>

I also have a function that takes a list of points as input and performs some computation. I want to prepare my data by building a list of points within each group of the grouped DataFrame. My current solution is:

<pre><code>def df_to_points(df):
    # Convert every row of the (x, y) sub-frame into a tuple
    points = []
    for index, row in df.iterrows():
        points.append(tuple(row))
    return points

res = example_df \
    .groupby(['measurement_id', 'min', 'grp']) \
    .apply(lambda x: [df_to_points(g[['x', 'y']]) for _, g in x.groupby('grp2')])

res.head(5)

measurement_id  min  grp
0               0    A      [[(7.435996920897324, 63.64844826366264), (-9....
                1    B      [[(-10.213911323779579, 108.64263032884301), (...
                2    A      [[(6.004534743892181, 38.11898691750269), (12....
                3    B      [[(-11.486905682289555, 68.26172126981378), (-...
                4    A      [[(7.5612638943199295, 28.756743327333556), (-...
</code></pre>

Each row of the <code>res</code> Series looks like this:

<pre><code>[[(7.435996920897324, 63.64844826366264), (-9.722976872232584, 11.831678494223155), (10.809492206072777, 82.9238481225157), (-7.918248246978473, 58.46902598333271)],
 [(6.270634566510545, 59.10653240815831), (-5.765185730532471, 22.232739287056663), (-13.129531349093371, 85.02932179274353)],
 [(0.6686875099768917, 60.634711491838786), (-7.373072676442981, 30.897262347426693), (-11.489744246260528, 6.834296232736001)]]
</code></pre>

The problem is that my original DataFrame has several million rows, and this solution feels like it could benefit from some optimization. The current runtime on the example is:

<pre><code>%timeit res = example_df \
    .groupby(['measurement_id', 'min', 'grp']) \
    .apply(lambda x: [df_to_points(g[['x', 'y']]) for _, g in x.groupby('grp2')])
289 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
</code></pre>

So my questions are:

<ol>
<li>Would replacing the lists of tuples with multidimensional <code>numpy</code> arrays improve performance?</li>
<li>Are there any major bottlenecks I should avoid in order to improve speed?</li>
</ol>

@Edit: an example with a different number of objects in the groups defined by <code>grp</code>:

<pre><code>example_df2 = pd.DataFrame({
    'measurement_id': np.concatenate([[0] * 300, [1] * 300]),
    'min': np.concatenate([np.repeat(range(0, 30), 10), np.repeat(range(0, 30), 10)]),
    'grp': list(np.repeat(['A', 'B', 'C'], [4, 4, 2])) * 60,
    'grp2': list(np.random.choice([0, 1, 2], 10)) * 60,
    'obj': np.array(list(range(0, 10)) * 60),
    'x': np.random.normal(0.0, 10.0, 600),
    'y': np.random.normal(50.0, 40.0, 600),
})
</code></pre>
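One possible direction for both questions, offered as a minimal sketch rather than a benchmarked answer: <code>iterrows</code> builds a pandas Series for every row, and the nested <code>groupby</code> inside <code>apply</code> repeats grouping work for every outer group, so both are likely candidates for the bottleneck. The sketch below groups once over all four keys and builds the points from whole columns. The function name <code>points_per_group</code>, the <code>as_array</code> flag, and the seeded generator are hypothetical and not part of the original post; whether the NumPy variant helps depends on whether the downstream function accepts an <code>(n, 2)</code> array.

```python
import numpy as np
import pandas as pd


def points_per_group(df, as_array=False):
    """Collect the (x, y) points of every grp2 subgroup inside each
    (measurement_id, min, grp) group. Hypothetical helper, not from the post."""
    keys = ['measurement_id', 'min', 'grp']
    nested = {}
    # One groupby over all four keys. Groups are yielded in sorted key order,
    # so the grp2 subgroups of an outer key arrive consecutively, ascending.
    for key, g in df.groupby(keys + ['grp2']):
        if as_array:
            points = g[['x', 'y']].to_numpy()  # one (n, 2) float array
        else:
            # Build tuples from whole columns instead of iterating over rows
            points = list(zip(g['x'].tolist(), g['y'].tolist()))
        nested.setdefault(key[:3], []).append(points)

    # Fill an object array element by element so each nested list is stored as-is
    values = np.empty(len(nested), dtype=object)
    for i, sublists in enumerate(nested.values()):
        values[i] = sublists
    index = pd.MultiIndex.from_tuples(list(nested), names=keys)
    return pd.Series(values, index=index)


# Same shape as example_df from the question; the fixed seed is an assumption
# added only to make the sketch reproducible.
rng = np.random.default_rng(0)
example_df = pd.DataFrame({
    'measurement_id': np.concatenate([[0] * 300, [1] * 300]),
    'min': np.concatenate([np.repeat(range(0, 30), 10), np.repeat(range(0, 30), 10)]),
    'grp': list(np.repeat(['A', 'B'], 5)) * 60,
    'grp2': rng.choice([0, 1, 2], 10).tolist() * 60,
    'obj': np.array(list(range(0, 10)) * 60),
    'x': rng.normal(0.0, 10.0, 600),
    'y': rng.normal(50.0, 40.0, 600),
})

res = points_per_group(example_df)
print(res.head(5))
```

Passing <code>as_array=True</code> stores one NumPy array per <code>grp2</code> subgroup instead of a list of tuples. Timing both variants with <code>%timeit</code> on a realistic slice of the data would show whether either one actually helps at the scale of several million rows.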


1 answer

Anonymous, 1 day ago
