提高Pandas合并绩效问题的回答

提高Pandas合并绩效

回答此问题可获得 20 贡献值，回答如果被采纳可获得 50 分。

正如其他文章所建议的，我特别没有Pands Merge的性能问题，但是我有一个类，其中有很多方法，在数据集上做了很多合并。 这个班大约有10个小组，大约有15个合并。虽然groupby相当快，但在类的1.5秒的总执行时间中，这15个合并调用大约需要0.7秒。 我想在合并调用中加快性能。因为我将有大约4000次迭代，因此在一次迭代中节省0.5秒将导致整体性能降低大约30分钟，这将是非常好的。 有什么建议我应该试试吗？我试过：赛松麻木，麻木比较慢。 谢谢 编辑1：添加示例代码段：我的合并语句： <pre><code>tmpDf = pd.merge(self.data, t1, on='APPT_NBR', how='left') tmp = tmpDf tmpDf = pd.merge(tmp, t2, on='APPT_NBR', how='left') tmp = tmpDf tmpDf = pd.merge(tmp, t3, on='APPT_NBR', how='left') tmp = tmpDf tmpDf = pd.merge(tmp, t4, on='APPT_NBR', how='left') tmp = tmpDf tmpDf = pd.merge(tmp, t5, on='APPT_NBR', how='left') </code></pre> 通过实现连接，我合并了以下条件： <pre><code>dat = self.data.set_index('APPT_NBR') t1.set_index('APPT_NBR', inplace=True) t2.set_index('APPT_NBR', inplace=True) t3.set_index('APPT_NBR', inplace=True) t4.set_index('APPT_NBR', inplace=True) t5.set_index('APPT_NBR', inplace=True) tmpDf = dat.join(t1, how='left') tmpDf = tmpDf.join(t2, how='left') tmpDf = tmpDf.join(t3, how='left') tmpDf = tmpDf.join(t4, how='left') tmpDf = tmpDf.join(t5, how='left') tmpDf.reset_index(inplace=True) </code></pre> 注意，所有这些都是名为：def merge_early_created_values（self）：的函数的一部分 当我在profilehooks中执行timedcall时 <pre><code>@timedcall(immediate=True) def merge_earlier_created_values(self): </code></pre> 我得到以下结果： 该方法的分析结果给出： <pre><code>@profile(immediate=True) def merge_earlier_created_values(self): </code></pre> 使用Merge分析函数如下： <pre><code>*** PROFILER RESULTS *** merge_earlier_created_values (E:\Projects\Predictive Inbound Cartoon Estimation-MLO\Python\CodeToSubmit\helpers\get_prev_data_by_date.py:122) function called 1 times 71665 function calls (70588 primitive calls) in 0.524 seconds Ordered by: cumulative time, internal time, call count List reduced from 563 to 40 due to restriction <40> ncalls tottime percall cumtime percall filename:lineno(function) 1 0.012 0.012 0.524 0.524 get_prev_data_by_date.py:122(merge_earlier_created_values) 14 0.000 0.000 0.285 0.020 generic.py:1901(_update_inplace) 14 0.000 0.000 0.285 0.020 generic.py:1402(_maybe_update_cacher) 19 0.000 0.000 0.284 0.015 generic.py:1492(_check_setitem_copy) 7 0.283 0.040 0.283 0.040 {built-in method gc.collect} 15 0.000 0.000 0.181 0.012 generic.py:1842(drop) 10 0.000 0.000 0.153 0.015 merge.py:26(merge) 10 0.000 0.000 0.140 0.014 merge.py:201(get_result) 8/4 0.000 0.000 0.126 0.031 decorators.py:65(wrapper) 4 0.000 0.000 0.126 0.031 frame.py:3028(drop_duplicates) 1 0.000 0.000 0.102 0.102 get_prev_data_by_date.py:264(recreate_previous_cartons) 1 0.000 0.000 0.101 0.101 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date) 1 0.000 0.000 0.098 0.098 get_prev_data_by_date.py:360(recreate_previous_freight_type) 10 0.000 0.000 0.092 0.009 internals.py:4455(concatenate_block_managers) 10 0.001 0.000 0.088 0.009 internals.py:4471(<listcomp>) 120 0.001 0.000 0.084 0.001 internals.py:4559(concatenate_join_units) 266 0.004 0.000 0.067 0.000 common.py:733(take_nd) 120 0.000 0.000 0.061 0.001 internals.py:4569(<listcomp>) 120 0.003 0.000 0.061 0.001 internals.py:4814(get_reindexed_values) 1 0.000 0.000 0.059 0.059 get_prev_data_by_date.py:295(recreate_previous_appt_status) 10 0.000 0.000 0.038 0.004 merge.py:322(_get_join_info) 10 0.001 0.000 0.036 0.004 merge.py:516(_get_join_indexers) 25 0.001 0.000 0.024 0.001 merge.py:687(_factorize_keys) 74 0.023 0.000 0.023 0.000 {pandas.algos.take_2d_axis1_object_object} 50 0.022 0.000 0.022 0.000 {method 'factorize' of 'pandas.hashtable.Int64Factorizer' objects} 120 0.003 0.000 0.022 0.000 internals.py:4479(get_empty_dtype_and_na) 88 0.000 0.000 0.021 0.000 frame.py:1969(__getitem__) 1 0.000 0.000 0.019 0.019 get_prev_data_by_date.py:328(recreate_previous_location_numbers) 39 0.000 0.000 0.018 0.000 internals.py:3495(reindex_indexer) 537 0.017 0.000 0.017 0.000 {built-in method numpy.core.multiarray.empty} 15 0.000 0.000 0.017 0.001 ops.py:725(wrapper) 15 0.000 0.000 0.015 0.001 frame.py:2011(_getitem_array) 24 0.000 0.000 0.014 0.001 internals.py:3625(take) 10 0.000 0.000 0.014 0.001 merge.py:157(__init__) 10 0.000 0.000 0.014 0.001 merge.py:382(_get_merge_keys) 15 0.008 0.001 0.013 0.001 ops.py:662(na_op) 234 0.000 0.000 0.013 0.000 common.py:158(isnull) 234 0.001 0.000 0.013 0.000 common.py:179(_isnull_new) 15 0.000 0.000 0.012 0.001 generic.py:1609(take) 20 0.000 0.000 0.012 0.001 generic.py:2191(reindex) </code></pre> 使用联接进行分析的方法如下： <pre><code>65079 function calls (63990 primitive calls) in 0.550 seconds Ordered by: cumulative time, internal time, call count List reduced from 592 to 40 due to restriction <40> ncalls tottime percall cumtime percall filename:lineno(function) 1 0.016 0.016 0.550 0.550 get_prev_data_by_date.py:122(merge_earlier_created_values) 14 0.000 0.000 0.295 0.021 generic.py:1901(_update_inplace) 14 0.000 0.000 0.295 0.021 generic.py:1402(_maybe_update_cacher) 19 0.000 0.000 0.294 0.015 generic.py:1492(_check_setitem_copy) 7 0.293 0.042 0.293 0.042 {built-in method gc.collect} 10 0.000 0.000 0.173 0.017 generic.py:1842(drop) 10 0.000 0.000 0.139 0.014 merge.py:26(merge) 8/4 0.000 0.000 0.138 0.034 decorators.py:65(wrapper) 4 0.000 0.000 0.138 0.034 frame.py:3028(drop_duplicates) 10 0.000 0.000 0.132 0.013 merge.py:201(get_result) 5 0.000 0.000 0.122 0.024 frame.py:4324(join) 5 0.000 0.000 0.122 0.024 frame.py:4371(_join_compat) 1 0.000 0.000 0.111 0.111 get_prev_data_by_date.py:264(recreate_previous_cartons) 1 0.000 0.000 0.103 0.103 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date) 1 0.000 0.000 0.099 0.099 get_prev_data_by_date.py:360(recreate_previous_freight_type) 10 0.000 0.000 0.093 0.009 internals.py:4455(concatenate_block_managers) 10 0.001 0.000 0.089 0.009 internals.py:4471(<listcomp>) 100 0.001 0.000 0.085 0.001 internals.py:4559(concatenate_join_units) 205 0.003 0.000 0.068 0.000 common.py:733(take_nd) 100 0.000 0.000 0.060 0.001 internals.py:4569(<listcomp>) 100 0.001 0.000 0.060 0.001 internals.py:4814(get_reindexed_values) 1 0.000 0.000 0.056 0.056 get_prev_data_by_date.py:295(recreate_previous_appt_status) 10 0.000 0.000 0.033 0.003 merge.py:322(_get_join_info) 52 0.031 0.001 0.031 0.001 {pandas.algos.take_2d_axis1_object_object} 5 0.000 0.000 0.030 0.006 base.py:2329(join) 37 0.001 0.000 0.027 0.001 internals.py:2754(apply) 6 0.000 0.000 0.024 0.004 frame.py:2763(set_index) 7 0.000 0.000 0.023 0.003 merge.py:516(_get_join_indexers) 2 0.000 0.000 0.022 0.011 base.py:2483(_join_non_unique) 7 0.000 0.000 0.021 0.003 generic.py:2950(copy) 7 0.000 0.000 0.021 0.003 internals.py:3046(copy) 84 0.000 0.000 0.020 0.000 frame.py:1969(__getitem__) 19 0.001 0.000 0.019 0.001 merge.py:687(_factorize_keys) 100 0.002 0.000 0.019 0.000 internals.py:4479(get_empty_dtype_and_na) 1 0.000 0.000 0.018 0.018 get_prev_data_by_date.py:328(recreate_previous_location_numbers) 15 0.000 0.000 0.017 0.001 ops.py:725(wrapper) 34 0.001 0.000 0.017 0.000 internals.py:3495(reindex_indexer) 83 0.004 0.000 0.016 0.000 internals.py:3211(_consolidate_inplace) 68 0.015 0.000 0.015 0.000 {method 'copy' of 'numpy.ndarray' objects} 15 0.000 0.000 0.015 0.001 frame.py:2011(_getitem_array) </code></pre> 如您所见，合并比连接快，虽然它是一个小值，但是超过4000次迭代，这个小值在几分钟内就变成了一个巨大的数字。 谢谢

0 条评论
分类：Python问答

默认排序时间排序

1 个回答

匿名 1天前

　擅长：python、mysql、java

提高Pandas合并绩效

1 个回答

相关Python问题