Pandas: filling missing data with np.where() and iterrows() is too slow. How can I improve it?

Posted 2024-06-06 12:28:32

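I am trying to fill in missing values in a pandas DataFrame based on the timestamp.

The value ranges from roughly 54.5 to 71.5. It rises while on/off is 1 and falls while on/off is 0.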

>> before (example)
day_time        value  on/off
2020-03-01 0:05 71.35    0
2020-03-01 0:06 68.425   0
2020-03-01 0:07 66.1     0
2020-03-01 0:08 64.125   0
2020-03-01 0:09 58.9     0
2020-03-01 0:10 56.075   0
2020-03-01 0:11 54.35    0
2020-03-01 0:12 57.025   1
2020-03-01 0:13 59.35    1
2020-03-01 0:14 63.2     1
2020-03-01 0:15 65.375   1
2020-03-01 0:16 66.35    1
2020-03-01 0:17 67.25    1
2020-03-01 0:18 70.05    1
2020-03-01 0:19 NaN      NaN
2020-03-01 0:20 NaN      NaN
2020-03-01 0:21 NaN      NaN
2020-03-01 0:22 NaN      NaN
2020-03-01 0:23 NaN      NaN
2020-03-01 0:24 NaN      NaN
2020-03-01 0:25 NaN      NaN
2020-03-01 0:26 NaN      NaN
2020-03-01 0:27 NaN      NaN
2020-03-01 0:28 NaN      NaN
2020-03-01 0:29 NaN      NaN
2020-03-01 0:30 NaN      NaN
2020-03-01 0:31 NaN      NaN
2020-03-01 0:32 65.475   1
2020-03-01 0:33 65.475   1
2020-03-01 0:34 65.525   0

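I want to compute values for the rows where data is missing and fill them in.

The filled values should stay within the 54.5 to 71.5 range, repeatedly rising and falling by the average per-step change observed before the gap.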

>> after (example)
day_time        value  on/off
2020-03-01 0:05 71.35    0
2020-03-01 0:06 68.425   0
2020-03-01 0:07 66.1     0
2020-03-01 0:08 64.125   0
2020-03-01 0:09 58.9     0
2020-03-01 0:10 56.075   0
2020-03-01 0:11 54.35    0
2020-03-01 0:12 57.025   1
2020-03-01 0:13 59.35    1
2020-03-01 0:14 63.2     1
2020-03-01 0:15 65.375   1
2020-03-01 0:16 66.35    1
2020-03-01 0:17 67.25    1
2020-03-01 0:18 70.05    1
2020-03-01 0:19 68.05    0
2020-03-01 0:20 67.35    0
2020-03-01 0:21 65.21    0
2020-03-01 0:22 63.275   0
2020-03-01 0:23 65.225   0
2020-03-01 0:24 63.65    0
2020-03-01 0:25 61.45    0
2020-03-01 0:26 58.45    0
2020-03-01 0:27 56.275   0
2020-03-01 0:28 55.475   0
2020-03-01 0:29 54.3     0
2020-03-01 0:30 57.7     1
2020-03-01 0:31 59.5     1
2020-03-01 0:32 61.4     1
2020-03-01 0:33 63.5     1
2020-03-01 0:34 65.525   1

I tried the following:

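import numpy as np
import pandas as pd

# result: DataFrame with columns pump, g_hight and fi_usage
# (pump and g_hight appear to correspond to on/off and value in the tables above).
# Each pass is vectorized but fills only the first NaN row of every gap, because
# shift(1) needs the previous row to be filled already, so the loop runs once per row.
for i in result.iterrows():
  # Pump state: keep the previous state until the level drops to 54 (switch on) or reaches 72 (switch off)
  result['pump'] = np.where(pd.isnull(result.pump), np.where((result.pump.shift(1) == 0) & (result.g_hight.shift(1) > 54), 0, result.pump), result.pump)
  result['pump'] = np.where(pd.isnull(result.pump), np.where((result.pump.shift(1) == 0) & (result.g_hight.shift(1) < 72), 1, result.pump), result.pump)
  result['pump'] = np.where(pd.isnull(result.pump), np.where((result.pump.shift(1) == 1) & (result.g_hight.shift(1) < 72), 1, result.pump), result.pump)
  result['pump'] = np.where(pd.isnull(result.pump), np.where((result.pump.shift(1) == 1) & (result.g_hight.shift(1) > 54), 0, result.pump), result.pump)

  # Level: previous level minus previous usage, plus 0.2503 while the pump is on
  value_ON = result['g_hight'].shift(1) - result['fi_usage'].shift(1) + 0.2503
  value_OFF = (result['g_hight'].shift(1) - result['fi_usage'].shift(1))
  result['g_hight'] = np.where((pd.isnull(result.g_hight)) & (pd.notna(result.pump)), np.where(result.pump == 0, value_OFF, value_ON), result.g_hight)
result.to_csv('result_1.csv', index = False)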

It works, but it is far too slow. How can I speed this up?


1 answer
User
#1 · posted 2024-06-06 12:28:32

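The question is somewhat vague and a full solution could take a lot of work, so I will just outline a plan below.

I would start with a simple model, such as: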

value = a sin(bx + c) + d

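Why a sine? Because it is periodic, oscillates between growth and decay, and makes a nice simple model.

I suggest estimating b first. It is fixed by the length of one full cycle: how long does it take to reach the maximum again? Call that time t. Then b = 2 * pi / t.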
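As a rough sketch of that step, the cycle length t can be taken as the mean spacing between successive local maxima of the known values. The helper name `estimate_b` is hypothetical, and the peak test below assumes reasonably smooth data; noisy readings would probably need smoothing first.

```python
import numpy as np

def estimate_b(x, y):
    """Estimate b = 2*pi/t, with t the mean spacing between local maxima.

    Hypothetical helper: x holds sample positions (e.g. minutes),
    y holds the known values with no NaNs.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # An interior point is a peak if it is above its left neighbour
    # and not below its right neighbour.
    is_peak = (y[1:-1] > y[:-2]) & (y[1:-1] >= y[2:])
    peak_x = x[1:-1][is_peak]
    if peak_x.size < 2:
        raise ValueError("need at least two peaks to estimate the period")
    t = np.mean(np.diff(peak_x))  # average time between maxima
    return 2 * np.pi / t
```

If only one full cycle is visible, as in the short sample above, t may have to be read off by eye instead.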

Once b is fixed, I recommend the following trick:

value = a sin(bx + c) + d = A sin(bx) + B cos(bx) + C
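The identity follows from the angle-addition formula for the sine. Spelled out, with the coefficient mapping in both directions:

```latex
\begin{aligned}
a\sin(bx + c) + d &= a\cos(c)\,\sin(bx) + a\sin(c)\,\cos(bx) + d \\
                  &= A\sin(bx) + B\cos(bx) + C, \\
A = a\cos c, \quad B &= a\sin c, \quad C = d, \\
a = \sqrt{A^2 + B^2}, \quad c &= \operatorname{atan2}(B, A), \quad d = C.
\end{aligned}
```

The point of the rewrite is that the model is linear in A, B and C, whereas it is nonlinear in the phase c.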

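If we know b, then we know sin(bx) and cos(bx), so all that remains is to find A, B and C. These can be found by regression on the known values. Finally, apply the formula to estimate the missing values.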
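The regression step described above can be sketched with ordinary least squares. This is a minimal illustration, not a tuned solution: the function name `fill_with_sinusoid` is hypothetical, x is assumed to be the row number (evenly spaced samples), and the demo data is synthetic.

```python
import numpy as np
import pandas as pd

def fill_with_sinusoid(values, b):
    """Fill NaNs in `values` with A*sin(b*x) + B*cos(b*x) + C,
    where A, B, C are fitted by least squares to the known points."""
    y = values.to_numpy(dtype=float)
    x = np.arange(len(y), dtype=float)  # assumption: evenly spaced rows
    # One column per unknown coefficient: A, B, C.
    design = np.column_stack([np.sin(b * x), np.cos(b * x), np.ones_like(x)])
    known = ~np.isnan(y)
    coef, *_ = np.linalg.lstsq(design[known], y[known], rcond=None)
    filled = y.copy()
    filled[~known] = design[~known] @ coef  # predict only the missing rows
    return pd.Series(filled, index=values.index, name=values.name)

# Synthetic demo: a sine around 63 with a gap of ten missing rows.
demo_x = np.arange(60, dtype=float)
demo = pd.Series(8.0 * np.sin(0.4 * demo_x) + 63.0, name="value")
demo.iloc[20:30] = np.nan
demo_filled = fill_with_sinusoid(demo, b=0.4)
```

Everything here is a single vectorized solve, so there is no row-by-row loop. If the filled values must stay inside the 54.5 to 71.5 range from the question, the result could additionally be passed through `Series.clip(54.5, 71.5)`.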
