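I am trying to optimize a PEWMA function for real-time anomaly detection on sensor data in pandas. I currently run it as a for loop with iterrows(), but I would like to use pandas functions to process larger datasets faster, for example with apply(), vectorization, or the built-in EWMA functions.
PEWMA differs from EWMA in that the beta parameter controls how much outliers are allowed to influence the moving average; setting beta to 0 gives the standard EWMA.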
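For reference, the update rules can be read off the loop in this post and written out as below. This is only a transcription of the code, not checked against the paper; here $t$ is the loop counter (the index of the previous row), $x_{t+1}$ is the new observation, $\varphi$ is the standard normal density, and $a$, $\beta$, $T$ are the parameters defined in the code.

```latex
\begin{aligned}
d_{t+1} &= x_{t+1} - \mu_t \\
p_{t+1} &= \begin{cases} \varphi\!\left(d_{t+1} / \sigma_t\right) & \text{if } \sigma_t \neq 0 \\ 0 & \text{otherwise} \end{cases} \\
\alpha_{t+1} &= \begin{cases} a\,(1 - \beta\, p_{t+1}) & \text{if } t > T \\ 1 - \dfrac{1}{t+1} & \text{otherwise} \end{cases} \\
\mu_{t+1} &= \mu_t + (1 - \alpha_{t+1})\, d_{t+1} \\
\sigma_{t+1}^2 &= \alpha_{t+1} \left( \sigma_t^2 + (1 - \alpha_{t+1})\, d_{t+1}^2 \right)
\end{aligned}
```

With $\beta = 0$ and $t > T$ the weight is the constant $a$, so the mean update becomes $\mu_{t+1} = a\,\mu_t + (1 - a)\,x_{t+1}$, which is the standard EWMA with smoothing factor $1 - a$.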
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
#plt.style.use('fivethirtyeight')
#%config InlineBackend.figure_format = 'retina'
#%matplotlib inline
from itertools import islice
from math import sqrt
from scipy.stats import norm
# generate random sensor data
ts = pd.date_range(start='1-1-2019',
                   end='1-10-2019', freq='5min')  # '5T' is a deprecated alias in recent pandas
np.random.seed(seed=1111)
data = np.random.normal(2.012547, 1.557331,size=len(ts))
df = pd.DataFrame({'timestamp': ts, 'speed': data})
df.speed = df.speed.abs()
df = df.set_index('timestamp')
time_col = 'timestamp'
value_col = 'speed'
# PEWMA parameters
T = 30 # number of points to consider in initial average
beta = 0.5 # controls how much outliers are allowed to affect the MA; set to 0 for standard EWMA
a = 0.99 # the maximum value of the EWMA a parameter, used for outliers
z = 3
# the PEWMA model
#as described in Carter, Kevin M., and William W. Streilein.
# create a DataFrame for the run time variables we'll need to calculate
pewm = pd.DataFrame(index=df.index, columns=['Mean', 'Var', 'Std'], dtype=float)
pewm.iloc[0] = [df.iloc[0][value_col], 0, 0]
t = 0
for _, row in islice(df.iterrows(), 1, None):
    prev = pewm.iloc[t]  # previous Mean, Var, Std
    diff = row[value_col] - prev.Mean  # difference from moving average
    p = norm.pdf(diff / prev.Std) if prev.Std != 0 else 0  # Prob of observing diff
    a_t = a * (1 - beta * p) if t > T else 1 - 1/(t+1)  # weight to give to this point
    incr = (1 - a_t) * diff
    # Update Mean, Var, Std with one row assignment (chained .iloc[...].Mean = ... may not write back)
    new_var = a_t * (prev.Var + diff * incr)
    pewm.iloc[t+1] = [prev.Mean + incr, new_var, sqrt(new_var)]
    t += 1
The for loop currently takes too long to run on larger datasets.
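One possible direction, offered as a sketch rather than a definitive answer: because the weight a_t depends on the previous mean and standard deviation, the recursion does not look like something a fixed-alpha `ewm()` call can express when beta is nonzero. What can be removed is the per-row DataFrame access. The version below keeps the same update rules but runs them on plain floats and writes into preallocated NumPy arrays. The function name `pewma`, the constant `INV_SQRT_2PI`, and the `pewm_fast` frame are hypothetical names introduced here, and the normal density is computed by hand instead of calling `norm.pdf`.

```python
from math import exp, pi, sqrt

import numpy as np
import pandas as pd

INV_SQRT_2PI = 1.0 / sqrt(2.0 * pi)  # normalizing constant of the standard normal pdf


def pewma(values, T=30, beta=0.5, a=0.99):
    """Run the PEWMA recursion from the question on a 1-D sequence.

    Returns three float arrays (mean, var, std) with the same length as the input.
    """
    x = np.asarray(values, dtype=float)
    n = x.size
    mean, var, std = np.empty(n), np.empty(n), np.empty(n)
    if n == 0:
        return mean, var, std

    # Running state kept in plain Python floats to avoid per-step array lookups
    m, v, s = float(x[0]), 0.0, 0.0
    mean[0], var[0], std[0] = m, v, s

    for t, x_next in enumerate(x[1:].tolist()):
        diff = x_next - m  # difference from moving average
        if s != 0:
            zscore = diff / s
            p = INV_SQRT_2PI * exp(-0.5 * zscore * zscore)  # standard normal density
        else:
            p = 0.0
        # Same weighting rule as the loop in the question
        a_t = a * (1 - beta * p) if t > T else 1 - 1 / (t + 1)
        incr = (1 - a_t) * diff
        m = m + incr
        v = a_t * (v + diff * incr)
        s = sqrt(v)
        mean[t + 1], var[t + 1], std[t + 1] = m, v, s
    return mean, var, std


# Usage on synthetic data shaped like the example in the question
ts = pd.date_range(start="2019-01-01", end="2019-01-10", freq="5min")
rng = np.random.default_rng(1111)
speed = np.abs(rng.normal(2.012547, 1.557331, size=len(ts)))
mean, var, std = pewma(speed)
pewm_fast = pd.DataFrame({"Mean": mean, "Var": var, "Std": std}, index=ts)
```

With the DataFrame from the question, the call would be `pewma(df[value_col].to_numpy(), T=T, beta=beta, a=a)`. The loop is still sequential, so this removes pandas overhead rather than vectorizing the recursion itself.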