How do I remove sharp jumps in data?

2 Answers

User

Answer 1 · edited 2024-05-28 18:33:42

Try the code below (I used a tangent function to generate the data). I used Mad Physicist's second-order difference idea from the comments.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame()
df[0] = np.arange(0,10,0.005)
df[1] = np.tan(df[0])

# Half the absolute second difference, |2*x[i] - x[i-1] - x[i+1]| / 2:
# diff() gives x[i] - x[i-1] and diff(periods=-1) gives x[i] - x[i+1]
df[2] = 0.5*(df[1].diff()+df[1].diff(periods=-1)).abs()

df.loc[df[2] < .05, 1].plot() # keep only points with a low rate of change
df[1].plot()                  # plot original data

plt.show()

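Below is a zoomed view of the output, showing what was filtered. Matplotlib draws a straight line from the start to the end of the removed data.

I believe the answer to your first question is the .loc selection above.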
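As a minimal sketch of that selection (the column names `x`, `y`, `d` and the 0.05 threshold are illustrative, carried over from the example above): a boolean mask passed to `.loc` drops the rows, while `Series.where` keeps the rows but sets them to NaN, which makes Matplotlib leave a gap instead of drawing a connecting line.

```python
import numpy as np
import pandas as pd

x = np.arange(0, 10, 0.005)
df = pd.DataFrame({"x": x, "y": np.tan(x)})

# Same difference measure as in the code above
df["d"] = 0.5 * (df["y"].diff() + df["y"].diff(periods=-1)).abs()

keep = df["d"] < 0.05             # True where the data changes slowly

dropped = df.loc[keep, "y"]       # rows with large jumps are removed
masked = df["y"].where(keep)      # same length, jumps replaced by NaN

print(len(df), len(dropped), int(masked.isna().sum()))
```

Plotting `masked` rather than `dropped` should show the curve broken at each jump. Note that the first and last rows are also excluded, because their difference is NaN.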

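The second question will take some experimentation with your dataset. The code above only filters out the high-derivative data. You will also need your own threshold selection to remove zeros and the like. You can experiment with where to apply the derivative selection. You can also plot a histogram of the derivative to get a hint about which threshold to choose.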
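One way to get that hint without reading it off a plot is to look at the distribution of the difference values numerically. The sketch below is only an illustration: it uses a plain central difference, the 95% quantile is an arbitrary starting point rather than a recommendation, and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

x = np.arange(0, 10, 0.005)
y = pd.Series(np.tan(x))

# Absolute central difference (not divided by the step size)
d = (0.5 * (y.shift(-1) - y.shift(1))).abs().dropna()

# Histogram on a log scale, since the values span several orders of magnitude
counts, edges = np.histogram(np.log10(d), bins=20)
for c, lo, hi in zip(counts, edges[:-1], edges[1:]):
    print(f"{10**lo:.3g} to {10**hi:.3g}: {c}")

# Candidate threshold: keep the slowest-changing 95% of points
threshold = d.quantile(0.95)
print("candidate threshold:", threshold)
```

If the histogram shows a clear gap between the bulk of the values and a small tail of very large ones, a threshold inside that gap is a natural choice.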

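Also, higher-order difference equations might help with smoothing. That should help remove artifacts without having to trim around the cut.

Edit:

A fourth-order finite difference can be applied using the following: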

[code block not preserved in the source]

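There is reason to think this may help. For even higher orders, the coefficients above can be computed or derived with the following tool: Finite Difference Coefficients Calculator

Note: the second- and fourth-order central difference expressions above are not true first derivatives. You must divide by the interval length (0.005 in this case) to get the actual derivative.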
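To make the note concrete, here is a sketch of the standard second- and fourth-order central difference stencils for the first derivative, each divided by the step `h`. The coefficients are the textbook ones; the helper names `central_diff_2` and `central_diff_4` and the use of `shift` are illustrative choices for this example, not the code from the answer.

```python
import numpy as np
import pandas as pd

def central_diff_2(y: pd.Series, h: float) -> pd.Series:
    """Second-order accurate: (y[i+1] - y[i-1]) / (2h)."""
    return (y.shift(-1) - y.shift(1)) / (2 * h)

def central_diff_4(y: pd.Series, h: float) -> pd.Series:
    """Fourth-order accurate: (-y[i+2] + 8y[i+1] - 8y[i-1] + y[i-2]) / (12h)."""
    return (-y.shift(-2) + 8 * y.shift(-1) - 8 * y.shift(1) + y.shift(2)) / (12 * h)

h = 0.005
x = pd.Series(np.arange(0, 10, h))
y = np.sin(x)                      # smooth test function with a known derivative

err2 = (central_diff_2(y, h) - np.cos(x)).abs().max()
err4 = (central_diff_4(y, h) - np.cos(x)).abs().max()
print(err2, err4)                  # the fourth-order error should be much smaller
```

The edge rows come out as NaN (one row at each end for the second-order stencil, two for the fourth-order one) because the stencil reaches past the data there.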

User

Answer 2 · edited 2024-05-28 18:33:42

Here is a suggestion that targets your question about

[...]an approach where I use the first order differential of the temp and then use another set of thresholds to get rid of the data I'm not interested in.
[..]I don't know how to now use this index list to delete the non-skin data in df. How is best to do this?

using stats.zscore() and {a2}

As it stands, there is still a minor issue with it regarding your concern about being

[...]left with some residual artefacts from the data jumps near the edges[...]

But we'll get to that later.

First, here is a snippet that produces a dataframe sharing some of the challenges of your dataset:

# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

np.random.seed(22)

# A function for noisy data with a trend element
def sample():

    base = 100
    nsample = 50
    sigma = 10

    # Basic df with trend and sinus seasonality 
    trend1 = np.linspace(0,1, nsample)
    y1 = np.sin(trend1)
    dates = pd.date_range('2016-01-01', periods=nsample).tolist()
    df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)

    # Gaussian Noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + (np.sin(trend1))
    df['trend2'] = 1/(np.cos(trend1)/1.05)
    df['y4'] = df['y3'] * df['trend2']

    df=df['y4'].to_frame()
    df.columns = ['Temp']

    # Insert missing values (positional rows 20..30 of the single 'Temp' column)
    df.iloc[20:31, 0] = np.nan

    # Insert spikes on either side of the gap
    df.iloc[19, 0] = df.iloc[39, 0] / 4000
    df.iloc[31, 0] = df.iloc[15, 0] / 4000

    return df

# Dataframe with random data
df_raw = sample()
df_raw.plot()

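As you can see, there are two distinct spikes with missing numbers between them. If you prefer to isolate values where the differences are large, it is the missing numbers that cause the problem. The first spike is not a problem, since you will find the difference between a very small number and a number that is more similar to the rest of the data:

But for the second spike, you get the (nonexistent) difference between a very small number and a nonexistent number, so the extreme data point you end up removing is the one carrying the difference between your outlier and the next observation: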
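The effect is easy to reproduce on a toy series (the values below are made up for illustration). `diff()` returns NaN whenever either neighbour is missing, so a spike that directly follows a gap gets no difference of its own, and the large difference lands on the next valid observation instead:

```python
import numpy as np
import pandas as pd

s = pd.Series([100.0, 101.0, 0.02, np.nan, np.nan, 0.03, 102.0, 103.0])
d = s.diff()
print(pd.DataFrame({"value": s, "diff": d}))

# Index 2 is a spike before the gap: its diff (0.02 - 101) is large, so it is flagged.
# Index 5 is a spike after the gap: its diff is NaN, so it is NOT flagged.
# Index 6 is a normal value, but its diff (102 - 0.03) is large, so it IS flagged.
flagged = d.abs() > 50
print(flagged[flagged].index.tolist())   # → [2, 6]
```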

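For a single observation this is not a huge problem. You can just fill it right back in. But for larger datasets that would not be a very viable solution. Anyway, if you can manage without that particular value, the code below should solve your problem. You will have a similar problem with your very first observation, but I think it is far more trivial to decide whether or not to keep that one value.

The steps: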

[code block not preserved in the source]

Here is the whole thing for an easy copy and paste:

# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

np.random.seed(22)

# A function for noisy data with a trend element
def sample():

    base = 100
    nsample = 50
    sigma = 10

    # Basic df with trend and sinus seasonality 
    trend1 = np.linspace(0,1, nsample)
    y1 = np.sin(trend1)
    dates = pd.date_range('2016-01-01', periods=nsample).tolist()
    df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)

    # Gaussian Noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + (np.sin(trend1))
    df['trend2'] = 1/(np.cos(trend1)/1.05)
    df['y4'] = df['y3'] * df['trend2']

    df=df['y4'].to_frame()
    df.columns = ['Temp']

    # Insert missing values (positional rows 20..30 of the single 'Temp' column)
    df.iloc[20:31, 0] = np.nan

    # Insert spikes on either side of the gap
    df.iloc[19, 0] = df.iloc[39, 0] / 4000
    df.iloc[31, 0] = df.iloc[15, 0] / 4000

    return df

# A function for removing outliers
def noSpikes(df, level, keepFirst):

    # 1. Get some info about the original data:
    firstVal = df[:1]
    colName = df.columns

    # 2. Take the first difference
    df_diff = df.diff()

    # 3. Remove missing values
    df_clean = df_diff.dropna()

    # 4. Select a level for a Z-score to identify and remove outliers
    df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
    ix_keep = df_Z.index

    # 5. Subset the input dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]

    # 6.
    # df_keep will be missing some indexes.
    # Do the following if you'd like to keep those indexes
    # and, for example, fill missing values with the previous values
    df_out = pd.merge(df_keep, df, how='outer', left_index=True, right_index=True)

    # 7. Keep only the first column (the filtered values)
    df_out = df_out.iloc[:, 0].to_frame()

    # 8. Fill missing values with the previous value
    df_complete = df_out.ffill()

    # 9. Reset column names
    df_complete.columns = colName

    # Keep the first value
    if keepFirst:
        df_complete.iloc[0] = firstVal.iloc[0]

    return df_complete

# Dataframe with random data
df_raw = sample()
df_raw.plot()

# Remove outliers
df_cleaned = noSpikes(df=df_raw, level = 3, keepFirst = True)

df_cleaned.plot()
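For a single column, steps 5 to 8 can also be written more compactly with `reindex` and `ffill` instead of a merge. This is an alternative sketch, not the function above: `no_spikes_series` is a hypothetical name, the test data is made up, and the z-score is computed by hand (population standard deviation, which is the default definition in `stats.zscore`) so that the example runs without SciPy.

```python
import numpy as np
import pandas as pd

def no_spikes_series(s: pd.Series, level: float = 3.0) -> pd.Series:
    """Forward-fill points whose first difference is a z-score outlier."""
    d = s.diff().dropna()
    z = (d - d.mean()) / d.std(ddof=0)       # z-score of the first differences
    keep = z.index[z.abs() < level]          # labels whose difference looks normal
    out = s.loc[keep].reindex(s.index)       # dropped labels come back as NaN
    out.iloc[0] = s.iloc[0]                  # the first value has no difference; keep it
    return out.ffill()                       # NaN takes the previous kept value

rng = np.random.default_rng(0)
s = pd.Series(100 + rng.normal(size=50),
              index=pd.date_range("2016-01-01", periods=50))
s.iloc[25] = 0.02                            # one artificial spike

cleaned = no_spikes_series(s)
print(cleaned.iloc[24:28])
```

As with the function above, the observation right after the spike is replaced as well, because the jump back to normal values is itself a large difference.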
