求最大折扣长度

In [23]: df Out[23]: 2016-07-01 05:35:00 60.466667 2016-07-01 05:40:00 NaN 2016-07-01 05:45:00 NaN 2016-07-01 05:50:00 NaN 2016-07-01 05:55:00 NaN 2016-07-01 06:00:00 NaN 2016-07-01 06:05:00 NaN 2016-07-01 06:10:00 NaN 2016-07-01 06:15:00 NaN 2016-07-01 06:20:00 NaN 2016-07-01 06:25:00 NaN 2016-07-01 06:30:00 NaN 2016-07-01 06:35:00 NaN 2016-07-01 06:40:00 NaN 2016-07-01 06:45:00 NaN 2016-07-01 06:50:00 NaN 2016-07-01 06:55:00 NaN 2016-07-01 07:00:00 NaN 2016-07-01 07:05:00 NaN 2016-07-01 07:10:00 NaN 2016-07-01 07:15:00 NaN 2016-07-01 07:20:00 NaN 2016-07-01 07:25:00 NaN 2016-07-01 07:30:00 NaN 2016-07-01 07:35:00 NaN 2016-07-01 07:40:00 NaN 2016-07-01 07:45:00 63.500000 2016-07-01 07:50:00 67.293333 2016-07-01 07:55:00 67.633333 2016-07-01 08:00:00 68.306667 ... 2016-07-01 11:20:00 NaN 2016-07-01 11:25:00 NaN 2016-07-01 11:30:00 62.000000 2016-07-01 11:35:00 69.513333 2016-07-01 11:40:00 64.931298 2016-07-01 11:45:00 51.980000 2016-07-01 11:50:00 55.253333 2016-07-01 11:55:00 51.273333 2016-07-01 12:00:00 52.080000 2016-07-01 12:05:00 54.580000 2016-07-01 12:10:00 55.306667 2016-07-01 12:15:00 55.200000 2016-07-01 12:20:00 57.140000 2016-07-01 12:25:00 57.020000 2016-07-01 12:30:00 57.526667 2016-07-01 12:35:00 57.880000 2016-07-01 12:40:00 67.286667 2016-07-01 12:45:00 58.153333 2016-07-01 12:50:00 57.460000 2016-07-01 12:55:00 54.413333 2016-07-01 13:00:00 55.526667 2016-07-01 13:05:00 56.120000 2016-07-01 13:10:00 55.620000 2016-07-01 13:15:00 56.420000 2016-07-01 13:20:00 51.893333 2016-07-01 13:25:00 74.451613 2016-07-01 13:30:00 54.898551 2016-07-01 13:35:00 NaN 2016-07-01 13:40:00 63.355140 2016-07-01 13:45:00 61.000000 Freq: 5T, dtype: float64

2条回答

网友

1楼 · 编辑于 2024-05-29 03:45:20

只要您有一个系列或单列数据帧，这就应该可以工作。你知道吗

>>>pd.Series(df.isnull().index).diff()

可以通过以下方式进行改进以获得有用的输出：

MIN_GAP_TIMEDELTA = Timedelta(minutes=30)
discontinuities = pd.Series(df.isnull().index).diff()
discontinuities.sort(ascending=False)
discontinuities[discontinuities > MIN_GAP_TIMEDELTA].size

网友

2楼 · 编辑于 2024-05-29 03:45:20

不像基于pandas的解决方案那样优雅或简洁，但是考虑到性能，可以考虑使用NumPy数组和函数。所以，为了解决这种情况，假设日期时间有一个规则的频率，这里有一个基于NumPy的方法来获得不连续长度，最大长度和阈值计数-

# Get indices of start and stop indices of discontinuities signified by NaNs
idx = np.where(np.diff(np.hstack(([False],np.isnan(df[0]),[False]))))[0]

# Do differentiation on those indices which would give us the length of 
# intervals of discontinuities. These could be used in various ways.
discontinuity_lens = np.diff(idx.reshape(-1,2),axis=1)

# Max discontinuity length
discontinuity_maxlen = discontinuity_lens.max()

# Count of discontinuities that are greater than a threshold of 30 mins as
# listed with threshold parameter : MIN_GAP_TIMEDELTA = Timedelta(minutes=30)
# (in terms of steps that would be 6 because freq of input dataframe is 5 mins)
thresholded_count = (discontinuity_lens>=6).sum()

请注意，这主要是基于另一个^{}

运行时测试

我将对^{}进行计时，并将本文前面发布的基于NumPy的方法放在一个足够大的数据帧上，该数据帧中填充了随机元素并随机放置了50%的nan。你知道吗

函数定义：

def thresholdedcount_pandas(df):
    MIN_GAP_TIMEDELTA = pd.Timedelta(minutes=30)
    discontinuities = df.dropna().reset_index()['index'].diff()
    return (discontinuities > MIN_GAP_TIMEDELTA).sum()

def thresholdedcount_numpy(df):
    idx = np.where(np.diff(np.hstack(([False],np.isnan(df[0]),[False]))))[0]
    nan_interval_lens = np.diff(idx.reshape(-1,2),axis=1)
    return (nan_interval_lens>=6).sum()

时间安排：

In [325]: # Random dataframe with 5 min interval data and filled with 50% NaNs
     ...: rng = pd.date_range('1/1/2011', periods=10000, freq='5Min')
     ...: df = pd.DataFrame(np.random.randn(len(rng)), index=rng)
     ...: df[0][np.random.randint(0,df.shape[0],(int(df.shape[0]/2)))] = np.nan
     ...: 

In [326]: np.allclose(thresholdedcount_pandas(df),thresholdedcount_numpy(df))
Out[326]: True

In [327]: %timeit thresholdedcount_pandas(df)
100 loops, best of 3: 3 ms per loop

In [328]: %timeit thresholdedcount_numpy(df)
1000 loops, best of 3: 318 µs per loop

相关问题更多 >

编程相关推荐

热门问题

热门文章