How do I loop through a list of times, find the skipped/missing times, and fill them in?

Posted 2024-06-16 11:00:11


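I have a list of files, one for every 12 minutes of each day. I should have 120 files, but I only have 114. That means six 12-minute slots are missing their data/files. I built an array with the datetimes on the left and my data on the right: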


"row" -|-     "Time"       -|-       "Rain_Rate"      


1:    2014/04/02 00:00:09  |           0.5
2:    2014/04/02 00:12:20  |           1.1
3:    2014/04/02 00:24:05  |           0.48
4:    2014/04/02 00:36:51  |           2.3
5:    2014/04/02 01:00:08  |           4.1
6:    2014/04/02 01:12:26  |           5.0
7:    2014/04/02 01:24:02  |           3.2

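As you can see, a time is missing between rows 4 and 5 (it should be around 00:48:00). How can I fill in that row, using the average of the rain rates on either side of it?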

Also, what do I do if two times/rows are missing back to back?


2 Answers

First, build a DatetimeIndex with a 12-minute frequency:

import datetime
import pandas as pd
import numpy as np

start = datetime.datetime(2014, 4, 2)
end = datetime.datetime(2014, 8, 1)  # whenever your time series ends
idx = pd.date_range(start, end, freq='12min')  # one slot every 12 minutes (the older '12T' alias is deprecated in recent pandas)

Next, build a DataFrame on the new index:

df = pd.DataFrame(np.nan, index=idx, columns=['dummy']) # you need to provide a column name

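I assume your data is already in a pd.Series (called s here) with a DatetimeIndex; otherwise you have to build one. Note that column assignment aligns on index labels, so the timestamps in s must fall exactly on the 12-minute grid (for example, after rounding); a label such as 00:12:20 matches no grid slot and its value would be left as NaN.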
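If the data is not yet in such a Series, one possible way to build it from the question's sample is sketched below. Flooring each timestamp to its 12-minute slot is an assumption about how the slightly-late timestamps should be mapped onto the grid (rounding would be another option), and the names `times`, `rates` and `stamps` are only illustrative.

```python
import pandas as pd

# Sample rows from the question; each timestamp is a few seconds
# past the start of its 12-minute slot.
times = [
    "2014/04/02 00:00:09", "2014/04/02 00:12:20", "2014/04/02 00:24:05",
    "2014/04/02 00:36:51", "2014/04/02 01:00:08", "2014/04/02 01:12:26",
    "2014/04/02 01:24:02",
]
rates = [0.5, 1.1, 0.48, 2.3, 4.1, 5.0, 3.2]

# Floor every timestamp to the slot it belongs to (assumption: files are
# written shortly after the slot starts), so the labels line up with
# pd.date_range(..., freq="12min").
stamps = pd.to_datetime(times).floor("12min")
s = pd.Series(rates, index=stamps, name="Rain_Rate")
print(s)
```

After this step, the 00:48 slot is simply absent from `s`, which is what lets the regular index expose it as a gap.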

Now you can use the DatetimeIndex and the full power of pandas:

df['Rain_Rate'] = s
df['Rain_Rate'] = df['Rain_Rate'].interpolate()  # standard is linear approximation

See the documentation of interpolate() for more options.
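To make the approach concrete on the question's sample, including the case of two slots missing back to back, here is a minimal sketch. It assumes the timestamps have already been snapped to the 12-minute grid, and it uses `Series.reindex` instead of a dummy DataFrame column, which is a shortcut rather than the exact steps above; `filled`, `s2` and `filled2` are illustrative names.

```python
import pandas as pd

# Question's sample, with timestamps already on the 12-minute grid.
s = pd.Series(
    [0.5, 1.1, 0.48, 2.3, 4.1, 5.0, 3.2],
    index=pd.to_datetime([
        "2014-04-02 00:00", "2014-04-02 00:12", "2014-04-02 00:24",
        "2014-04-02 00:36", "2014-04-02 01:00", "2014-04-02 01:12",
        "2014-04-02 01:24",
    ]),
    name="Rain_Rate",
)

# Full grid for the covered period; missing slots become NaN after reindex.
idx = pd.date_range("2014-04-02 00:00", "2014-04-02 01:24", freq="12min")

# One missing slot (00:48): the interpolated value is the mean of its neighbours.
filled = s.reindex(idx).interpolate(method="time")
print(filled)

# Two missing slots back to back (00:48 and 01:00): drop one more row to
# simulate it; the gap is filled with evenly spaced values.
s2 = s.drop(pd.Timestamp("2014-04-02 01:00"))
filled2 = s2.reindex(idx).interpolate(method="time")
print(filled2)
```

With a single missing slot on a regular grid, linear interpolation is the same as averaging the two neighbours, which is what the question asks for. For longer gaps, `interpolate` also accepts a `limit` argument if you prefer to leave long runs of missing slots unfilled.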

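As an alternative, you can keep the Time series as it is and add rows only at the missing positions you described, using a time difference of more than 12 minutes between consecutive rows as the condition. The trade-off is that, depending on the existing timestamps, the gap between the last generated row and the next existing value will not be exactly 12 minutes.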

import pandas as pd

df = pd.DataFrame([
["2014/04/02 00:00:09",0.5],
["2014/04/02 00:12:20",1.1],
["2014/04/02 00:24:05",0.48],
["2014/04/02 00:36:51",2.3],
["2014/04/02 01:00:08",4.1],
["2014/04/02 01:12:26",5.0],
["2014/04/02 01:24:02",3.2],
["2014/04/02 02:44:02",1.2], # added for test
["2014/04/02 03:54:02",7.72] # added for test
])

df[0] = pd.to_datetime(df[0])
print(df)

delta = df[0].diff()
diff_idx = delta.where(delta > pd.Timedelta("00:12:59"))
print(delta)
idx = df[diff_idx.notnull()].index

td = pd.Timedelta("00:12:00")

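new_rows = []  # collect generated rows here; DataFrame.append was removed in pandas 2.0

for k in idx:
    deltaT = df.loc[k, 0] - df.loc[k - 1, 0]
    num_missrows = deltaT // td
    # add one step unless the gap is an exact multiple of 12 minutes,
    # so the last generated row never lands on the next existing timestamp
    if deltaT % td != pd.Timedelta(0):
        num_missrows += 1
    new_avg = df.loc[k - 1, 1]  # previous existing Rain value

    for i in range(1, num_missrows):
        # each new value is the mean of the previous (possibly generated)
        # value and the next existing value
        avg = (df.loc[k, 1] + new_avg) / 2
        new_rows.append([df.loc[k - 1, 0] + i * td, avg])
        new_avg = avg

df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
df = df.sort_values(by=0).reset_index(drop=True)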
print(df)

Output of df:

                     0        1
0  2014-04-02 00:00:09  0.50000
1  2014-04-02 00:12:20  1.10000
2  2014-04-02 00:24:05  0.48000
3  2014-04-02 00:36:51  2.30000
4  2014-04-02 00:48:51  3.20000 # added row
5  2014-04-02 01:00:08  4.10000
6  2014-04-02 01:12:26  5.00000
7  2014-04-02 01:24:02  3.20000
8  2014-04-02 01:36:02  2.20000 # added row
9  2014-04-02 01:48:02  1.70000 # added row
10 2014-04-02 02:00:02  1.45000 # added row
11 2014-04-02 02:12:02  1.32500 # added row
12 2014-04-02 02:24:02  1.26250 # added row
13 2014-04-02 02:36:02  1.23125 # added row, Not 12min. (~8min. diff.)
14 2014-04-02 02:44:02  1.20000
15 2014-04-02 02:56:02  4.46000 # added row
16 2014-04-02 03:08:02  6.09000 # added row
17 2014-04-02 03:20:02  6.90500 # added row
18 2014-04-02 03:32:02  7.31250 # added row
19 2014-04-02 03:44:02  7.51625 # added row, Not 12min. (~10min. diff.)
20 2014-04-02 03:54:02  7.72000
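Note that the generated values above come from repeatedly averaging toward the next existing value, so in long gaps they approach that value quickly instead of changing evenly. If an even, time-weighted fill is preferred while still keeping the original timestamps, one possible variant is sketched below. It is an assumption about what is wanted, not part of the answer above; the column names `Time` and `Rain_Rate` and the variables `step`, `tolerance`, `missing` and `full` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2014/04/02 00:00:09", "2014/04/02 00:12:20", "2014/04/02 00:24:05",
        "2014/04/02 00:36:51", "2014/04/02 01:00:08", "2014/04/02 01:12:26",
        "2014/04/02 01:24:02", "2014/04/02 02:44:02", "2014/04/02 03:54:02",
    ]),
    "Rain_Rate": [0.5, 1.1, 0.48, 2.3, 4.1, 5.0, 3.2, 1.2, 7.72],
})
s = df.set_index("Time")["Rain_Rate"]

step = pd.Timedelta(minutes=12)
tolerance = pd.Timedelta(minutes=12, seconds=59)  # same threshold as above

# Collect the timestamps to insert: after each row followed by a gap,
# step forward 12 minutes at a time until the next existing row is reached.
missing = []
for prev, nxt in zip(s.index[:-1], s.index[1:]):
    if nxt - prev > tolerance:
        t = prev + step
        while t < nxt:
            missing.append(t)
            t += step

# Add the new timestamps as NaN rows, then interpolate weighted by the
# actual time distance to the neighbouring existing values.
full = s.reindex(s.index.union(pd.DatetimeIndex(missing))).interpolate(method="time")
print(full)
```

This inserts rows at the same timestamps as the loop above; only the filled values differ, because `method="time"` weights each value by its distance in time from the surrounding measurements.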
