Calculating the average of values in a column based on time

Asked 2025-04-14 17:25

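I need to compute the average of the numbers in a single column every 5 minutes. The data may arrive every 30 seconds or every minute. I would like to use the timestamps as the reference for computing this average.

I tried the approach below, but I have to add the timestamps back afterwards, and it only works when the data arrives at 1-minute intervals; it does not compute the average based on time.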

import pandas as pd

Planilha = pd.DataFrame({
    'Data/hora': ['01/02/2024 05:01','01/02/2024 05:02','01/02/2024 05:03','01/02/2024 05:04','01/02/2024 05:05','01/02/2024 05:06','01/02/2024 05:07','01/02/2024 05:08','01/02/2024 05:09','01/02/2024 05:10','01/02/2024 05:11','01/02/2024 05:12','01/02/2024 05:13','01/02/2024 05:14','01/02/2024 05:15'],
    'Valores_ok' : [21.48544006,32.41119499,44.18326492,59.37920151,76.55416718,93.16954193,121.0470154,164.0023529,207.9371277,198.1840485,150.4580994,144.5747345,155.5020691,155.5775085,160.8874695],
})

Irrad = Planilha.iloc[:,Planilha.columns.str.contains('Valores')] #Filtering the necessary data

# Start from an empty frame that has the same columns as Irrad
IrradM = pd.DataFrame(columns=Irrad.columns)

for i in range(0, len(Irrad), 5):
    Irrad5m = Irrad.iloc[i:i+5].mean(numeric_only=True)
    Irrad5m = pd.DataFrame(Irrad5m).T
    IrradM = pd.concat([IrradM, Irrad5m], ignore_index=True)

Another problem is the delay caused by the large volume of data. I imagine there is a simpler way to do this.

Input data

       Data/hora  Valores_ok
01/02/2024 05:01   21.485440
01/02/2024 05:02   32.411195
01/02/2024 05:03   44.183265
01/02/2024 05:04   59.379202
01/02/2024 05:05   76.554167
01/02/2024 05:06   93.169542
01/02/2024 05:07  121.047015
01/02/2024 05:08  164.002353
01/02/2024 05:09  207.937128
01/02/2024 05:10  198.184048
01/02/2024 05:11  150.458099
01/02/2024 05:12  144.574735
01/02/2024 05:13  155.502069
01/02/2024 05:14  155.577508
01/02/2024 05:15  160.887470

Expected output

       Data/hora  Valores_ok
01/02/2024 05:05   46.802654
01/02/2024 05:10  156.868017
01/02/2024 05:15  153.399976

1 Answer

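The example below shows how to use the resample method to compute an average for every 5 minutes. I have also included a comparison with a rolling-window average, so you can see a 5-minute average at every row. Some plots at the end make the results easier to inspect.

The code starts by preprocessing the DataFrame:

Load the data, reading decimal commas as decimal points
Convert the date column to a pandas-compatible datetime format
Set the date column as the index (my preference)

Finally, the data is resampled with .resample('5min'). I configured it so that the results match your averages.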
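As a quick check of that configuration, here is a minimal sketch that rebuilds the question's sample inline rather than reading a CSV. It assumes the timestamps are day-first (1 February 2024) and one minute apart; the names `values`, `series` and `means_5min` are only illustrative.

```python
import pandas as pd

# Sample from the question: one reading per minute, 05:01 to 05:15.
values = [21.48544006, 32.41119499, 44.18326492, 59.37920151, 76.55416718,
          93.16954193, 121.0470154, 164.0023529, 207.9371277, 198.1840485,
          150.4580994, 144.5747345, 155.5020691, 155.5775085, 160.8874695]

# Assumption: '01/02/2024' is day/month/year, i.e. 1 February 2024.
index = pd.date_range("2024-02-01 05:01", periods=len(values), freq="min")
series = pd.Series(values, index=index, name="Valores_ok")

# closed='right' puts the 05:05 reading in the bin (05:00, 05:05];
# label='right' stamps each bin with its right edge (05:05, 05:10, 05:15).
means_5min = series.resample("5min", closed="right", label="right").mean()
print(means_5min)
```

With these settings the three bin means should line up with the expected output in the question (about 46.80, 156.87 and 153.40). Leaving `closed` and `label` at their defaults would instead group 05:00 to 05:04 together and label the bin 05:00.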

Load and prepare the data:

import pandas as pd

#Load data
df = pd.read_csv('sample_data.csv', decimal=',')
df = df[['Data /Hora', 'Valores ']]

#Rename columns
df = df.rename(columns={'Data /Hora': 'data', 'Valores ': 'valores'})

#Convert column to DateTime format
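# I am assuming the format is: day/month/two-digit year, 12-hour clock:minute.
# Adjust the format string if your timestamps differ (the sample in the
# question shows a four-digit year, which would need %Y instead of %y).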
df['data'] = pd.to_datetime(df['data'], format='%d/%m/%y %I:%M')

#Set date as index
df = df.set_index('data')
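The format string has to match the timestamps exactly, so it is worth testing it on a couple of values first. The sketch below assumes strings shaped like the ones printed in the question (day/month/four-digit year, 24-hour clock); the second value and the names `raw` and `parsed` are made up for illustration.

```python
import pandas as pd

# First value is from the question; the second is a hypothetical afternoon reading.
raw = pd.Series(["01/02/2024 05:01", "01/02/2024 13:30"])

# %Y = four-digit year, %H = 24-hour clock.
# %y would expect a two-digit year and %I a 12-hour clock.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")
print(parsed)
```

If the real file uses a 12-hour clock, the format would also need an AM/PM marker (`%p`) to be unambiguous.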

Resample to 5-minute boundaries:

#Create resampled column
df['valores_resampled_5min'] = df['valores'].resample('5min', label='right', closed='right').mean()

#Optional
# Using a rolling window is a different way of averaging values.
# It will look back 5min at each row, and average the values
df['valores_rolling_5min'] = df['valores'].rolling('5min').mean()
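Since the question mentions that readings may arrive every 30 seconds or every minute, here is a small sketch of how both approaches behave on unevenly spaced data. The timestamps and values are invented for illustration, as are the names `readings`, `binned` and `rolling`.

```python
import pandas as pd

# Hypothetical readings: every 30 s at first, then every 1 or 2 minutes.
times = pd.to_datetime([
    "2024-02-01 05:00:30", "2024-02-01 05:01:00", "2024-02-01 05:01:30",
    "2024-02-01 05:02:00", "2024-02-01 05:03:00", "2024-02-01 05:04:00",
    "2024-02-01 05:05:00", "2024-02-01 05:06:00", "2024-02-01 05:08:00",
    "2024-02-01 05:10:00",
])
readings = pd.Series(
    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0],
    index=times,
)

# One mean per 5-minute bin, regardless of how many rows fall inside it.
binned = readings.resample("5min", closed="right", label="right").mean()

# One mean per row, taken over the 5 minutes ending at that row.
rolling = readings.rolling("5min").mean()

print(binned)
print(rolling)
```

Both are driven by the timestamps rather than by row counts, so seven rows land in the first bin and three in the second. The rolling mean keeps one value per input row, and at rows that fall exactly on a bin edge (05:05 and 05:10 here) it covers the same window as the resampled bin.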

View the results:

from matplotlib import pyplot as plt
from matplotlib import dates as mdates

f, ax = plt.subplots(figsize=(8, 3))

ax.plot(df.index, df['valores'], marker='o', linewidth=2, color='lightgray', label='valores')
ax.scatter(df.index, df['valores_resampled_5min'], marker='s', color='crimson', label='resampled 5min')
ax.step(df.index, df['valores_resampled_5min'].bfill(), linestyle=':', linewidth=1.1, color='crimson')
ax.plot(df.index, df['valores_rolling_5min'], linewidth=3, color='tab:purple', label='rolling 5min', zorder=0)

ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator=None))
ax.xaxis.set_minor_locator(mdates.MinuteLocator())
ax.grid(axis='x', which='both', linestyle=':', color='gainsboro')
ax.set(xlabel='time', ylabel='valores')
ax.spines[['right', 'top']].set_visible(False)
ax.spines['left'].set_bounds(50, 200)
ax.spines['bottom'].set_bounds(mdates.date2num(df.index[0]), mdates.date2num(df.index[-1]))
f.legend()

Answered 2025-04-14 by Python大师
