在时间序列中合并两个列名不同的数据框
我有两个数据框,一个叫做sensor,另一个叫做train。
sensor数据框里存的是时间序列的数据,索引是ts_sensor这一列。在sensor数据框中,列的名字代表了传感器的名称,每个单元格里填的都是在ts_sensor时刻传感器测得的值。而train数据框里有一列叫id_sensor,里面填的是传感器的名称(和sensor数据框里的名称一样),还有一列叫boot_threshold,每个单元格里填的是对应传感器的阈值。
以下是关于这两个数据框的信息。
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264 entries, 0 to 263
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id_train 264 non-null object
1 asset_id 264 non-null object
2 train_type 264 non-null object
3 equipment_type 264 non-null object
4 id_sensor 264 non-null object
5 unit 262 non-null object
6 is_operating 262 non-null float64
7 will_be_available 262 non-null float64
8 pi_acquisition_date 0 non-null float64
9 pi_description 264 non-null object
10 sensor_description 264 non-null object
11 tag_id 262 non-null float64
12 priority 262 non-null float64
13 for_ml 264 non-null int64
14 onoff_parameter 264 non-null int64
15 boot_threshold 264 non-null int64
16 efficiency_index 264 non-null int64
dtypes: float64(5), int64(4), object(8)
memory usage: 35.2+ KB
sensor.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1729930 entries, 0 to 1729929
Columns: 135 entries, level_0 to Australia.Blacktip.ICSS.KPI.EE.GasTurbineEfficiency_A
dtypes: float64(132), int64(2), object(1)
memory usage: 1.7+ GB
sensor.columns
Index(['level_0', 'index', 'sensor_ts',
'Australia.Blacktip.ICSS.360140413-100.PV',
'Australia.Blacktip.ICSS.360140415-100.PV',
'Australia.Blacktip.ICSS.360140417-100.PV',
'Australia.Blacktip.ICSS.360140421-100.PV',
'Australia.Blacktip.ICSS.360140423-100.PV',
'Australia.Blacktip.ICSS.360140425-100.PV',
'Australia.Blacktip.ICSS.360140433-100.PV',
...
'Australia.Blacktip.ICSS.3601PDI144.DACA.PV',
'Australia.Blacktip.ICSS.3601PIT111.PV',
'Australia.Blacktip.ICSS.3601PIT141.PV',
'Australia.Blacktip.ICSS.3601TIT042.PV',
'Australia.Blacktip.ICSS.3601TIT052.PV',
'Australia.Blacktip.ICSS.3601TIT111.PV',
'Australia.Blacktip.ICSS.3601TIT122.PV',
'Australia.Blacktip.ICSS.3601TIT141.PV',
'Australia.Blacktip.ICSS.3601TIT152.PV',
'Australia.Blacktip.ICSS.KPI.EE.GasTurbineEfficiency_A'],
dtype='object', length=135)
我想得到一个新的数据框,这个数据框要保留时间序列,并且只包含那些当前ts_sensor值高于对应boot_threshold值的传感器。
谢谢大家!
1 个回答
0
你是在找这个吗?
import pandas as pd
import numpy as np
# Simulated sensor data (a small part of the actual data for demonstration)
sensor_data = {
'sensor_ts': pd.date_range(start='2023-01-01', periods=4, freq='D'),
'Australia.Blacktip.ICSS.360140413-100.PV': [100, 200, 300, 400],
'Australia.Blacktip.ICSS.360140415-100.PV': [100, 220, 330, 440],
}
sensor = pd.DataFrame(sensor_data)
# Simulated train data
train_data = {
'id_sensor': ['Australia.Blacktip.ICSS.360140413-100.PV', 'Australia.Blacktip.ICSS.360140415-100.PV'],
'boot_threshold': [250, 300],
}
train = pd.DataFrame(train_data)
# Reshape sensor DataFrame from wide to long format
sensor_long = pd.melt(sensor, id_vars=['sensor_ts'], var_name='id_sensor', value_name='value')
# Merge the reshaped sensor data with the train data
merged_data = pd.merge(sensor_long, train, on='id_sensor')
# Filter to include only rows where sensor value is greater than the boot_threshold
filtered_data = merged_data[merged_data['value'] > merged_data['boot_threshold']]
# Displaying the filtered data for verification
filtered_data
输出结果:
sensor_ts id_sensor value boot_threshold
2 2023-01-03 Australia.Blacktip.ICSS.360140413-100.PV 300 250
3 2023-01-04 Australia.Blacktip.ICSS.360140413-100.PV 400 250
6 2023-01-03 Australia.Blacktip.ICSS.360140415-100.PV 330 300
7 2023-01-04 Australia.Blacktip.ICSS.360140415-100.PV 440 300