Which function can I use to plot the linear correlation between data from 3 different databases?

Posted 2024-06-09 22:42:45


I need to correlate three databases: telemetry3h, telemetry24h and error_count.

I don't know how to connect pandas' df.corr() function to this problem, and I can't write this part of the code.

I need to get this result, but I can't find a function that works. Does anyone have any suggestions?

At the end are the two rows of data printed by the functions that are already in the code.


import pandas as pd


telemetry = pd.read_csv('D:/IC/teste_disseracao/banco_de_dados_PdM/PdM_telemetry.csv')

telemetry['datetime'] = pd.to_datetime(telemetry['datetime'], format="%Y-%m-%d %H:%M:%S")

#telemetry values at 3-hour intervals (mean)
temp = []
fields = ['volt', 'rotate', 'pressure', 'vibration']
for col in fields:
    temp.append(pd.pivot_table(telemetry,
                               index='datetime',
                               columns='machineID',
                               values=col).rolling(window=3).mean().resample('3H',
                                                                              closed='left',
                                                                              label='right').mean().unstack())
telemetrynew_media = pd.concat(temp, axis=1)
telemetrynew_media.columns = [i + 'mean_3h' for i in fields]
telemetrynew_media.reset_index(inplace=True)


#telemetry values at 3-hour intervals (standard deviation)
temp = []
fields = ['volt', 'rotate', 'pressure', 'vibration']
for col in fields:
    temp.append(pd.pivot_table(telemetry,
                               index='datetime',
                               columns='machineID',
                               values=col).rolling(window=3).std().resample('3H',
                                                                             closed='left',
                                                                             label='right').std().unstack())
telemetrynew_dp = pd.concat(temp, axis=1)
telemetrynew_dp.columns = [i + 'sd_3h' for i in fields]
telemetrynew_dp.reset_index(inplace=True)


telemetry3h = telemetrynew_media


#telemetry values at 24-hour intervals (mean)
temp = []
fields = ['volt', 'rotate', 'pressure', 'vibration']
for col in fields:
    temp.append(pd.pivot_table(telemetry,
                               index='datetime',
                               columns='machineID',
                               values=col).rolling(window=24).mean().resample('3H',
                                                                              closed='left',
                                                                              label='right').mean().unstack())
telemetrynew_media = pd.concat(temp, axis=1)
telemetrynew_media.columns = [i + 'mean_24h' for i in fields]
telemetrynew_media.reset_index(inplace=True)
telemetrynew_media = telemetrynew_media.loc[~telemetrynew_media['voltmean_24h'].isnull()]

#telemetry values at 24-hour intervals (standard deviation)
temp = []
fields = ['volt', 'rotate', 'pressure', 'vibration']
for col in fields:
    temp.append(pd.pivot_table(telemetry,
                               index='datetime',
                               columns='machineID',
                               values=col).rolling(window=24).std().resample('3H',
                                                                             closed='left',
                                                                             label='right').std().unstack())
telemetrynew_dp = pd.concat(temp, axis=1)
telemetrynew_dp.columns = [i + 'sd_24h' for i in fields]
telemetrynew_dp.reset_index(inplace=True)
telemetrynew_dp = telemetrynew_dp.loc[~telemetrynew_dp['voltsd_24h'].isnull()]

telemetry24h = telemetrynew_media

errors = pd.read_csv('D:/IC/teste_disseracao/banco_de_dados_PdM/PdM_errors.csv')

errors['datetime'] = pd.to_datetime(errors['datetime'], format="%Y-%m-%d %H:%M:%S")

#one column for each type of error
error_count = pd.get_dummies(errors)
error_count.columns = ['datetime', 'machineID', 'errorID_error1', 'errorID_error2', 'errorID_error3', 'errorID_error4', 'errorID_error5']

#combine errors for a given machine in a given hour
error_count = error_count.groupby(['machineID', 'datetime']).sum().reset_index()

machineID   datetime              voltmean_24h        rotatemean_24h      pressuremean_24h    vibrationmean_24h
1           2015-01-02 06:00:00   169.73380889577277  445.17986460181123  96.79711296200088   40.38515998915177
1           2015-01-02 09:00:00   170.7007271750606   445.2941115744456   97.23217240183708   39.64764312098307


machineID   datetime              voltmean_3h         rotatemean_3h       pressuremean_3h     vibrationmean_3h
1           2015-09-02 00:00:00   165.24762959802376  445.7493967947973   90.03620169154418   63.43759320078462
1           2015-09-01 21:00:00   169.3392785585131   447.60737912233566  98.0559196736471    63.0969156902752


datetime              machineID   errorID_error1  errorID_error2  errorID_error3  errorID_error4  errorID_error5
2015-01-03 07:00:00   1           1               0               0               0               0
2015-01-27 04:00:00   1           1               0               0               0               0

1 Answer

User
#1 · Posted 2024-06-09 22:42:45

I think what you are looking for is a way to join the columns and then find the correlation between each pair of columns in the dataframe, right?

I loaded your samples like this:

#creation of first dataframe     
values_one = [[1 , '2015-01-02 06:00:00', 169.73380889577277, 445.17986460181123, 96.79711296200088, 40.38515998915177],
   [1 , '2015-01-02 09:00:00', 170.7007271750606, 445.2941115744456, 97.23217240183708, 39.64764312098307]]
df_1 = pd.DataFrame(values_one, columns =['machineID', 'datetime', 'voltmean_24h', 'rotatemean_24h', 'pressuremean_24h', 'vibrationmean_24h'])

#creation of second dataframe
values_two = [[1 , '2015-09-02 00:00:00', 165.24762959802376, 445.7493967947973, 90.03620169154418, 63.43759320078462],
   [1 , '2015-09-01 21:00:00', 169.3392785585131, 447.60737912233566, 98.0559196736471, 63.0969156902752]]
df_2 = pd.DataFrame(values_two, columns =['machineID', 'datetime', 'voltmean_3h', 'rotatemean_3h', 'pressuremean_3h', 'vibrationmean_3h'])

#creation of third dataframe
values_three = [[1 , '2015-09-02 00:00:00', 1, 0, 0, 0],
   [1 , '2015-09-01 21:00:00', 1, 0, 0, 0]] 
df_3 = pd.DataFrame(values_three, columns =['machineID', 'datetime', 'errorID_error1', 'errorID_error2', 'errorID_error3', 'errorID_error4'])

Then you can give the index a name, drop the columns you don't want to correlate, and join the remaining columns on the index, like this:

df_1.index.name = 'Index'
df_1 = df_1.drop(['machineID', 'datetime'], axis=1)
df_2.index.name = 'Index'
df_2 = df_2.drop(['machineID', 'datetime'], axis=1)
df_3.index.name = 'Index'
df_3 = df_3.drop(['machineID', 'datetime'], axis=1)

#join dataframes into one
#join first and second dataframe
df_joined = pd.merge(df_1, df_2, how ='outer', on ='Index') 
#join first & second with third dataframe
df_joined = pd.merge(df_joined, df_3, how ='outer', on ='Index') 

Now all that's left to do is call the correlation method:

df_joined.corr(method='pearson')
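
Since the title asks which function to use to plot the linear correlation, one way to visualize the matrix returned by corr() is a heatmap-style plot. This is my own suggestion, using matplotlib, which the original post never mentions:

import matplotlib.pyplot as plt

corr = df_joined.corr(method='pearson')

#draw the correlation matrix as a colored grid with the column names on both axes
fig, ax = plt.subplots(figsize=(8, 8))
cax = ax.matshow(corr, vmin=-1, vmax=1)
fig.colorbar(cax)
ax.set_xticks(range(len(corr.columns)))
ax.set_yticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticklabels(corr.columns)
plt.show()

Keep in mind that with only the two sample rows per frame every correlation comes out as ±1, so the plot only becomes meaningful on the full data.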

I don't know your data, but I see you have some timestamps. You may have to come up with a smarter indexing mechanism (or better, a key) that takes into account when your records were acquired. For example, you might want one row to correspond to just one day.
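
Building on that last point, with the real telemetry3h, telemetry24h and error_count frames from the question, the pair (machineID, datetime) is a more natural key than the row position. Here is a rough sketch of that idea, assuming the column names shown in the printed samples and that the error timestamps fall on the same 3-hour grid as the telemetry (otherwise they would need to be resampled first):

#key the three real frames on (machineID, datetime) instead of the row position
merged = telemetry3h.merge(telemetry24h, on=['machineID', 'datetime'], how='inner')
merged = merged.merge(error_count, on=['machineID', 'datetime'], how='left')

#errors are sparse, so rows without a matching error record become 0 counts
error_cols = [c for c in error_count.columns if c.startswith('errorID_')]
merged[error_cols] = merged[error_cols].fillna(0)

#correlation between every pair of numeric columns, with the key columns excluded
print(merged.drop(columns=['machineID', 'datetime']).corr(method='pearson'))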
