How can I find only the true positives for a DataFrame when the ground truth is available?

image_name: 00000003_000.png
label: [[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024.0, 1024.0], [119.195767195767, 339.166137566138, 470.281481481481, 511.458201058202]], ['Cardiomegaly', 'Edema', 'Infiltration']]
Bounding_Box: True/False
Atelectasis: 0.172639399766922
Cardiomegaly: 0.064461663365364
Consolidation: 0.436323910951614
Edema: 0.152604594826698
Effusion: 0.077432356774807
Emphysema: 0.569778263568878
Fibrosis: 0.333310723304749
Hernia: 0.219542726874351
Infiltration: 0.240452200174332
Mass: 0.291741400957108
Nodule: 0.076222963631153
Pleural_Thickening: 0.294208467006683
Pneumonia: 0.281939893960953
Pneumothorax: 0.386653006076813

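import pandas as pd

df = pd.read_csv('/home/ali/Desktop/CX/sample.csv')
# Column with the highest predicted score for each image
df["best_score"] = df.drop(['file', 'set', 'label', 'bbx'], axis=1).idxmax(axis=1)
# True when the top-scoring class appears in the ground-truth label
df['evaluation'] = df.apply(lambda x: x["best_score"] in x["label"], axis=1)
df.groupby('best_score')['evaluation'].mean()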

best_score
Atelectasis           0.452465
Cardiomegaly          0.250000
Consolidation         0.123164
Edema                 0.029520
Effusion              0.555459
Emphysema             0.068618
Fibrosis              0.066116
Hernia                0.032258
Infiltration          0.400000
Mass                  0.177524
Nodule                0.604167
Pleural_Thickening    0.188482
Pneumonia             0.049133
Pneumothorax          0.108156
Name: evaluation, dtype: float64

1 Answer

User
#1 · Posted on 2024-04-28 15:53:29

Starting from your DataFrame:
>>> import pandas as pd
>>> df
               file   set                                              label    bbx  \
0  00000003_000.png  Test         [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False
1  00000003_001.png  Test         [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False
2  00000003_002.png  Test         [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False
3  00000003_003.png  Test  [[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024....  False
4  00000003_004.png  Test         [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False

   Atelectasis  Cardiomegaly  Consolidation     Edema  Effusion  Emphysema  Fibrosis  \
0     0.145712      0.028958       0.205006  0.055228  0.115680   0.376638  0.349124
1     0.132639      0.046136       0.169713  0.092743  0.285383   0.614464  0.311035
2     0.233026      0.042541       0.227911  0.047988  0.116835   0.595102  0.330304
3     0.298693      0.022646       0.237977  0.035348  0.143645   0.487804  0.384509
4     0.522152      0.052897       0.237475  0.082139  0.200029   0.473421  0.377468

     Hernia  Infiltration      Mass    Nodule  Pleural_Thickening  Pneumonia  Pneumothorax
0  0.357694      0.122496  0.202218  0.075018            0.118994   0.195345      0.215577
1  0.344040      0.117032  0.447748  0.152327            0.094364   0.174125      0.316022
2  0.367272      0.117985  0.298624  0.109354            0.133473   0.185444      0.379627
3  0.379062      0.083205  0.625744  0.102377            0.207353   0.184517      0.354402
4  0.336104      0.106339  0.488078  0.088047            0.146686   0.200919      0.313684
First, we eval the label column to extract the classes we expect to be predicted:
>>> df['label'] = df['label'].apply(eval)
>>> df['class'] = df.label.apply(lambda x: x[1])
>>> df['class']
0   [Hernia]
1   [Hernia]
2   [Hernia]
3   [Hernia, Infiltration]
4   [Hernia]
5   [Hernia]
6   [Hernia]
7   [Hernia]
8   [No Finding]
9   [Emphysema, Pneumothorax]
10  [Emphysema, Pneumothorax]
11  [Pleural_Thickening]
12  [Effusion, Emphysema, Infiltration, Pneumothorax]
13  [Emphysema, Infiltration, Pleural_Thickening, ...
14  [Effusion, Infiltration]
15  [Infiltration]
Name: class, dtype: object
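Because `eval` executes whatever expression a CSV cell happens to contain, `ast.literal_eval` from the standard library may be a safer choice for this parsing step. The sketch below is only an illustration: it parses a single label string copied from row 0 above, and the names `raw_label`, `boxes`, and `class_names` are made up for the example.

```python
import ast

# One raw `label` cell, copied from the first row of the DataFrame above.
raw_label = "[[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]"

# literal_eval only accepts Python literals (lists, strings, numbers, ...),
# so arbitrary code stored in the CSV cannot be executed.
boxes, class_names = ast.literal_eval(raw_label)

print(boxes)        # list of bounding boxes
print(class_names)  # list of expected class names
```

On the whole column, the same idea would read `df['label'].apply(ast.literal_eval)` in place of `df['label'].apply(eval)`.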
Then, we explode the class column to get one expected class per row, as follows:
>>> df = df.explode('class')
>>> df = df.reset_index(drop=True)
>>> df['class']
0                 Hernia
1                 Hernia
2                 Hernia
3                 Hernia
4           Infiltration
5                 Hernia
6                 Hernia
7                 Hernia
8                 Hernia
9             No Finding
10             Emphysema
11          Pneumothorax
12             Emphysema
13          Pneumothorax
14    Pleural_Thickening
15              Effusion
16             Emphysema
17          Infiltration
18          Pneumothorax
19             Emphysema
20          Infiltration
21    Pleural_Thickening
22          Pneumothorax
23              Effusion
24          Infiltration
25          Infiltration
Name: class, dtype: object
Then, we convert the data to dummies format:
>>> classes = ['Atelectasis',
...            'Cardiomegaly',
...            'Consolidation',
...            'Edema',
...            'Effusion',
...            'Emphysema',
...            'Fibrosis',
...            'Hernia',
...            'Infiltration',
...            'Mass',
...            'Nodule',
...            'Pleural_Thickening',
...            'Pneumonia',
...            'Pneumothorax',
...            'No Finding']
>>> s = df['class']
>>> # sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
>>> df_classes = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()
>>> df_classes.head()
   Effusion  Emphysema  Hernia  Infiltration  No Finding  Pleural_Thickening  Pneumothorax
0         0          0       1             0           0                   0             0
1         0          0       1             0           0                   0             0
2         0          0       1             0           0                   0             0
3         0          0       1             0           0                   0             0
4         0          0       0             1           0                   0             0
Since we are working with a toy dataset here, we have to make a few adjustments so that every required class is available in dummies format:
>>> df_classes['Atelectasis'] = 0
>>> df_classes['Cardiomegaly'] = 0
>>> df_classes['Consolidation'] = 0
>>> df_classes['Edema'] = 0
>>> df_classes['Fibrosis'] = 0
>>> df_classes['Mass'] = 0
>>> df_classes['Nodule'] = 0
>>> df_classes['Pneumonia'] = 0
>>> df['No Finding'] = 0
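The manual zero-filling above could probably be shortened with `DataFrame.reindex`, which adds any missing columns in one step. The following is a minimal sketch under the assumption that the expected classes are already one per row; the `expected` series is a made-up stand-in for `df['class']`, not the real data.

```python
import pandas as pd

classes = ['Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema',
           'Effusion', 'Emphysema', 'Fibrosis', 'Hernia', 'Infiltration',
           'Mass', 'Nodule', 'Pleural_Thickening', 'Pneumonia',
           'Pneumothorax', 'No Finding']

# Hypothetical stand-in for df['class'] after explode()
expected = pd.Series(['Hernia', 'Infiltration', 'No Finding', 'Emphysema'],
                     name='class')

# One-hot encode the classes that actually occur ...
df_classes = pd.get_dummies(expected).astype(int)

# ... then add every missing class as an all-zero column, in a fixed order
df_classes = df_classes.reindex(columns=classes, fill_value=0)

print(df_classes.shape)
```

This keeps the column order identical to `classes`, so the later `df_classes[classes]` selection still lines up with the score columns.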
Now we can use sklearn to get the TPR, and ultimately the AUC:
from sklearn.metrics import roc_curve, auc

n_classes = len(classes)
y_test = df_classes[classes].to_numpy()
y_score = df[classes].to_numpy()

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
Now we can look at the roc_auc values. The nan entries arise because not all classes occur in the toy dataset:
>>> roc_auc
1: nan, 2: nan, 3: nan, 4: 0.3125, 5: 0.7613636363636364, 6: nan, 7: 0.9479166666666666, 8: 0.6190476190476191, 9: nan, 10: nan, 11: 0.30208333333333337, 12: nan, 13: 0.7840909090909091, 14: 0.5, 'micro': 0.66562764158918}
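One way to avoid the nan entries is to skip any class whose ground-truth column contains only one distinct value, since a ROC curve needs both positive and negative samples. The sketch below assumes this is acceptable for the evaluation; `per_class_auc` and the small arrays are hypothetical and only illustrate the check.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def per_class_auc(y_true, y_score):
    """Return {class_index: AUC}, skipping classes with a single label value."""
    result = {}
    for i in range(y_true.shape[1]):
        # ROC is undefined without both positives and negatives
        if np.unique(y_true[:, i]).size < 2:
            continue
        fpr, tpr, _ = roc_curve(y_true[:, i], y_score[:, i])
        result[i] = auc(fpr, tpr)
    return result

# Made-up data: class 0 has positives and negatives, class 1 has no positives
y_true = np.array([[1, 0], [0, 0], [1, 0], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.1, 0.4], [0.8, 0.3], [0.3, 0.1]])

aucs = per_class_auc(y_true, y_score)
print(aucs)
```

Only class 0 ends up in the result here; class 1 is left out rather than reported as nan.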
Now we can plot the ROC curve for a single class from its TPR and FPR (mind the classe variable here: since we are working with a toy dataset, some classes are empty):
import matplotlib.pyplot as plt

plt.figure()
lw = 2
classe = 7  # index into classes; 7 is 'Hernia'
plt.plot(fpr[classe], tpr[classe], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[classe])
# Dashed diagonal: the chance-level reference line
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.show()
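If the goal is the raw number of true positives per class rather than a curve, the scores can be thresholded and compared with the ground truth directly. This is a rough sketch: the 0.5 threshold is an assumption, not something taken from the question, and `true_positives_per_class` and the sample arrays are invented for illustration.

```python
import numpy as np

def true_positives_per_class(y_true, y_score, threshold=0.5):
    """Count, per class, the samples predicted positive that are truly positive."""
    y_pred = y_score >= threshold          # assumed decision rule
    hits = np.logical_and(y_pred, y_true.astype(bool))
    return hits.sum(axis=0)

# Made-up ground truth and scores for 3 samples and 2 classes
y_true = np.array([[1, 0], [0, 1], [1, 1]])
y_score = np.array([[0.9, 0.2], [0.6, 0.7], [0.3, 0.8]])

tp = true_positives_per_class(y_true, y_score)
print(tp)
```

With the arrays from the answer, the call would be `true_positives_per_class(y_test, y_score)`, and the result is indexed in the same order as `classes`.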
