Random forest "perfect" confusion matrix

Posted 2024-06-16 09:06:50


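I have a classification problem: I want to identify potential borrowers who should not be invited to a meeting at a bank. According to the data, about 25% of borrowers should not be invited. I have about 4,500 observations and 86 features (many of them dummy variables).

After cleaning the data, I do the following: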

# Separate the features (X) from the target (y); the target is the last column

X = ratings_prepared[:, :-1]
y = ratings_prepared[:, -1]

##################################################################################

# Separate test and train (stratified, 20% test)

import numpy as np
from sklearn.model_selection import StratifiedKFold

from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

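# Note: every iteration overwrites the previous split, so only the last
# of the five folds survives as the 80/20 train/test split.
for train_index, test_index in skfolds.split(X, y):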
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]

Then I start training models. The SGD classifier does not work very well:

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label =label)
    plt.plot([0,1], [0,1],'k--')
    plt.axis([0,1,0,1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1],"b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="center left")
    plt.ylim([0,1])

############################# Train Models #############################

from sklearn.linear_model import SGDClassifier

sgd_clf =SGDClassifier(random_state=42)
sgd_clf.fit(X_train,y_train)
y_pred = sgd_clf.predict(X_train)

# f1 score

from sklearn.metrics import f1_score

f1_score(y_train, y_pred)

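# confusion matrix

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

tn, fp, fn, tp = confusion_matrix(y_train, y_pred).ravel()
(tn, fp, fn, tp)

# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement.
from sklearn.metrics import ConfusionMatrixDisplay

disp = ConfusionMatrixDisplay.from_estimator(sgd_clf, X_train, y_train,
                                             cmap=plt.cm.Blues,
                                             normalize='true')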

# recall and precision

from sklearn.metrics import precision_score, recall_score
precision_score(y_train, y_pred)

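# Precision Recall

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve, roc_curve

# Out-of-fold decision scores, shared by the precision/recall and ROC plots
y_scores = cross_val_predict(sgd_clf, X_train, y_train, cv=3, method="decision_function")

# Assumed step: the original post never defines precisions/recalls/thresholds
precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_train, y_scores)

plot_roc_curve(fpr, tpr)
plt.show()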

# recall and precision

from sklearn.metrics import precision_score, recall_score
precision_score(y_train, y_pred)
### Precision score: 0.5084427767354597

[Figure: results from the SGD classifier]

Then I move on to a random forest classifier, which should improve on SGD:

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train, cv=3, method='predict_proba')
y_scores_forest = y_probas_forest[:,1]
fpr_forest, tpr_forest, threshold_forest = roc_curve(y_train,y_scores_forest)

plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()

Indeed, the ROC curve looks better:

[Figure: ROC curve for the random forest]

But the confusion matrix and the precision score are very strange:

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train, cv=3, method='predict_proba')
y_scores_forest = y_probas_forest[:,1]
fpr_forest, tpr_forest, threshold_forest = roc_curve(y_train,y_scores_forest)

forest_clf.fit(X_train,y_train)
y_pred = forest_clf.predict(X_train)


# f1 score

f1_score(y_train, y_pred)

# confusion matrix

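# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement.
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

disp = ConfusionMatrixDisplay.from_estimator(forest_clf, X_train, y_train,
                                             cmap=plt.cm.Blues,
                                             normalize='true')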

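[Figure: confusion matrix for the random forest]

The F1 score is also 1. I don't understand what is happening here. I suspect I made a mistake, but the fact that the SGD classifier seems to work normally makes me think it has nothing to do with the data cleaning.

Any idea what could be going wrong?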


Update:

1) Confusion matrix in absolute values:

[image not available]

2) With a lowered threshold:

[image not available]


1 answer
User
#1 · Posted 2024-06-16 09:06:50

You get a perfect score because you are not computing your metrics on the test data.

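In the first block you make an 80/20 train/test split, but every metric (ROC, confusion matrix, etc.) is computed on the original training data, not on the test data.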
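To make the effect concrete, here is a minimal sketch on synthetic data. `make_classification` is only a stand-in for your `ratings_prepared` array, and the sample size, feature count, and noise level are assumptions rather than your data. It compares F1 on the rows the forest was fit on with F1 on out-of-fold predictions from `cross_val_predict`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (assumption): roughly 25% positives, some label noise
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.75, 0.25],
    flip_y=0.05, random_state=42,
)

forest = RandomForestClassifier(random_state=42)

# Predictions on the very rows the forest was fit on
forest.fit(X_demo, y_demo)
f1_train = f1_score(y_demo, forest.predict(X_demo))

# Out-of-fold predictions: each row is predicted by a model that never saw it
y_oof = cross_val_predict(forest, X_demo, y_demo, cv=3)
f1_oof = f1_score(y_demo, y_oof)

print(f"F1 on the fitted rows: {f1_train:.3f}")
print(f"F1 out-of-fold:        {f1_oof:.3f}")
```

This would also be consistent with your ROC curve looking plausible: it was built from `cross_val_predict` scores, whereas the confusion matrix came from `forest_clf.predict(X_train)` after fitting on the same rows.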

With a setup like this, your reports only show that you are overfitting like crazy.

What you should do is apply the trained model to the test data and see how it performs there.
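A minimal sketch of that evaluation, assuming synthetic stand-in data and a single stratified split via `train_test_split` (both are my assumptions; with your own arrays you would reuse the `X_train`/`X_test` you already have):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and target (assumption)
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.75, 0.25],
    flip_y=0.05, random_state=42,
)

# Single stratified 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

forest_clf = RandomForestClassifier(random_state=42)
forest_clf.fit(X_train, y_train)        # fit on the training rows only

y_test_pred = forest_clf.predict(X_test)  # predict rows the model has never seen
test_f1 = f1_score(y_test, y_test_pred)
tn, fp, fn, tp = confusion_matrix(y_test, y_test_pred).ravel()

print("Test F1:", test_f1)
print("tn, fp, fn, tp:", (tn, fp, fn, tp))
```

The numbers from the held-out rows are the ones to report; the training-set confusion matrix mostly tells you how well the forest memorized its input.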
