Precision drops significantly when a classifier trained on undersampled data is tested on the entire dataset

Posted 2024-04-20 11:34:31


I am working on the Kaggle credit card fraud detection dataset.

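There is a severe imbalance between {} (fraudulent transactions) and {} (non-fraudulent transactions). To compensate, I undersampled the data so that the ratio of fraudulent to non-fraudulent transactions is 1:1 (492 each). When I train a logistic regression classifier on the undersampled, balanced data, it performs well. However, when I take that same classifier and test it on the entire dataset, recall is still good but precision drops significantly.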
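For comparison, one common way to arrange this kind of experiment is to split first and undersample only the training rows, so that the held-out rows keep the original class ratio and never overlap with the training data. The following is a minimal sketch of that idea on toy labels, not the code from this post: the helper name `undersample_indices`, the 1000-row label vector, and the 330-row hold-out are all assumptions made for illustration, and it assumes class 1 is the minority.

```python
import numpy as np


def undersample_indices(y, rng):
    """Return positions that keep every class-1 row plus an equally sized
    random sample of class-0 rows (assumes class 1 is the minority)."""
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    sampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    return np.sort(np.concatenate([minority_idx, sampled_majority]))


rng = np.random.default_rng(42)

# Toy labels (hypothetical): 1000 rows, 20 of them positive.
y = np.zeros(1000, dtype=int)
y[rng.choice(1000, size=20, replace=False)] = 1

# 1) Split first, so the held-out rows keep the original imbalance.
order = rng.permutation(len(y))
test_idx, train_idx = order[:330], order[330:]

# 2) Balance only the training rows; the test rows are left untouched.
balanced_train_idx = train_idx[undersample_indices(y[train_idx], rng)]

train_counts = np.bincount(y[balanced_train_idx], minlength=2)
print("balanced train class counts:", train_counts)
print("test positive rate:", y[test_idx].mean())
```

With this arrangement, metrics computed on `test_idx` would reflect the imbalanced distribution without reusing any training row.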

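I know that high recall matters more for this kind of problem, but I would still like to understand why precision tanks and whether that is acceptable.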
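One way to reason about the question above is that precision, unlike recall, depends on how common the positive class is in the data being scored. The sketch below expresses precision in terms of recall, false positive rate, and class prevalence. The function name `precision_from_rates` and the rates 0.90 and 0.04 are illustrative assumptions, not values taken from the post.

```python
def precision_from_rates(recall, false_positive_rate, prevalence):
    """Precision implied by a fixed recall and false positive rate when a
    fraction `prevalence` of the scored rows is truly positive."""
    true_positives = recall * prevalence
    false_positives = false_positive_rate * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Illustrative rates (assumed): the classifier itself does not change,
# only the share of positives in the evaluation data does.
recall, fpr = 0.90, 0.04
for prevalence in (0.5, 0.1, 0.01, 0.002):
    precision = precision_from_rates(recall, fpr, prevalence)
    print(f"prevalence={prevalence:<6} precision={precision:.3f}")
```

Under these assumed rates the same classifier looks precise on a 1:1 sample and imprecise on a heavily imbalanced one, because the few false positives per negative row are multiplied by a much larger number of negative rows.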

Code:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

def model_report(y_test, pred):
    print("Accuracy:\t", accuracy_score(y_test, pred))
    print("Precision:\t", precision_score(y_test, pred))
    print("RECALL:\t\t", recall_score(y_test, pred))
    print("F1 Score:\t", f1_score(y_test, pred))

df = pd.read_csv("data/creditcard.csv")
target = 'Class'
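# Features are every column except the target; y is a 1-D Series,
# which avoids scikit-learn's column-vector warning when fitting.
X = df.drop(columns=[target])
y = df[target]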
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

print("WITHOUT UNDERSAMPLING:")
clf = LogisticRegression().fit(x_train, y_train)
pred = clf.predict(x_test)
model_report(y_test, pred)

# Creates the undersampled DataFrame with 492 fraud and 492 clean
minority_class_len = len(df[df[target] == 1])
minority_class_indices = df[df[target] == 1].index
majority_class_indices = df[df[target] == 0].index
random_majority_indices = np.random.choice(majority_class_indices, minority_class_len, replace=False)
undersample_indices = np.concatenate([minority_class_indices, random_majority_indices])
undersample = df.loc[undersample_indices]

# Same feature/label separation for the balanced subset
X_undersample = undersample.drop(columns=[target])
y_undersample = undersample[target]
x_train, x_test, y_train, y_test = train_test_split(X_undersample, y_undersample, test_size=0.33, random_state=42)

print("\nWITH UNDERSAMPLING:")
clf = LogisticRegression().fit(x_train, y_train)
pred = clf.predict(x_test)
model_report(y_test, pred)

# Note: X holds every row, including the undersampled rows the model was trained on.
print("\nWITH UNDERSAMPLING & TESTING ON ENTIRE DATASET:")
pred = clf.predict(X)
model_report(y, pred)

Output:

WITHOUT UNDERSAMPLING:
Accuracy:        0.9989679423750093
Precision:       0.7241379310344828
RECALL:          0.5637583892617449
F1 Score:        0.6339622641509434

WITH UNDERSAMPLING:
Accuracy:        0.9353846153846154
Precision:       0.9673202614379085
RECALL:          0.9024390243902439
F1 Score:        0.9337539432176657

WITH UNDERSAMPLING & TESTING ON ENTIRE DATASET:
Accuracy:        0.9595936897618387
Precision:       0.03760913364674278
RECALL:          0.9105691056910569
F1 Score:        0.07223476297968398
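The last block of metrics can be turned back into approximate confusion-matrix counts, which makes the size of the effect easier to see. The sketch below uses only the numbers printed above plus the 492 fraud rows mentioned in the post. It assumes fraud is the positive class, and the total row count it prints is inferred from the accuracy rather than read from the data.

```python
# Values copied from the post: 492 fraud rows and the full-dataset metrics.
n_fraud = 492
accuracy = 0.9595936897618387
precision = 0.03760913364674278
recall = 0.9105691056910569

# recall = TP / (TP + FN), and TP + FN is the number of fraud rows.
tp = round(recall * n_fraud)
fn = n_fraud - tp

# precision = TP / (TP + FP), so TP / precision is the number of flagged rows.
fp = round(tp / precision) - tp

# accuracy = 1 - (FP + FN) / N, which gives an estimate of the row count N.
n_total = round((fp + fn) / (1 - accuracy))
n_clean = n_total - n_fraud

false_positive_rate = fp / n_clean
# Precision the same error rates would give if both classes were equally common.
balanced_precision = recall / (recall + false_positive_rate)

print(f"TP={tp} FN={fn} FP={fp} N={n_total}")
print(f"false positive rate: {false_positive_rate:.4f}")
print(f"precision at a 1:1 class ratio: {balanced_precision:.4f}")
```

Read this way, a false positive rate of only a few percent still produces far more false alarms than the 492 fraud rows can offset, while the same rates at a 1:1 ratio give a precision close to the one reported on the undersampled test split. Because the full dataset includes the training rows, these figures are likely slightly optimistic.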


0 answers

No answers yet.
