model_selection.KFold gives different results from kf.split

Posted 2024-04-25 10:15:17


I am working on a dataset, TelcoSigtel, which has 5k observations, 21 features, and an imbalanced target with 86% non-churners and 16% churners.

Sorry, I wanted to include an extract of the data frame, but it is too big, and when I take a small sample there are not enough churners in it.

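My problem is that the two methods below should give the same results, yet for some algorithms they differ a lot, while for others they give exactly the same results.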

The models being compared:

models = [('logit',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=600,
                     multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
                     solver='liblinear', tol=0.0001, verbose=0, warm_start=False)), ....]
# Method 1:

import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.model_selection import KFold


X = telcom.drop("churn", axis=1)
Y = telcom["churn"]

results = []
names = []

seed = 0
scoring = "roc_auc"
for name, model in models:
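    # random_state is omitted: it has no effect unless shuffle=True,
    # and recent scikit-learn versions raise an error for that combination
    kfold = model_selection.KFold(n_splits=5)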

    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison-AUC')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.grid()

plt.show()


# Method 2:


import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import roc_auc_score

# random_state is omitted: it has no effect unless shuffle=True
kf = KFold(n_splits=5)

X = telcom.drop("churn", axis=1)
Y = telcom["churn"]

results = []
names = []

to_store1 = list()

seed = 0
scoring = "roc_auc"

cv_results = np.array([])

for name, model in models:
    for train_index, test_index in kf.split(X):
        # split the data
        X_train, X_test = X.loc[train_index,:].values, X.loc[test_index,:].values
        y_train, y_test = np.ravel(Y[train_index]), np.ravel(Y[test_index])  

        model = model  # Choose a model here
        model.fit(X_train, y_train )  
        y_pred = model.predict(X_test)

        to_store1.append(train_index)

        # store fold results
        result = roc_auc_score(y_test, y_pred)
        cv_results = np.append(cv_results, result)

    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
    cv_results = np.array([])   

# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison-AUC')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.grid()

plt.show()

1 answer
User
#1 · Posted 2024-04-25 10:15:17

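In short, you should use model.predict_proba(X_test)[:, 1] or model.decision_function(X_test) to get identical results, because the ROC AUC scorer needs class probabilities or decision scores, not hard class labels. For the long answer, you can reproduce the same behavior with a toy example: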

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, make_scorer

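def assert_equal_scores(rnd_seed, needs_threshold):
    """Assert that two different scorers return equal results."""
    X, y = load_breast_cancer(return_X_y=True)
    # shuffle=True is required for random_state to take effect
    kfold = KFold(shuffle=True, random_state=rnd_seed)
    lr = LogisticRegression(random_state=rnd_seed + 10)
    # needs_threshold=True scores continuous outputs (decision_function or
    # predict_proba); False scores the hard labels returned by predict().
    # Recent scikit-learn releases deprecate this flag in favor of response_method.
    roc_auc_scorer = make_scorer(roc_auc_score, needs_threshold=needs_threshold)
    cv_scores1 = cross_val_score(lr, X, y, cv=kfold, scoring=roc_auc_scorer)
    cv_scores2 = cross_val_score(lr, X, y, cv=kfold, scoring='roc_auc')
    np.testing.assert_equal(cv_scores1, cv_scores2)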

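Try assert_equal_scores(10, False) and assert_equal_scores(10, True) (or any other random seed). The first one raises an AssertionError. The difference is that the ROC AUC scorer requires the needs_threshold parameter to be True; with False, the scorer is fed the hard labels from predict, which is what the manual loop in your Method 2 does.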
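
To tie this back to the manual loop in the question, here is a minimal sketch of the same comparison written as an explicit kf.split loop. It is not your exact setup: the breast cancer dataset, the StandardScaler pipeline, and the variable names (reference, auc_from_labels, auc_from_scores) are assumptions chosen for illustration. It computes the per-fold AUC once from predict and once from predict_proba, next to what cross_val_score returns for scoring="roc_auc".

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data and model (assumptions, not the Telco dataset from the question).
X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Reference: what Method 1 computes.
reference = cross_val_score(model, X, y, cv=kf, scoring="roc_auc")

auc_from_labels = []  # Method 2 as posted: AUC of hard 0/1 predictions
auc_from_scores = []  # suggested fix: AUC of positive-class probabilities
for train_index, test_index in kf.split(X):
    # Fit a fresh copy on each fold, as cross_val_score does internally.
    fitted = clone(model).fit(X[train_index], y[train_index])
    y_test = y[test_index]
    auc_from_labels.append(roc_auc_score(y_test, fitted.predict(X[test_index])))
    auc_from_scores.append(
        roc_auc_score(y_test, fitted.predict_proba(X[test_index])[:, 1])
    )

print("cross_val_score:", reference)
print("manual, predict:", np.array(auc_from_labels))
print("manual, predict_proba:", np.array(auc_from_scores))
```

The predict_proba row should agree with cross_val_score up to floating-point tolerance, while the predict row generally will not, because thresholding collapses the ranking that AUC measures. Models whose probabilities are already close to 0 or 1 lose little from thresholding, which may be why some of your algorithms gave matching results in both methods.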
