Top three classes from predict_proba()

    cv = StratifiedKFold(n_splits=10, random_state=42, shuffle=True)
    pipeline_sgd = Pipeline([
        ('vect', CountVectorizer()),
        ('tfdif', TfidfTransformer()),
        ('nb', CalibratedClassifierCV(base_estimator=SGDClassifier(), cv=cv)),
    ])
    model = pipeline_sgd.fit(X_train, y_train)

    n_top_labels = 3
    probas = model.predict_proba(test["text"])
    top_n_labels_idx = probas.argsort()[::-1][:n_top_labels]
    top_n_probs = probas[top_n_labels_idx]
    top_n_labels = label_encoder.inverse_transform(top_n_labels_idx.ravel())
    results = list(zip(top_n_labels, top_n_probs))

| Text                                      | Predicted labels | Probabilities |
|-------------------------------------------|------------------|---------------|
| Hello World!                              | A,B,C            | [.80,.10,.10] |
| Have a nice Day!                          | B,C,A            | [.90,.05,.05] |
| It's a wonderful day in the neighborhood. | C,A,B            | [.80,.10,.10] |

2 answers

User
#1 · edited 2024-05-15 12:46:43

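Update to the accepted answer:

    import numpy as np
    import pandas as pd

    n = 3
    probas = model.predict_proba(X_train)
    # Column indices of the n most probable classes in each row, best first.
    top_n_labels_idx = np.argsort(-probas, axis=1)[:, :n]
    # The matching probabilities, sorted in descending order and rounded.
    top_n_probs = np.round(-np.sort(-probas), 3)[:, :n]
    top_n_labels = [model.classes_[i] for i in top_n_labels_idx]
    results = list(zip(top_n_labels, top_n_probs))
    pd.DataFrame(results)

This ensures that I get only the top 3 in both columns.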
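The same row-wise selection can be wrapped in a small helper. This is only a sketch: the function name `top_n_predictions` and the demo arrays are hypothetical, not part of the original answer, and it uses `np.take_along_axis` instead of sorting the probabilities a second time, so labels and probabilities come from the same indices.

```python
import numpy as np

def top_n_predictions(probas, classes, n=3):
    """Return (labels, probabilities) of the n most probable classes per row."""
    probas = np.asarray(probas)
    classes = np.asarray(classes)
    # Column indices sorted by descending probability within each row.
    top_idx = np.argsort(-probas, axis=1)[:, :n]
    # Reuse the same indices for the probabilities so both outputs stay aligned.
    top_probs = np.take_along_axis(probas, top_idx, axis=1)
    return classes[top_idx], top_probs

# Hypothetical probabilities for two samples over four classes.
demo_probas = np.array([[0.10, 0.60, 0.05, 0.25],
                        [0.70, 0.04, 0.20, 0.06]])
demo_classes = np.array(["A", "B", "C", "D"])
labels, probs = top_n_predictions(demo_probas, demo_classes, n=3)
```

With a fitted pipeline, the call would presumably look like `top_n_predictions(model.predict_proba(X), model.classes_, n=3)`.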

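User
#2 · edited 2024-05-15 12:46:43

When I ran your code, I got a very strange shape for top_n_probs, and I found it hard to recover the labels. The argsort call and the code used to retrieve the sorted values seem a bit odd.
Below is a quick implementation that should work.
Using an example dataset: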
    from sklearn.model_selection import StratifiedKFold
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.linear_model import SGDClassifier
    import pandas as pd
    import numpy as np

    df = pd.read_csv('./smsspamcollection//SMSSpamCollection', sep='\t', names=["label", "message"])
    # Randomly split the 'ham' class into 'hamA' and 'hamB' to get three labels.
    is_ham = df['label'] == 'ham'
    df.loc[is_ham, 'label'] = np.random.choice(['hamA', 'hamB'], is_ham.sum())
    X_train = df['message']
    y_train = df['label']
My labels look like this:

    df['label'].value_counts()

    hamB    2425
    hamA    2400
    spam     747
Then run the code to fit the model:

    cv = StratifiedKFold(n_splits=10, random_state=42, shuffle=True)
    pipeline_sgd = Pipeline([
        ('vect', CountVectorizer()),
        ('tfdif', TfidfTransformer()),
        ('nb', CalibratedClassifierCV(base_estimator=SGDClassifier(), cv=cv)),
    ])
    model = pipeline_sgd.fit(X_train, y_train)
This should work:

    n_top_labels = 3
    probas = model.predict_proba(X_train[:5])
    # Sort each row by descending probability (argsort works on the last axis).
    top_n_labels_idx = np.argsort(-probas)
    top_n_probs = np.round(-np.sort(-probas), 3)
    top_n_labels = [model.classes_[i] for i in top_n_labels_idx]
    results = list(zip(top_n_labels, top_n_probs))
    pd.DataFrame(results)

                        0                      1
    0  [hamB, hamA, spam]   [0.608, 0.38, 0.012]
    1  [hamA, hamB, spam]  [0.605, 0.391, 0.004]
    2  [spam, hamB, hamA]  [0.603, 0.212, 0.185]
    3  [hamB, hamA, spam]  [0.521, 0.478, 0.001]
    4  [hamB, hamA, spam]  [0.645, 0.352, 0.003]
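The odd shape mentioned at the start of this answer can be reproduced on a toy array. The sketch below uses a made-up 5 × 4 probability matrix (not the SMS data) and assumes the question intended `argsort()[::-1][:n]` to pick the top classes per sample. On a 2D array, `argsort()` sorts within each row in ascending order, `[::-1]` then reverses the order of the rows (the samples), and `[:n]` keeps the first n rows, so no per-sample top-n selection ever happens.

```python
import numpy as np

n = 3
# Hypothetical probabilities: 5 samples, 4 classes.
probas = np.array([[0.10, 0.60, 0.05, 0.25],
                   [0.70, 0.04, 0.20, 0.06],
                   [0.30, 0.15, 0.50, 0.05],
                   [0.20, 0.30, 0.10, 0.40],
                   [0.25, 0.35, 0.30, 0.10]])

# Indexing as in the question: reverses and slices the sample axis.
question_idx = probas.argsort()[::-1][:n]
# Indexing rows of probas with a 2D integer array adds an extra dimension.
question_probs = probas[question_idx]

# Row-wise selection: top-n class indices for every sample.
rowwise_idx = np.argsort(-probas, axis=1)[:, :n]

print(question_idx.shape)    # one row per kept sample, one column per class
print(question_probs.shape)  # three-dimensional result
print(rowwise_idx.shape)     # one row per sample, n columns
```

Negating the probabilities and passing `axis=1` keeps the sort within each sample, and slicing with `[:, :n]` trims columns rather than rows.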
