Top 3 classes from predict_proba()

Posted 2024-05-15 12:46:43


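I am working on a multi-class text classification problem that requires the top 3 predicted labels together with their probabilities. I can call sklearn's predict_proba(), but I am struggling to format the output like Table A. My code is as follows: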

cv = StratifiedKFold(n_splits = 10, random_state = 42, shuffle = None)

pipeline_sgd = Pipeline([
     ('vect', CountVectorizer()),
     ('tfdif', TfidfTransformer()),
     ('nb', CalibratedClassifierCV(base_estimator = SGDClassifier(), cv=cv)),
])
model = pipeline_sgd.fit(X_train, y_train)

n_top_labels = 3
probas = model.predict_proba(test["text"])
top_n_lables_idx = probas.argsort()[::-1][:n_top_labels]
top_n_probs = probas[top_n_lables_idx]
top_n_labels = label_encoder.inverse_transform(top_n_lables_idx.ravel())

results = list(zip(top_n_labels, top_n_probs))

 

Output:

[(A, .80),
 (B, .10),
 (C, .10)]

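My problem with the output above is that it does not give the top 3 labels/probabilities for each row of text. For example, when I run inference on a set of new documents (texts), I get only one output instead of one output per document (row).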
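For illustration, here is a minimal sketch of what the indexing pattern in the code above does. The 4x3 probability matrix is made-up toy data, not output from the actual model. Without an `axis` argument `argsort()` sorts within each row, but `[::-1]` then reverses the order of the rows and `[:n]` keeps the first `n` rows, so the result no longer has one entry per document.

```python
import numpy as np

# Toy probability matrix (assumed data): 4 documents (rows) x 3 classes (columns).
probas = np.array([
    [0.80, 0.15, 0.05],
    [0.06, 0.90, 0.04],
    [0.10, 0.20, 0.70],
    [0.25, 0.35, 0.40],
])
n = 2

# Pattern from the question: argsort() orders each row ascending,
# [::-1] reverses the ROWS (axis 0), and [:n] keeps the first n rows.
question_idx = probas.argsort()[::-1][:n]
print(question_idx.shape)  # (2, 3): n rows, not one row per document

# Per-document top n: sort each row descending, then keep the first n columns.
per_row_idx = np.argsort(-probas, axis=1)[:, :n]
print(per_row_idx.shape)   # (4, 2): one row per document, n columns
```

Slicing along the columns (`[:, :n]`) rather than the rows is what keeps one result per document.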

The second problem is that when I load the results into a DataFrame with pd.DataFrame(data = results), I get the following:

|   | 0 | 1               |
|---|---|-----------------|
| 0 | A | [[.80,.10,.10]] |
| 1 | B | [[.85,.10,.05]] |
| 2 | C | [[.70,.20,.10]] |

It should be:

|   | 0     | 1               |
|---|-------|-----------------|
| 0 | A,B,C | [[.80,.10,.10]] |
| 1 | B,C,A | [[.85,.10,.05]] |
| 2 | C,B,A | [[.70,.20,.10]] |

Table A

| Text                                       | Predicted labels | Probabilities  |
|--------------------------------------------|------------------|----------------|
| Hello  World!                              | A,B,C            | [.80,.10,.10]  |
| Have a nice Day!                           | B,C,A            | [.90,.05,.05]  |
| It's a wonderful day in the neighborhood.  | C,A,B            | [.80,.10,.10]  |

2 Answers

An update to the accepted answer:

n = 3

probas = model.predict_proba(X_train)
top_n_lables_idx = np.argsort(-probas, axis=1)[:, :n]
top_n_probs = np.round(-np.sort(-probas),3)[:, :n]
top_n_labels = [model.classes_[i] for i in top_n_lables_idx]
    
results = list(zip(top_n_labels, top_n_probs))

pd.DataFrame(results)

This ensures that both columns contain only the top 3.
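To get from there to the layout of Table A in the question, something along these lines might work. This is only a sketch: `top_n_table` is a hypothetical helper, and the texts, probabilities, and class names are toy values standing in for the real input texts, `model.predict_proba(...)`, and `model.classes_`.

```python
import numpy as np
import pandas as pd

def top_n_table(texts, probas, classes, n=3):
    """Hypothetical helper: one row per text with its top-n labels and probabilities."""
    probas = np.asarray(probas)
    classes = np.asarray(classes)
    # Column indices of the n largest probabilities in each row, highest first.
    idx = np.argsort(-probas, axis=1)[:, :n]
    # Gather the probabilities with the same indices so labels and values stay aligned.
    top_probs = np.take_along_axis(probas, idx, axis=1)
    return pd.DataFrame({
        "Text": list(texts),
        "Predicted labels": [",".join(map(str, row)) for row in classes[idx]],
        "Probabilities": [np.round(row, 3).tolist() for row in top_probs],
    })

# Toy inputs (assumed data, not real model output).
texts = ["Hello World!", "Have a nice Day!"]
probas = [[0.80, 0.15, 0.05], [0.04, 0.90, 0.06]]
table = top_n_table(texts, probas, classes=["A", "B", "C"], n=3)
print(table)
```

With a fitted pipeline, the call would presumably look like `top_n_table(texts, model.predict_proba(texts), model.classes_)`.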

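When I ran your code, I got a very odd shape for top_n_probs, and I found it hard to map it back to the labels. The argsort call and the code that pulls out the sorted values both look a bit off.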

Below is a quick implementation that should work.

Using an example dataset:

from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier

import pandas as pd
import numpy as np

df = pd.read_csv('./smsspamcollection//SMSSpamCollection', sep='\t', names=["label", "message"])
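# Randomly split 'ham' into two labels to get a 3-class problem.
# .loc avoids pandas' chained-assignment pitfall.
ham = df['label'] == 'ham'
df.loc[ham, 'label'] = np.random.choice(['hamA', 'hamB'], ham.sum())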
X_train = df['message']
y_train = df['label']

My labels look like this:

df['label'].value_counts()

hamB    2425
hamA    2400
spam     747

Then run the code to fit the model:

cv = StratifiedKFold(n_splits = 10, random_state = 42, shuffle = True)

pipeline_sgd = Pipeline([
     ('vect', CountVectorizer()),
     ('tfdif', TfidfTransformer()),
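     # Note: scikit-learn 1.2 renamed `base_estimator` to `estimator` (the old name was removed in 1.4).
     ('nb', CalibratedClassifierCV(base_estimator = SGDClassifier(), cv=cv)),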
])

model = pipeline_sgd.fit(X_train, y_train)

This should work:

n_top_labels = 3
probas = model.predict_proba(X_train[:5])
top_n_lables_idx = np.argsort(-probas)
top_n_probs = np.round(-np.sort(-probas),3)
top_n_labels = [model.classes_[i] for i in top_n_lables_idx]

results = list(zip(top_n_labels, top_n_probs))

pd.DataFrame(results)

    0   1
0   [hamB, hamA, spam]  [0.608, 0.38, 0.012]
1   [hamA, hamB, spam]  [0.605, 0.391, 0.004]
2   [spam, hamB, hamA]  [0.603, 0.212, 0.185]
3   [hamB, hamA, spam]  [0.521, 0.478, 0.001]
4   [hamB, hamA, spam]  [0.645, 0.352, 0.003]
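The example above has only three classes, so sorting every row in full costs nothing. If there were many classes, a partial sort with `np.argpartition` might be worth considering instead. The following is a sketch under that assumption: `top_n_indices` is a hypothetical helper, and the random matrix stands in for real `predict_proba` output.

```python
import numpy as np

def top_n_indices(probas, n):
    """Hypothetical helper: indices of the n largest values per row, highest first."""
    probas = np.asarray(probas)
    # argpartition places the n largest entries of each row in the last n columns (unordered).
    part = np.argpartition(probas, -n, axis=1)[:, -n:]
    # Order just those n candidates by descending probability.
    part_probs = np.take_along_axis(probas, part, axis=1)
    order = np.argsort(-part_probs, axis=1)
    return np.take_along_axis(part, order, axis=1)

# Toy stand-in for predict_proba output: 5 documents, 50 classes, rows sum to 1.
rng = np.random.default_rng(0)
probas = rng.random((5, 50))
probas /= probas.sum(axis=1, keepdims=True)

idx = top_n_indices(probas, n=3)
top_probs = np.take_along_axis(probas, idx, axis=1)
print(idx.shape, top_probs.shape)  # (5, 3) (5, 3)
```

The indices can then be mapped to labels with `model.classes_[idx]`, as in the code above.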
