predict_proba or decision_function as estimator "confidence"

Posted 2024-04-27 17:56:28


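I am using logistic regression as the model to train an estimator in scikit-learn. The features I use are (mostly) categorical, and so are the labels. I therefore use DictVectorizer and LabelEncoder, respectively, to encode the values properly.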

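The training part is fairly straightforward, but I am having problems with the test part. The simple thing to do is to call the "predict" method of the trained model and get the predicted label. However, for the processing I need to do afterwards, I need the probability of every possible label (class) for each particular instance, so I decided to use the "predict_proba" method. The problem is that I get different results for the same test instance depending on whether it is passed to this method by itself or accompanied by other instances.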

The code below reproduces the problem.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder


X_real = [{'head': u'n\xe3o', 'dep_rel': u'ADVL'}, 
          {'head': u'v\xe3o', 'dep_rel': u'ACC'}, 
          {'head': u'empresa', 'dep_rel': u'SUBJ'}, 
          {'head': u'era', 'dep_rel': u'ACC'}, 
          {'head': u't\xeam', 'dep_rel': u'ACC'}, 
          {'head': u'import\xe2ncia', 'dep_rel': u'PIV'}, 
          {'head': u'balan\xe7o', 'dep_rel': u'SUBJ'}, 
          {'head': u'ocupam', 'dep_rel': u'ACC'}, 
          {'head': u'acesso', 'dep_rel': u'PRED'}, 
          {'head': u'elas', 'dep_rel': u'SUBJ'}, 
          {'head': u'assinaram', 'dep_rel': u'ACC'}, 
          {'head': u'agredido', 'dep_rel': u'SUBJ'}, 
          {'head': u'pol\xedcia', 'dep_rel': u'ADVL'}, 
          {'head': u'se', 'dep_rel': u'ACC'}] 
y_real = [u'AM-NEG', u'A1', u'A0', u'A1', u'A1', u'A1', u'A0', u'A1', u'AM-ADV', u'A0', u'A1', u'A0', u'A2', u'A1']

feat_encoder =  DictVectorizer()
feat_encoder.fit(X_real)

label_encoder = LabelEncoder()
label_encoder.fit(y_real)

model = LogisticRegression()
model.fit(feat_encoder.transform(X_real), label_encoder.transform(y_real))

print("Test 1...")
X_test1 = [{'head': u'governo', 'dep_rel': u'SUBJ'}]
X_test1_encoded = feat_encoder.transform(X_test1)
print("Features Encoded")
print(X_test1_encoded)
print("Shape")
print(X_test1_encoded.shape)
print("decision_function:")
print(model.decision_function(X_test1_encoded))
print("predict_proba:")
print(model.predict_proba(X_test1_encoded))

print("Test 2...")
X_test2 = [{'head': u'governo', 'dep_rel': u'SUBJ'},
           {'head': u'atrav\xe9s', 'dep_rel': u'ADVL'},
           {'head': u'configuram', 'dep_rel': u'ACC'}]

X_test2_encoded = feat_encoder.transform(X_test2)
print("Features Encoded")
print(X_test2_encoded)
print("Shape")
print(X_test2_encoded.shape)
print("decision_function:")
print(model.decision_function(X_test2_encoded))
print("predict_proba:")
print(model.predict_proba(X_test2_encoded))


print("Test 3...")
X_test3 = [{'head': u'governo', 'dep_rel': u'SUBJ'},
           {'head': u'atrav\xe9s', 'dep_rel': u'ADVL'},
           {'head': u'configuram', 'dep_rel': u'ACC'},
           {'head': u'configuram', 'dep_rel': u'ACC'}]

X_test3_encoded = feat_encoder.transform(X_test3)
print("Features Encoded")
print(X_test3_encoded)
print("Shape")
print(X_test3_encoded.shape)
print("decision_function:")
print(model.decision_function(X_test3_encoded))
print("predict_proba:")
print(model.predict_proba(X_test3_encoded))

The output obtained is as follows:

[output not preserved in the source]

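As the output shows, the values returned by "predict_proba" for the same instance change when it is accompanied by the other instances in X_test2. In addition, X_test3 simply reproduces X_test2 and adds one more instance (identical to the last one in X_test2), yet the probability values change for all instances. Why does this happen? I also find it very strange that all the probabilities for X_test1 are 1. Shouldn't they sum to 1 instead?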
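The two expectations stated above (each row sums to 1, and a row does not depend on the rest of the batch) can be checked directly. The following is a minimal sketch on made-up toy records, not the data from the question, and it assumes a scikit-learn release that does not have the bug discussed here.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data (not the data from the question).
X_train = [{'head': 'era', 'dep_rel': 'ACC'},
           {'head': 'se', 'dep_rel': 'ACC'},
           {'head': 'elas', 'dep_rel': 'SUBJ'},
           {'head': 'empresa', 'dep_rel': 'SUBJ'},
           {'head': 'acesso', 'dep_rel': 'ADVL'},
           {'head': 'agora', 'dep_rel': 'ADVL'}]
y_train = ['A1', 'A1', 'A0', 'A0', 'AM-ADV', 'AM-ADV']

vec = DictVectorizer()
model = LogisticRegression()
model.fit(vec.fit_transform(X_train), y_train)

# The same instance, once alone and once as the first row of a batch.
single = vec.transform([{'head': 'governo', 'dep_rel': 'SUBJ'}])
batch = vec.transform([{'head': 'governo', 'dep_rel': 'SUBJ'},
                       {'head': 'configuram', 'dep_rel': 'ACC'}])

proba_single = model.predict_proba(single)
proba_batch = model.predict_proba(batch)

# Each row is a distribution over model.classes_, so it should sum to 1.
rows_sum_to_one = np.allclose(proba_batch.sum(axis=1), 1.0)
# The first row should not depend on which other rows are in the batch.
same_alone_and_in_batch = np.allclose(proba_single[0], proba_batch[0])
print(rows_sum_to_one, same_alone_and_in_batch)
```

If either flag comes out False, the installed library version is the first thing worth checking.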

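Now, if I use "decision_function" instead of "predict_proba", I get the consistency in the values that I need. The problem is that I get negative values, and even some of the positive ones are greater than 1.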
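Values from "decision_function" are unbounded scores rather than probabilities, so negative values and values above 1 are expected. The sketch below, on made-up numeric data, shows that the largest score picks the predicted class and that a softmax maps the scores to rows that sum to 1. The softmax helper is written here for illustration only, and whether its result equals "predict_proba" depends on the multiclass scheme the fitted model uses, so treat that comparison as an assumption to verify rather than a guarantee.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: three classes, two numeric features.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0],
              [1.1, 0.9], [0.0, 1.0], [0.1, 1.2]])
y = np.array([0, 0, 1, 1, 2, 2])

model = LogisticRegression().fit(X, y)

def softmax(scores):
    """Illustrative helper: row-wise softmax of a 2-D score array."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

scores = model.decision_function(X)   # shape (n_samples, n_classes), unbounded
predicted = model.classes_[scores.argmax(axis=1)]
probs = softmax(scores)

print(scores.min(), scores.max())
# The highest score per row selects the predicted class.
print(np.array_equal(predicted, model.predict(X)))
# Whether this holds depends on the multiclass scheme of the fitted model.
print(np.allclose(probs, model.predict_proba(X)))
```

For ranking instances by confidence, the raw scores are usable as they are. For anything that needs values in [0, 1], "predict_proba" is the more direct choice.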

So, which one should I use? Why do the values of "predict_proba" change like this? Am I misunderstanding what these values mean?

Thanks in advance for any help you can give me.

Update

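As suggested, I changed the code so that it also prints the encoded "X_test1", "X_test2" and "X_test3", as well as their shapes. The encoding does not appear to be the problem, since the same instance is encoded consistently across the test sets.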
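Instead of comparing the printed sparse matrices by eye, the consistency of the encoding can be asserted in code. This is a small sketch with a vectorizer fitted on two made-up records. It also shows that a feature value never seen during fitting (here the head "governo") is silently dropped by DictVectorizer.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer

# Hypothetical training records, used only to fit the vectorizer.
vec = DictVectorizer()
vec.fit([{'head': 'era', 'dep_rel': 'ACC'},
         {'head': 'elas', 'dep_rel': 'SUBJ'}])

instance = {'head': 'governo', 'dep_rel': 'SUBJ'}  # 'governo' was not seen in fit
alone = vec.transform([instance]).toarray()
in_batch = vec.transform([instance,
                          {'head': 'era', 'dep_rel': 'ACC'}]).toarray()

# The row for the shared instance should be identical in both encodings.
encoding_is_consistent = np.array_equal(alone[0], in_batch[0])
# Features unseen during fit are dropped, so only the dep_rel column is set.
active_features = [vec.feature_names_[i] for i in np.flatnonzero(alone[0])]
print(encoding_is_consistent, active_features)
```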


1 Answer

User

Answer 1 · Posted 2024-04-27 17:56:28

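As the comments on the question pointed out, the error was caused by a bug in the implementation in the version of scikit-learn I was using. Updating to the latest stable release, 0.12.1, solved the problem.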
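To see which release is installed before and after upgrading, the version string can be inspected from Python. This is a minimal sketch. The (0, 12) threshold is an illustrative assumption taken from the release named in this answer, not an official statement of where the fix landed.

```python
import re
import sklearn

# Keep only the numeric major/minor parts, e.g. "0.12.1" -> (0, 12).
match = re.match(r"(\d+)\.(\d+)", sklearn.__version__)
installed = (int(match.group(1)), int(match.group(2)))
minimum = (0, 12)  # hypothetical threshold, based on the release named above

print(sklearn.__version__)
print(installed >= minimum)
```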
