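NLTK Naive Bayes classifier input formatting
I've run into a problem I can't figure out. I'm fairly new to Python and NLTK and want to build a Naive Bayes classifier, but I'm not sure what format the input should take: a list of tuples, a dictionary, or a tuple containing two lists.
I tried the format below and got this error: AttributeError: 'str' object has no attribute 'items'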
[('maggie: just a push button. and the electric car uses sensors to drive itself. \n', 'notending')]
This format also fails, with the error AttributeError: 'list' object has no attribute 'items'
[([['the', 'fire', 'chief', 'says', 'someone', 'started', 'the', 'blaze', 'on', 'purpose', 'as', 'a', 'controlled', 'burn', ',', 'but', 'it', 'quickly', 'got', 'out', 'of', 'hand', '.']], 'notending')]
If I use a dictionary instead, I get this error: ValueError: too many values to unpack
{'everyone: bye!': 'ending'}
The code I use to call the Naive Bayes classifier is classifier = nltk.NaiveBayesClassifier.train(d_train)
I'm not sure what is going wrong. Thanks very much for any help!
1 Answer
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import stopwords
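# A set gives fast membership tests when filtering stopwords
stopset = set(stopwords.words('english'))
def word_feats(words):
    # Map each non-stopword token to True; train() expects a dict of features
    return {word: True for word in words.split() if word not in stopset}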
posids = ['I love this sandwich.', 'I feel very good about these beers.']
negids = ['I hate this sandwich.', 'I feel worst about these beers.']
pos_feats = [(word_feats(f), 'positive') for f in posids ]
neg_feats = [(word_feats(f), 'negative') for f in negids ]
print(pos_feats)
print(neg_feats)
trainfeats = pos_feats + neg_feats
classifier = NaiveBayesClassifier.train(trainfeats)
Here are the positive and negative features:
[({'I': True, 'love': True, 'sandwich.': True}, 'positive'), ({'I': True, 'feel': True, 'good': True, 'beers.': True}, 'positive')]
[({'I': True, 'hate': True, 'sandwich.': True}, 'negative'), ({'I': True, 'feel': True, 'beers.': True, 'worst': True}, 'negative')]
So if you give the classifier the sentence 'I hate everything' to classify
print(classifier.classify(word_feats('I hate everything')))
the result you get is 'negative'.
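To tie this back to the three formats in the question: train() loops over (featureset, label) pairs and calls .items() on each featureset, so the first element of every pair has to be a dict. A raw string or a list of tokens has no .items(), and iterating over a plain dict yields only its keys (strings), which cannot be unpacked into two values. Below is a minimal sketch of reshaping the question's data. The helper name to_featureset and the variable names are made up for illustration, the token list is shortened, and plain whitespace splitting is only an assumption about how you want to tokenize.

```python
def to_featureset(text_or_tokens):
    """Turn a raw string or a flat list of tokens into a {token: True} dict."""
    if isinstance(text_or_tokens, str):
        tokens = text_or_tokens.split()  # naive whitespace tokenization (assumption)
    else:
        tokens = text_or_tokens
    return {token: True for token in tokens}

# Format 1 from the question: (raw string, label)
raw_pairs = [
    ('maggie: just a push button. and the electric car uses sensors to drive itself. \n',
     'notending'),
]

# Format 2 from the question: (token list, label).
# The token list is shortened here and must be flat, not nested in another list.
token_pairs = [
    (['the', 'fire', 'chief', 'says'], 'notending'),
]

# Format 3 from the question: {text: label}
label_by_text = {'everyone: bye!': 'ending'}

train_set = (
    [(to_featureset(text), label) for text, label in raw_pairs]
    + [(to_featureset(tokens), label) for tokens, label in token_pairs]
    + [(to_featureset(text), label) for text, label in label_by_text.items()]
)

# Every element is now a (dict, label) pair, the shape train() iterates over:
# classifier = nltk.NaiveBayesClassifier.train(train_set)
```

The train call is left as a comment so the snippet runs without NLTK installed. With NLTK available, train_set can be passed to it directly.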