用于情绪分析的nltk朴素贝叶斯分类训练

3条回答

网友

1楼 · 编辑于 2024-05-26 09:17:52

似乎您正在尝试使用TextBlob，但正在培训NLTK NaiveBayesClassifier，正如在其他答案中指出的，它必须通过一个功能字典。

TextBlob有一个默认的特征提取程序，用于指示文档中包含培训集中的哪些单词（如其他答案中所示）。因此，TextBlob允许您按原样传入数据。

from textblob.classifiers import NaiveBayesClassifier

train = [('This is an amazing place!', 'pos'),
        ('I feel very good about these beers.', 'pos'),
        ('This is my best work.', 'pos'),
        ("What an awesome view", 'pos'),
        ('I do not like this restaurant', 'neg'),
        ('I am tired of this stuff.', 'neg'),
        ("I can't deal with this", 'neg'),
        ('He is my sworn enemy!', 'neg'),
        ('My boss is horrible.', 'neg') ] 
test = [
        ('The beer was good.', 'pos'),
        ('I do not enjoy my job', 'neg'),
        ("I ain't feeling dandy today.", 'neg'),
        ("I feel amazing!", 'pos'),
        ('Gary is a friend of mine.', 'pos'),
        ("I can't believe I'm doing this.", 'neg') ] 


classifier = NaiveBayesClassifier(train)  # Pass in data as is
# When classifying text, features are extracted automatically
classifier.classify("This is an amazing library!")  # => 'pos'

当然，简单的默认提取器并不适合所有问题。如果您想知道如何提取特征，只需编写一个函数，该函数以文本字符串作为输入，输出特征字典并将其传递给分类器。

classifier = NaiveBayesClassifier(train, feature_extractor=my_extractor_func)

我建议您在这里查看短TextBlob分类器教程：http://textblob.readthedocs.org/en/latest/classifiers.html

网友

2楼 · 编辑于 2024-05-26 09:17:52

@275365的NLTK贝叶斯分类器数据结构教程很棒。从更高的层面来看，我们可以把它看作

我们有带有情感标签的输入语句：

training_data = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]

让我们将我们的功能集视为单个单词，因此我们从训练数据（我们称之为词汇表）中提取所有可能单词的列表，如下所示：

from nltk.tokenize import word_tokenize
from itertools import chain
vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))

本质上，这里的vocabulary与@275365的all_word是相同的

>>> all_words = set(word.lower() for passage in training_data for word in word_tokenize(passage[0]))
>>> vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))
>>> print vocabulary == all_words
True

从每个数据点（即每个句子和pos/neg标记），我们想说一个特征（即词汇中的一个词）是否存在。

>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> print {i:True for i in vocabulary if i in sentence}
{'this': True, 'i': True, 'sandwich': True, 'love': True, '.': True}

但我们也要告诉分类器哪些单词不存在于句子中，而是存在于词汇中，因此对于每个数据点，我们列出词汇中所有可能的单词，并说明某个单词是否存在：

>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> x =  {i:True for i in vocabulary if i in sentence}
>>> y =  {i:False for i in vocabulary if i not in sentence}
>>> x.update(y)
>>> print x
{'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'good': False, 'best': False, '!': False, 'these': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'ca': False, 'do': False, 'sandwich': True, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False, 'my': False, 'awesome': False, 'view': False}

但由于这两次循环遍历词汇表，这样做更有效：

>>> sentence = word_tokenize('I love this sandwich.'.lower())
>>> x = {i:(i in sentence) for i in vocabulary}
{'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'good': False, 'best': False, '!': False, 'these': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'ca': False, 'do': False, 'sandwich': True, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'i': True, 'stuff': False, 'place': False, 'my': False, 'awesome': False, 'view': False}

所以对于每个句子，我们要告诉每个句子的分类器哪些单词存在，哪些单词不存在，并给它pos/neg标记。我们可以称之为afeature_set，它是由ax（如上图所示）及其标记组成的元组。

>>> feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary},tag) for sentence, tag in training_data]
[({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False, 'best': False, '!': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'these': False, 'of': False, 'work': False, "n't": False, 'i': False, 'stuff': False, 'place': False, 'my': False, 'view': False}, 'pos'), ...]

然后，我们将特征集中的这些特征和标记输入分类器以对其进行训练：

from nltk import NaiveBayesClassifier as nbc
classifier = nbc.train(feature_set)

现在你有了一个训练有素的分类器，当你想标记一个新的句子时，你必须对新的句子进行“特征化”以查看新句子中的哪个单词在分类器所训练的词汇中：

>>> test_sentence = "This is the best band I've ever heard! foobar"
>>> featurized_test_sentence = {i:(i in word_tokenize(test_sentence.lower())) for i in vocabulary}

注意：正如您从上面的步骤中看到的，naive bayes分类器无法处理词汇表外的单词，因为foobar标记在您对其进行特化后消失。

然后你把这个特殊的测试句子输入分类器，并要求它进行分类：

>>> classifier.classify(featurized_test_sentence)
'pos'

希望这能更清楚地说明如何将数据输入NLTK的朴素bayes分类器进行情感分析。以下是没有注释和演练的完整代码：

from nltk import NaiveBayesClassifier as nbc
from nltk.tokenize import word_tokenize
from itertools import chain

training_data = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]

vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))

feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary},tag) for sentence, tag in training_data]

classifier = nbc.train(feature_set)

test_sentence = "This is the best band I've ever heard!"
featurized_test_sentence =  {i:(i in word_tokenize(test_sentence.lower())) for i in vocabulary}

print "test_sent:",test_sentence
print "tag:",classifier.classify(featurized_test_sentence)

网友

3楼 · 编辑于 2024-05-26 09:17:52

你需要改变你的数据结构。这是您当前的train列表：

>>> train = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]

但问题是，每个元组的第一个元素应该是一个功能字典。因此，我将把您的列表更改为分类器可以使用的数据结构：

>>> from nltk.tokenize import word_tokenize # or use some other tokenizer
>>> all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
>>> t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]

现在，您的数据的结构应该如下所示：

>>> t
[({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False, 'best': False, '!': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'these': False, 'of': False, 'work': False, "n't": False, 'i': False, 'stuff': False, 'place': False, 'my': False, 'view': False}, 'pos'), . . .]

注意，每个元组的第一个元素现在是一个字典。现在您的数据已经就位，并且每个元组的第一个元素是一个字典，您可以像这样训练分类器：

>>> import nltk
>>> classifier = nltk.NaiveBayesClassifier.train(t)
>>> classifier.show_most_informative_features()
Most Informative Features
                    this = True              neg : pos    =      2.3 : 1.0
                    this = False             pos : neg    =      1.8 : 1.0
                      an = False             neg : pos    =      1.6 : 1.0
                       . = True              pos : neg    =      1.4 : 1.0
                       . = False             neg : pos    =      1.4 : 1.0
                 awesome = False             neg : pos    =      1.2 : 1.0
                      of = False             pos : neg    =      1.2 : 1.0
                    feel = False             neg : pos    =      1.2 : 1.0
                   place = False             neg : pos    =      1.2 : 1.0
                horrible = False             pos : neg    =      1.2 : 1.0

如果你想使用分类器，你可以这样做。首先，从一个测试句子开始：

>>> test_sentence = "This is the best band I've ever heard!"

然后，标记句子并找出句子与所有单词共享的单词。这些构成了句子的特点。

>>> test_sent_features = {word: (word in word_tokenize(test_sentence.lower())) for word in all_words}

现在您的功能如下：

>>> test_sent_features
{'love': False, 'deal': False, 'tired': False, 'feel': False, 'is': True, 'am': False, 'an': False, 'sandwich': False, 'ca': False, 'best': True, '!': True, 'what': False, 'i': True, '.': False, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'these': False, 'stuff': False, 'place': False, 'my': False, 'view': False}

然后，您只需对这些功能进行分类：

>>> classifier.classify(test_sent_features)
'pos' # note 'best' == True in the sentence features above

这个测试句子似乎是肯定的。

相关问题更多 >

编程相关推荐

热门问题

热门文章