How do I create a sentiment analysis corpus in NLTK?

    Traceback (most recent call last):
      File "/Users/jordanXXX/Documents/NLP/bettertrainingdata", line 42, in <module>
        short_pos = open("short_reviews/pos.txt", "r").read
    IOError: [Errno 2] No such file or directory: 'short_reviews/pos.txt'

Edit:

Thank you for the replies.

I took your advice and moved the folder out of NLTK's corpora directory.

I have been experimenting with the location of my folder, and I get a different traceback depending on where it is.

If you are saying the best approach is to use PlaintextCorpusReader, so be it; but perhaps for my application I would want CategorizedPlaintextCorpusReader instead?

sys.argv is definitely not what I meant, so I can read up on that later.

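First, here is my code without any attempt to use PlaintextCorpusReader. It produces the traceback above when the "short_reviews" folder, which contains the pos.txt and neg.txt files, is not inside the NLP folder: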

However, when I move the "short_reviews" folder containing the text files into the NLP folder and run the same code as above, I get the following:

    Traceback (most recent call last):
      File "/Users/jordanXXX/Documents/NLP/bettertrainingdata", line 47, in <module>
        for r in short_pos.split('\n'):
    AttributeError: 'builtin_function_or_method' object has no attribute 'split'

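When I move the "short_reviews" folder containing the text files into the NLP folder and run the following code, which uses PlaintextCorpusReader, I get the traceback shown after it: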

    import nltk
    import random
    from nltk.corpus import movie_reviews
    from nltk.classify.scikitlearn import SklearnClassifier
    import pickle

    from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.svm import SVC, LinearSVC, NuSVC

    from nltk.classify import ClassifierI
    from statistics import mode

    from nltk import word_tokenize

    from nltk.corpus import PlaintextCorpusReader
    corpus_root = 'short_reviews'
    word_lists = PlaintextCorpusReader(corpus_root, '*')
    wordlists.fileids()


    class VoteClassifier(ClassifierI):
        def __init__(self, *classifiers):
            self._classifiers = classifiers

        def classify(self, features):
            votes = []
            for c in self._classifiers:
                v = c.classify(features)
                votes.append(v)
            return mode(votes)

        def confidence(self, features):
            votes = []
            for c in self._classifiers:
                v = c.classify(features)
                votes.append(v)

            choice_votes = votes.count(mode(votes))
            conf = choice_votes / len(votes)
            return conf

    # def main():
    #     file = open("short_reviews/pos.txt", "r")
    #     short_pos = file.readlines()
    #     file.close

    short_pos = open("short_reviews/pos.txt", "r").read
    short_neg = open("short_reviews/neg.txt", "r").read

    documents = []

    for r in short_pos.split('\n'):
        documents.append((r, "pos"))

    for r in short_neg.split('\n'):
        documents.append((r, "neg"))

    all_words = []

    short_pos_words = word.tokenize(short_pos)
    short_neg_words = word.tokenize(short_neg)

    for w in short_pos_words:
        all_words.append(w.lower())

    for w in short_neg_words:
        all_words.append(w.lower())

    all_words = nltk.FreqDist(all_words)

    Traceback (most recent call last):
      File "/Users/jordanXXX/Documents/NLP/bettertrainingdata2", line 18, in <module>
        word_lists = PlaintextCorpusReader(corpus_root, '*')
      File "/Library/Python/2.7/site-packages/nltk/corpus/reader/plaintext.py", line 62, in __init__
        CorpusReader.__init__(self, root, fileids, encoding)
      File "/Library/Python/2.7/site-packages/nltk/corpus/reader/api.py", line 87, in __init__
        fileids = find_corpus_fileids(root, fileids)
      File "/Library/Python/2.7/site-packages/nltk/corpus/reader/util.py", line 763, in find_corpus_fileids
        if re.match(regexp, prefix+fileid)]
      File "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 141, in match
        return _compile(pattern, flags).match(string)
      File "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 251, in _compile
        raise error, v # invalid expression
    error: nothing to repeat

1 Answer

User

#1 · Posted on 2024-04-20 09:39:01

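The answer you refer to contains some very poor (or rather, inapplicable) advice. There is no reason to place your own corpus in nltk_data, and no reason to hack nltk.corpus.__init__.py so that it loads like a built-in corpus. In fact, do not do either of these things.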

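You should use PlaintextCorpusReader. I don't understand your reluctance to do so, but if your files are plain text, it is the right tool. Supposing you have a folder NLP/bettertrainingdata, you can build a reader that loads every .txt file in that folder like this: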

myreader = nltk.corpus.reader.PlaintextCorpusReader(r"NLP/bettertrainingdata", r".*\.txt")
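As a minimal sketch of how such a reader behaves, the example below builds one over a temporary folder. The folder, the two file names, and their contents are made up for illustration and merely stand in for the asker's "short_reviews" data. It also shows why the '*' pattern in the question fails: the second argument is a regular expression rather than a shell glob, and a bare "*" is not a valid regular expression.

```python
import os
import re
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

# Hypothetical stand-in for the asker's folder: a temporary directory
# holding two tiny review files, one review per line.
corpus_root = tempfile.mkdtemp()
samples = {
    "pos.txt": "a great film\nloved it\n",
    "neg.txt": "a dull film\nhated it\n",
}
for name, text in samples.items():
    with open(os.path.join(corpus_root, name), "w", encoding="utf-8") as handle:
        handle.write(text)

# The file pattern is a regular expression, not a shell glob. A bare "*"
# has nothing to repeat, so the regex engine refuses to compile it.
try:
    re.compile("*")
    star_is_valid_regex = True
except re.error:
    star_is_valid_regex = False

# ".*\.txt" matches every file id ending in ".txt" under corpus_root.
reader = PlaintextCorpusReader(corpus_root, r".*\.txt")

file_ids = sorted(reader.fileids())        # file names relative to corpus_root
pos_text = reader.raw("pos.txt")           # whole file as one string
pos_words = list(reader.words("pos.txt"))  # tokens from the default word tokenizer
```

Here raw() returns the file contents as a string, so it can be split on newlines directly, and words() tokenizes without any extra downloaded NLTK data.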

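If you add new files to the folder, the reader will find and use them. If what you want is to be able to run your script against other folders, then just do that: you don't need a different reader, you need to learn about sys.argv. If you are after a categorized corpus built from pos.txt and neg.txt, then you need a CategorizedPlaintextCorpusReader (see its documentation). If you want something else again, please edit your question to explain what you are trying to do.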
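For the categorized case, a rough sketch might look like the following. It assumes one review per line in pos.txt and neg.txt, as the question's code suggests; the temporary folder and the sample reviews are invented for illustration. The cat_pattern argument is a regular expression whose first group, matched against each file id, becomes that file's category.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical stand-in for "short_reviews": a temporary directory with
# one positive and one negative file, one review per line.
corpus_root = tempfile.mkdtemp()
samples = {
    "pos.txt": "a great film\nloved it\n",
    "neg.txt": "a dull film\nhated it\n",
}
for name, text in samples.items():
    with open(os.path.join(corpus_root, name), "w", encoding="utf-8") as handle:
        handle.write(text)

# The first group of cat_pattern supplies the category label, so
# "pos.txt" is labelled "pos" and "neg.txt" is labelled "neg".
reader = CategorizedPlaintextCorpusReader(
    corpus_root,
    r"(pos|neg)\.txt",
    cat_pattern=r"(pos|neg)\.txt",
)

# Build (review, label) pairs, skipping empty lines.
documents = []
for category in reader.categories():
    for line in reader.raw(categories=category).splitlines():
        if line.strip():
            documents.append((line, category))
```

To point the same script at a different folder, the hard-coded root could be replaced with a path taken from sys.argv[1].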

Edit:
