I'm running a classifier over a large body of text, and it's causing a memory error. Python's memory usage climbs to about 2 GB and then it throws the error.
I understand that loading this much data and then trying to process it all at once is what causes the error; I just don't know how to fix it, as I'm very new to Python. I think I need to "chunk" the text input, or process the text line by line, but again I'm not sure how to implement that in my existing code. Any help would be amazing.
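For illustration, here is a minimal sketch of the "line by line" idea: instead of pulling a whole file into memory with readlines(), a generator can read and label one line at a time. The filename and the helper name label_lines are placeholders, not part of the original code:

```python
def label_lines(path, sentiment):
    """Lazily yield (tokens, sentiment) pairs, one line at a time.

    Only a single line is ever held in memory, instead of the
    whole file that readlines() would build.
    """
    with open(path, 'r') as f:
        for line in f:                       # the file object iterates line by line
            tokens = [w.lower() for w in line.split()]
            if tokens:                       # skip blank lines
                yield (tokens, sentiment)

# Usage sketch: iterate without loading the file up front.
# for tokens, label in label_lines('pos.txt', 'positive'):
#     ...
```

Because it is a generator, nothing is read until the loop actually asks for the next pair, so memory stays flat no matter how large the file is.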
Code:
import nltk, pickle
from nltk.corpus import stopwords

customstopwords = []

# Positive and negative example files (paths left blank in the post)
p = open('', 'r')
postxt = p.readlines()
n = open('', 'r')
negtxt = n.readlines()

# One sentiment label per input line
neglist = ['negative'] * len(negtxt)
poslist = ['positive'] * len(postxt)

# Pair each line of text with its label
postagged = zip(postxt, poslist)
negtagged = zip(negtxt, neglist)

print "STAGE ONE"

taggedtweets = postagged + negtagged

# Lowercase and tokenize each line
tweets = []
for (word, sentiment) in taggedtweets:
    word_filter = [i.lower() for i in word.split()]
    tweets.append((word_filter, sentiment))

def getwords(tweets):
    allwords = []
    for (words, sentiment) in tweets:
        allwords.extend(words)
    return allwords

def getwordfeatures(listoftweets):
    # Frequency distribution over every token seen
    wordfreq = nltk.FreqDist(listoftweets)
    words = wordfreq.keys()
    return words

# Drop English stopwords first, then any custom stopwords
wordlist = [i for i in getwordfeatures(getwords(tweets)) if i not in stopwords.words('english')]
wordlist = [i for i in wordlist if i not in customstopwords]

print "STAGE TWO"

def feature_extractor(doc):
    docwords = set(doc)
    features = {}
    for i in wordlist:
        features['contains(%s)' % i] = (i in docwords)
    return features

print "STAGE THREE"

training_set = nltk.classify.apply_features(feature_extractor, tweets)

print "STAGE FOUR"

classifier = nltk.NaiveBayesClassifier.train(training_set)

print "STAGE FIVE"

# Save the trained classifier for later reuse
f = open('my_classifier.pickle', 'wb')
pickle.dump(classifier, f)
f.close()
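One likely contributor to the memory blow-up in the code above: feature_extractor builds one boolean entry per vocabulary word for every tweet, so the feature dicts grow with the full vocabulary. A common mitigation is to cap the feature list at the N most frequent words. This sketch uses collections.Counter to show the idea; top_word_features and the n=2000 cutoff are illustrative assumptions, not part of the original code:

```python
from collections import Counter

def top_word_features(all_words, n=2000, stopset=frozenset()):
    """Return the n most frequent non-stopword tokens.

    A smaller feature vocabulary shrinks every feature dict the
    classifier builds, which is where most of the memory goes.
    """
    freq = Counter(w for w in all_words if w not in stopset)
    return [w for w, _ in freq.most_common(n)]

# Usage sketch, replacing the unbounded wordlist:
# wordlist = top_word_features(getwords(tweets), n=2000,
#                              stopset=frozenset(stopwords.words('english')))
```

Cutting the vocabulary trades a little accuracy for a hard cap on per-example memory, which often resolves exactly this kind of 2 GB failure.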