停用词 NLTK/Python 问题
我有一些代码用来处理一个数据集,以便以后使用。我用来处理停用词的代码看起来没问题,但我觉得问题出在我代码的其他部分,因为它似乎只去掉了一部分停用词。
import re
import nltk
# Quran subset
filename = 'subsetQuran.txt'
# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
word_list2 = [w for w in word_list if not w in nltk.corpus.stopwords.words('english')]
# create dictionary of word:frequency pairs
freq_dic = {}
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
for word in word_list2:
# remove punctuation marks
word = punctuation.sub("", word)
# form dictionary
try:
freq_dic[word] += 1
except:
freq_dic[word] = 1
print '-'*30
print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs
freq_list2 = [(val, key) for key, val in freq_dic.items()]
# sort by val or frequency
freq_list2.sort(reverse=True)
freq_list3 = list(freq_list2)
# display result
for freq, word in freq_list2:
print word, freq
f = open("wordfreq.txt", "w")
f.write( str(freq_list3) )
f.close()
输出结果是这样的
[(71, 'allah'), (65, 'ye'), (46, 'day'), (21, 'lord'), (20, 'truth'), (20, 'say'), (20, 'and')
这只是一个小样本,还有其他的停用词应该被去掉。希望能得到一些帮助。
1 个回答
4
试着在创建你的 word_list2 的时候,把单词前后的空格去掉。
word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]