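Finding words from a word list in a text file
I'm trying to write a program that reads a text file and sorts the comments in it into positive, negative, or neutral. I've tried many approaches, but none of them has worked. I can search for a single word easily, but searching for several words fails. Also, I have an if statement, but I had to use else twice below it because it wouldn't let me use elif. I'd be very grateful if you could tell me where I went wrong. Thanks in advance.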
middle = open("middle_test.txt", "r")
positive = []
negative = [] #the empty lists
neutral = []
pos_words = ["GOOD", "GREAT", "LOVE", "AWESOME"] #the lists I'd like to search
neg_words = ["BAD", "HATE", "SUCKS", "CRAP"]
for tweet in middle:
    words = tweet.split()
    if pos_words in words: #doesn't work
        positive.append(words)
    else: #can't use elif for some reason
        if 'BAD' in words: #works but is only 1 word not list
            negative.append(words)
        else:
            neutral.append(words)
5 Answers
0
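You have a few problems. First, write functions that read the comments from the file and split each comment into words. Build those first and check that they behave the way you expect. The main program can then look like this: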
for comment in get_comments(file_name):
    words = get_words(comment)
    classified = False
    # at first look for negative comment
    for neg_word in NEGATIVE_WORDS:
        if neg_word in words:
            classified = True
            negatives.append(comment)
            break
    # now look for positive
    if not classified:
        for pos_word in POSITIVE_WORDS:
            if pos_word in words:
                classified = True
                positives.append(comment)
                break
    if not classified:
        neutral.append(comment)
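The helpers get_comments and get_words are left undefined in the loop above. Here is a minimal Python 3 sketch of how they might look, assuming one comment per line and upper-case word lists as in the question. The helper bodies and the classify function are illustrative guesses, not the answerer's code:

```python
import string

# Word lists taken from the question; stored as sets for fast membership tests.
NEGATIVE_WORDS = {"BAD", "HATE", "SUCKS", "CRAP"}
POSITIVE_WORDS = {"GOOD", "GREAT", "LOVE", "AWESOME"}


def get_comments(file_name):
    """Yield each non-empty line of the file as one comment (assumed layout)."""
    with open(file_name) as f:
        for line in f:
            line = line.strip()
            if line:
                yield line


def get_words(comment):
    """Split a comment into upper-case words with surrounding punctuation removed."""
    words = (w.strip(string.punctuation).upper() for w in comment.split())
    return [w for w in words if w]


def classify(comment):
    """Return 'negative', 'positive' or 'neutral'; negative is checked first, as above."""
    words = get_words(comment)
    if any(w in NEGATIVE_WORDS for w in words):
        return "negative"
    if any(w in POSITIVE_WORDS for w in words):
        return "positive"
    return "neutral"
```

With these in place, the main loop reduces to calling classify(comment) for every comment from get_comments("middle_test.txt") and appending the comment to the matching list.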
0
You can use the code below to count the positive and negative words in a piece of text:
from collections import Counter

def readwords(filename):
    # one word per line; the with block closes the file when done
    with open(filename) as f:
        return [line.rstrip() for line in f]

# >cat positive.txt
# good
# awesome
# >cat negative.txt
# bad
# ugly
positive = readwords('positive.txt')
negative = readwords('negative.txt')
print positive
print negative
paragraph = 'this is really bad and in fact awesome. really awesome.'
count = Counter(paragraph.split())
pos = 0
neg = 0
for key, val in count.iteritems():
    key = key.rstrip('.,?!\n') # removing possible punctuation signs
    if key in positive:
        pos += val
    if key in negative:
        neg += val
print pos, neg
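The code above is Python 2 and strips punctuation only after counting, so 'awesome.' and 'awesome' sit under separate keys until the loop merges them. As a rough Python 3 sketch of the same idea, the words can be normalized before they reach the Counter. The function name count_sentiment is made up for this example, and lower-casing is an added assumption:

```python
from collections import Counter
import string


def count_sentiment(text, positive, negative):
    """Return (pos, neg): how many words of text appear in each word list."""
    # Lower-case and strip punctuation first, so "Awesome." and "awesome" share one key.
    words = (w.strip(string.punctuation).lower() for w in text.split())
    counts = Counter(w for w in words if w)
    # A Counter returns 0 for words it has never seen, so no key check is needed.
    pos = sum(counts[w] for w in positive)
    neg = sum(counts[w] for w in negative)
    return pos, neg


paragraph = "this is really bad and in fact awesome. really awesome."
print(count_sentiment(paragraph, ["good", "awesome"], ["bad", "ugly"]))  # (2, 1)
```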
0
Be careful: open() returns a file object.
>>> f = open('workfile', 'w')
>>> print f
<open file 'workfile', mode 'w' at 80a0960>
Use this to read a line from it:
>>> f.readline()
'This is the first line of the file.\n'
Then use set intersection:
positive += list(set(pos_words) & set(tweet.split()))
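To make the intersection step concrete, here is a small Python 3 sketch. The helper name matching_words and the sample tweets are invented for illustration, and upper-casing the tweet is an assumption made so that it matches the question's upper-case word list:

```python
pos_words = {"GOOD", "GREAT", "LOVE", "AWESOME"}


def matching_words(tweet, vocabulary):
    """Return the sorted vocabulary words that occur in the tweet (case-insensitive)."""
    return sorted(vocabulary & set(tweet.upper().split()))


print(matching_words("love this great great phone", pos_words))  # ['GREAT', 'LOVE']
print(matching_words("nothing to see here", pos_words))          # []
```

An empty result is falsy, so `if matching_words(tweet, pos_words):` works as the test for a positive tweet. Note that a set drops repeated words, and a word with punctuation attached (such as "great!") will not match unless it is stripped first.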
0
You are not reading the lines from the file. And this line
if pos_words in words:
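checks, I think, whether the list ["GOOD", "GREAT", "LOVE", "AWESOME"] itself is an element of words. That is, you are searching the list of words for the whole list pos_words = ["GOOD", "GREAT", "LOVE", "AWESOME"] rather than for each of the words in it.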
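A short Python 3 sketch shows the difference. The function name label and the sample sentence are made up for this example. It also uses elif, which works once the branches are indented consistently; that indentation was the cause of the elif problem in the question is only a guess:

```python
pos_words = ["GOOD", "GREAT", "LOVE", "AWESOME"]
neg_words = ["BAD", "HATE", "SUCKS", "CRAP"]


def label(words):
    """Classify a list of upper-case words as positive, negative or neutral."""
    if any(w in pos_words for w in words):
        return "positive"
    elif any(w in neg_words for w in words):  # elif must line up with the if above
        return "negative"
    else:
        return "neutral"


words = "THIS IS GOOD".split()
print(pos_words in words)               # False: no element of words is itself a list
print(pos_words in [words, pos_words])  # True: here the list is an element
print(label(words))                     # positive
```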
1
Use a Counter, see http://docs.python.org/2/library/collections.html#collections.Counter:
import urllib2
from collections import Counter
from string import punctuation

# data from http://inclass.kaggle.com/c/si650winter11/data
target_url = "http://goo.gl/oMufKm"
data = urllib2.urlopen(target_url).read()
word_freq = Counter([i.lower().strip(punctuation) for i in data.split()])
pos_words = ["good", "great", "love", "awesome"]
neg_words = ["bad", "hate", "sucks", "crap"]
for i in pos_words:
    # a Counter returns 0 for a missing word, so no try/except is needed
    print i, word_freq[i]
[Output]:
good 638
great 1082
love 7716
awesome 2032