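How do I remove every occurrence of each element of set A from set B?
As you can see, when I open test.txt and put its words into a set, I get back the difference between that set and the common_words set. However, it only removes one instance of each word in common_words rather than every occurrence. How can I fix this? I want to remove every word in common_words from title_words.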
from string import punctuation
from operator import itemgetter
N = 10
words = {}
linestring = open('test.txt', 'r').read()
# set A, want to remove these from set B
common_words = set(("if", "but", "and", "the", "when", "use", "to", "for"))
title = linestring
# set B, want to remove ALL words in set A from this set and store in keywords
title_words = set(title.lower().split())
keywords = title_words.difference(common_words)
words_gen = (word.strip(punctuation).lower() for line in keywords
             for word in line.split())
for word in words_gen:
    words[word] = words.get(word, 0) + 1
top_words = sorted(words.items(), key=itemgetter(1), reverse=True)[:N]
for word, frequency in top_words:
    print("%s: %d" % (word, frequency))
7 Answers
1
All you need is the difference() method, but your example looks slightly off: title_words is a set, and sets have no strip() method.
Try this:
title_words = set(title.lower().split())
keywords = title_words.difference(common_words)
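One possible reason some common words seem to survive (an assumption, since the contents of test.txt are not shown): split() leaves punctuation attached, so a token such as "the," is not equal to "the" and difference() leaves it alone. A minimal sketch that strips punctuation before building the set; the sample title string is invented for illustration:

```python
from string import punctuation

common_words = {"if", "but", "and", "the", "when", "use", "to", "for"}

# Hypothetical sample text standing in for the contents of test.txt.
title = "The cat and the dog: when to use the leash, and the collar."

# Strip punctuation from each token *before* building the set, so that
# "the," and "the" collapse into the same element.
title_words = set(word.strip(punctuation) for word in title.lower().split())
title_words.discard("")  # tokens made only of punctuation become empty strings

keywords = title_words.difference(common_words)
print(sorted(keywords))
```

Keep in mind that a set holds each word only once, so any frequency count done over keywords will report 1 for every word.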
1
I agree with senderle. Try this code:
for common_word in common_words:
    try:
        # remove() raises KeyError when the word is not in the set
        title_words.remove(common_word)
    except KeyError:
        print("The common word %s was not in title_words" % common_word)
That should do it.
Hope this helps.
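If the message for missing words is not needed, the same in-place removal can probably be written without try/except. A small sketch using the built-in set methods discard() and difference_update(); the sample sets are made up for illustration:

```python
common_words = {"if", "but", "and", "the"}
title_words = {"the", "quick", "fox", "and", "hound"}  # hypothetical sample data

# discard() removes an element if it is present and does nothing otherwise,
# so no exception handling is required.
by_discard = set(title_words)
for common_word in common_words:
    by_discard.discard(common_word)

# difference_update() removes every element found in the other set, in place.
by_update = set(title_words)
by_update.difference_update(common_words)

print(sorted(by_discard))
print(sorted(by_update))
```

Both variants modify the set they are called on, whereas difference() returns a new set and leaves the original untouched.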
0
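I recently wrote some code that does something similar to what you describe, though in a rather different style from yours. Maybe it will help.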
import string
import sys
def main():
    # get some stop words
    stopf = open('stop_words.txt', "r")
    stopwords = {}
    for s in stopf:
        stopwords[s.strip()] = 1
    # read the file named on the command line and split it into words
    infile = open(sys.argv[1], "r")
    filedata = infile.read()
    words = filedata.split()
    histogram = {}
    count = 0
    for word in words:
        word = word.strip(string.punctuation)
        word = word.lower()
        if word in stopwords:
            continue
        histogram[word] = histogram.get(word, 0) + 1
        # print a progress marker every 1000 counted words
        count = (count+1) % 1000
        if count == 0:
            sys.stdout.write('* ')
    # sort [count, word] pairs from most to least frequent
    flist = []
    for word, count in histogram.items():
        flist.append([count, word])
    flist.sort()
    flist.reverse()
    for pair in flist[0:100]:
        print("%30s: %4d" % (pair[1], pair[0]))
main()
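For comparison, here is a shorter sketch of the same stop-word-filtered histogram built on collections.Counter. The inline text and stop-word set are invented for illustration instead of being read from files. Filtering the full word list, rather than a set of unique words, keeps repeated words so the counts reflect real frequencies:

```python
from collections import Counter
from string import punctuation

stopwords = {"the", "and", "to"}  # hypothetical stop-word list
text = "The fox and the hound ran to the river. The fox swam; the hound barked."

# Normalise every token, then drop stop words and empty strings.
words = (w.strip(punctuation).lower() for w in text.split())
histogram = Counter(w for w in words if w and w not in stopwords)

# most_common(n) returns the n highest counts as (word, count) pairs.
for word, count in histogram.most_common(2):
    print("%30s: %4d" % (word, count))
```

Counter.most_common() replaces the manual build, sort, and reverse of the [count, word] list.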