How do I remove all instances of every element of set A from set B?

0 votes
7 answers
974 views
Asked 2025-04-16 17:22

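As you can see, when I open the file test.txt and put its words into a set, what I get back is the difference between that set and the common_words set. However, it only removes one instance of each word in common_words, not every occurrence. How can I fix this? I want to remove every word in common_words from title_words.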

from string import punctuation
from operator import itemgetter

N = 10
words = {}

linestring = open('test.txt', 'r').read()

# set A, want to remove these from set B
common_words = set(("if", "but", "and", "the", "when", "use", "to", "for"))

title = linestring

# set B, want to remove ALL words in set A from this set and store in keywords
title_words = set(title.lower().split())

keywords = title_words.difference(common_words)

words_gen = (word.strip(punctuation).lower() for line in keywords
                                             for word in line.split())

for word in words_gen:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.items(), key=itemgetter(1), reverse=True)[:N]

for word, frequency in top_words:
    print("%s: %d" % (word, frequency))

7 Answers

1

The difference() method is all you need, but your example looks slightly off.

title_words is a set, and sets do not have a strip() method.

Try this:

title_words = set(title.lower().split())
keywords = title_words.difference(common_words)
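
One possible reason the result can still look wrong, sketched here under an assumption: difference() removes every member that also appears in common_words, but a token with punctuation attached (for example "the," instead of "the") is a different string and survives. The helper name extract_keywords and the sample title below are made up for illustration and are not from the question.

```python
from string import punctuation

common_words = {"if", "but", "and", "the", "when", "use", "to", "for"}

def extract_keywords(title, stop_words):
    """Return the set of words in title that are not in stop_words."""
    # Strip punctuation before building the set so that "the," matches "the".
    cleaned = {word.strip(punctuation) for word in title.lower().split()}
    cleaned.discard("")  # a token made only of punctuation becomes empty
    return cleaned.difference(stop_words)

# Hypothetical sample input, standing in for the contents of test.txt.
title = "The quick fox, and the lazy dog; use the, force"
keywords = extract_keywords(title, common_words)
print(sorted(keywords))
```

Stripping first means the set only ever contains bare words, so a single difference() call is enough.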
1

I agree with senderle. Try this code:

for common_word in common_words:
    try:
        # set.remove() raises KeyError when the element is missing
        title_words.remove(common_word)
    except KeyError:
        print("The common word %s was not in title_words" % common_word)

That should do it.

Hope this helps.
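
If the missing-word message is not needed, a possible alternative sketch (not part of the loop above) is set.discard(), which silently ignores absent elements, or difference_update(), which removes them all in place in one call. The two sample sets here are invented for illustration.

```python
common_words = {"if", "but", "and", "the"}
title_words = {"the", "cat", "and", "hat"}

# discard() does nothing when the element is absent, so no try/except is needed.
for common_word in common_words:
    title_words.discard(common_word)

# difference_update() performs the same in-place removal in a single call.
other_words = {"the", "cat", "and", "hat"}
other_words.difference_update(common_words)

print(sorted(title_words))
print(sorted(other_words))
```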

0

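I recently wrote some code that does something similar to what you describe, though in a style quite different from yours. Maybe it will help.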

import string
import sys

def main():
    # load the stop words, one per line
    stopwords = set()
    with open('stop_words.txt', 'r') as stopf:
        for s in stopf:
            stopwords.add(s.strip())

    # read the whole input file named on the command line
    with open(sys.argv[1], 'r') as f:
        filedata = f.read()
    words = filedata.split()
    histogram = {}
    count = 0
    for word in words:
        # normalize: drop surrounding punctuation, then lowercase
        word = word.strip(string.punctuation).lower()
        if word in stopwords:
            continue
        histogram[word] = histogram.get(word, 0) + 1
        # print a progress marker after every 1000 counted words
        count = (count + 1) % 1000
        if count == 0:
            print('*', end=' ')
    # sort by descending count and print the top 100
    flist = []
    for word, count in histogram.items():
        flist.append([count, word])
    flist.sort()
    flist.reverse()
    for pair in flist[0:100]:
        print("%30s: %4d" % (pair[1], pair[0]))

main()
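
For comparison, the same stop-word-filtered count can be sketched with collections.Counter from the standard library. This is an alternative under assumptions, not the code above: the function name top_words and the sample text are hypothetical, and the input is a string rather than a file.

```python
from collections import Counter
from string import punctuation

def top_words(text, stop_words, n=10):
    """Count words in text, skipping stop_words, and return the n most common."""
    counts = Counter()
    for raw in text.split():
        # Normalize each token the same way before testing and counting it.
        word = raw.strip(punctuation).lower()
        if word and word not in stop_words:
            counts[word] += 1
    return counts.most_common(n)

# Hypothetical sample input for illustration.
sample = "The cat and the hat. The cat sat; the hat fell!"
result = top_words(sample, {"the", "and"}, n=2)
for word, frequency in result:
    print("%s: %d" % (word, frequency))
```

Filtering while counting, instead of building a set first, keeps the frequencies intact, since a set would collapse repeated words into one entry.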
