How do I remove every instance of each element of set A from set B?

0 votes

7 answers

974 views

Asked 2025-04-16 17:22

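As you can see, when I open test.txt and put its words into a set, what I get back is the difference between that set and the common_words set. However, it removes only one instance of each word in common_words rather than every occurrence. How can I achieve this? I want to remove every word in common_words from title_words.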

from string import punctuation
from operator import itemgetter

N = 10
words = {}

linestring = open('test.txt', 'r').read()

# set A, want to remove these from set B
common_words = set(("if", "but", "and", "the", "when", "use", "to", "for"))

title = linestring

# set B, want to remove ALL words in set A from this set and store in keywords
title_words = set(title.lower().split())

keywords = title_words.difference(common_words)

words_gen = (word.strip(punctuation).lower() for line in keywords
                                             for word in line.split())

for word in words_gen:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.items(), key=itemgetter(1), reverse=True)[:N]

for word, frequency in top_words:
    print("%s: %d" % (word, frequency))

Tags: set operations, data processing, text analysis, word filtering, set difference

7 Answers

All you need is the difference() method, but your example seems to have a problem.

title_words is a set, and sets have no strip() method.

Try this:

title_words = set(title.lower().split())
keywords = title_words.difference(common_words)
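
One possible reason common words seem to survive difference() in the question's code is that punctuation is stripped only after the filtering step, so a token such as "the," is not equal to "the" when the difference is taken. Below is a minimal sketch of that effect, assuming this is what happens with the contents of test.txt; the sample title string and the names raw_words, leaked, and clean_words are made up for illustration.

```python
from string import punctuation

common_words = {"if", "but", "and", "the", "when", "use", "to", "for"}

# Illustrative text; the real input would come from test.txt.
title = "If the cat, and the hat; use the, hat!"

# Filtering first: "the," is a different string from "the", so it is kept,
# and it turns back into "the" once punctuation is stripped afterwards.
raw_words = set(title.lower().split())
leaked = {w.strip(punctuation) for w in raw_words - common_words} & common_words

# Normalizing first: every form of a common word is removed by the difference.
clean_words = {w.strip(punctuation) for w in title.lower().split()}
keywords = clean_words - common_words

print(sorted(leaked))
print(sorted(keywords))
```

Stripping punctuation before calling difference() is the only change here; the set difference itself already removes every matching element.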

Answered 2025-04-16 by Python大师


I agree with senderle. Try this code:

for common_word in common_words:
    try:
        title_words.remove(common_word)
    except KeyError:
        print("The common word %s was not in title_words" % common_word)

That should do it.

Hope this helps.
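
The same loop can also be written without exception handling. This is a minimal sketch, assuming it is acceptable to modify the set in place; the sample sets are made up for illustration.

```python
common_words = {"if", "but", "and", "the", "when", "use", "to", "for"}
title_words = {"the", "quick", "fox", "and", "hound"}  # illustrative data

# set.discard() removes an element if it is present and does nothing
# otherwise, so no try/except is needed.
for common_word in common_words:
    title_words.discard(common_word)

# A single call that removes every element of common_words in place.
other_words = {"use", "the", "force"}  # illustrative data
other_words.difference_update(common_words)  # same as: other_words -= common_words

print(sorted(title_words))
print(sorted(other_words))
```

Unlike difference(), which returns a new set, both variants here change the existing set.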

Answered 2025-04-16 by Python大师


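I recently wrote some code that does something similar to what you describe, although in a very different style from yours. Maybe it will help.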

import string
import sys

def main():
    # Load the stop words, one per line.
    with open('stop_words.txt', 'r') as stopf:
        stopwords = set(s.strip() for s in stopf)

    # Read the whole input file and split it on whitespace.
    with open(sys.argv[1], 'r') as infile:
        words = infile.read().split()

    histogram = {}
    count = 0
    for word in words:
        # Normalize before the stop-word check so "The," matches "the".
        word = word.strip(string.punctuation).lower()
        if word in stopwords:
            continue
        histogram[word] = histogram.get(word, 0) + 1
        # Print a progress marker after every 1000 counted words.
        count = (count + 1) % 1000
        if count == 0:
            print('*', end=' ')

    # Sort (count, word) pairs from most to least frequent.
    flist = sorted(((n, w) for w, n in histogram.items()), reverse=True)
    for n, w in flist[:100]:
        print("%30s: %4d" % (w, n))

main()
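
For comparison, the histogram and the manual sort could be replaced by collections.Counter from the standard library. This is a rough sketch rather than a drop-in replacement: top_words is a hypothetical helper, the sample text and stop words are made up, and unlike the code above it skips tokens that are empty after stripping punctuation.

```python
from collections import Counter
import string

def top_words(text, stopwords, n=100):
    """Return the n most frequent non-stop words as (word, count) pairs.

    Hypothetical helper; the name and signature are illustrative only.
    """
    # Normalize each token before the stop-word check.
    normalized = (w.strip(string.punctuation).lower() for w in text.split())
    # Count from the full token stream so repeated words are tallied.
    counts = Counter(w for w in normalized if w and w not in stopwords)
    return counts.most_common(n)

sample = "To be or not to be: the question, the answer, the end."
for word, count in top_words(sample, {"to", "the", "or"}, n=3):
    print("%30s: %4d" % (word, count))
```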

Answered 2025-04-16 by Python大师

