从一个字符串中获取不在另一个列表中的单词列表

2 投票

4 回答

572 浏览

提问于 2025-04-18 11:15

我有一个非常长的字符串，叫做 tekst（从一个文件中读取，大小为600 MB），还有一个包含11,000个单词的列表，叫做 nlwoorden。我想要从 tekst 中提取出所有的内容，但不包括 nlwoorden 中的单词。

belangrijk=[woord for woord in tekst.split() if woord not in nlwoorden]

这样做可以得到我想要的结果。不过，这样计算起来显然非常慢。有没有更高效的方法呢？

谢谢！

字符串处理数据处理高效算法单词过滤文本分析

4 个回答

这段代码：

woord not in nlwoorden

每次调用时会花费O(N)的时间，其中N = len(nlwoorden)。

所以你的列表推导式：

belangrijk=[woord for woord in tekst.split() if woord not in nlwoorden]

总共需要O(N * M)的时间，其中M = len(tekst.split())。

这是因为nlwoorden是一个列表，而不是一个集合。要检查一个元素是否在一个无序的列表中，使用简单的方法的话，最坏情况下你得遍历整个列表。

这就是为什么在处理大输入时，你的代码运行得很慢。

如果你使用的是哈希集合，那么在集合构建完成后，检查一个元素是否存在就会很快，几乎是固定的时间。

所以，简单的代码形式可以是这样的：

import io

def words(fileobj):
    for line in fileobj:             # takes care of buffering large files, chunks at a time
        for word in line.split():
            yield word

# first, build the set of whitelisted words
wpath = 'whitelist.txt'
wset = set()
with io.open(wpath, mode='rb') as w:
    for word in words(w):
        wset.add(word)

def emit(word):
    # output 'word' - to a list, to another file, to a pipe, etc
    print word

fpath = 'input.txt'
with io.open(fpath, mode='rb') as f:
    for word in words(f):               # total run time - O(M) where M = len(words(f))
        if word not in wset:            # testing for membership in a hash set - O(1)
            emit(word)

回答于 2025-04-18 由 Python大师

分享举报

逐行读取和处理 "woorden.txt" 文件

把所有的 nlwoorden 加载到一个集合里（这样比放到列表里更有效率）
分部分读取这个大文件，对每一部分进行拆分，只把不在 lnwoorden 中的内容写入结果文件。

假设你这个600 MB的大文件里的每一行都不是特别长（不是600 MB长），我会这样做：

nlwoorden = set()
with open("nlwoorden.txt") as f:
    for line in f:
        nlwoorden.update(line.split())

with open("woorden.txt") as f, with open("out.txt", "w") as fo:
    for line in f:
        newwords = set(line.split())
        newwords.difference_update(nlwoorden)
        fo.write(" ".join(newwords)

总结

这个方法不会消耗太多内存，因为你不会一次性把 "woorden.txt" 的所有数据都读进来。

如果你的文件不是按行分割的，你需要改变读取文件部分的方式。但我假设你的文件是有换行符的。

回答于 2025-04-18 由 Python大师

分享举报

我觉得最简单的方法就是使用集合。比如，

s = "This is a test"
s2 = ["This", "is", "another", "test"]
set(s.split()) - set(s2)

# returns {'a'}

不过，考虑到你的文本大小，使用生成器可能会更好，这样可以避免一次性把所有东西都放在内存里，比如，

import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()

[word for word in itersplit(s) if word not in s2]

# returns ['a']

回答于 2025-04-18 由 Python大师

分享举报

使用基于集合的解决方案可以让你在整个过程中花费的时间是 O(len(nlwoorden))。这意味着处理的时间与单词的数量成正比。为了创建这两个集合，还需要额外的 O(len(nlwoorden)) + O(len(tekst)) 的时间。

所以你想要的代码片段基本上就是评论中提到的那个：

belangrijk=list(set(tekst.split()) - set(nlwoorden))

（假设你最后想把结果再变成一个列表）

回答于 2025-04-18 由 Python大师

分享举报

从一个字符串中获取不在另一个列表中的单词列表

4 个回答

逐行读取和处理 "woorden.txt" 文件

总结

撰写回答