What is the most efficient way to index words in a document?

7 votes

2 answers

3706 views

Asked 2025-04-17 05:41

This came up in another discussion, but I thought it deserved its own question. Suppose you have a large list of sentences (on the order of 100,000):

[
"This is sentence 1 as an example",
"This is sentence 1 as another example",
"This is sentence 2",
"This is sentence 3 as another example ",
"This is sentence 4"
]

What is the best way to write the following function?

def GetSentences(word1, word2, position):
    return ""

The function takes two words, word1 and word2, and a position, and returns the list of all sentences in which word2 occurs position words after word1. For example:

GetSentences("sentence", "another", 3)

This call should return the indices of sentences 1 and 3. My current approach uses a dictionary, like this:

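from collections import defaultdict

# Index[word1][word2][offset] -> list of sentence indices
Index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

for sentenceIndex, sentence in enumerate(sentences):
    words = sentence.split()
    for index, word in enumerate(words):
        # i + 1 is how many words after `word` that `word2` occurs
        for i, word2 in enumerate(words[index + 1:]):
            Index[word][word2][i + 1].append(sentenceIndex)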

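But on a dataset of about 130 MB this quickly gets out of hand: my 48 GB of RAM is exhausted in under 5 minutes. I feel this must be a common problem, but I can't find any effective solution. Are there any suggestions that could help me solve it?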


2 Answers

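Here is how I did it in Python. If you need to do this more than once, though, a database management system (DBMS) is the better tool for the job. That said, this approach works reasonably well for me with a million rows.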

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4"
    ]

sentences = sentences * 200 * 1000

sentencesProcessed = []

def preprocess():
    global sentences
    global sentencesProcessed
    # split() with no argument splits on any run of whitespace and drops empty strings
    sentencesProcessed = [sentence.split() for sentence in sentences]

    # can deallocate sentences now
    sentences = None


def GetSentences(word1, word2, position):
    results = []
    for sentenceIndex, sentence in enumerate(sentencesProcessed):
        for wordIndex, word in enumerate(sentence[:-position]):
            if word == word1 and sentence[wordIndex + position] == word2:
                results.append(sentenceIndex)
    return results

def main():
    preprocess()
    results = GetSentences("sentence", "another", 3)
    print("Got", len(results), "results")

if __name__ == "__main__":
    main()

Answered 2025-04-17 by Python大师


Use a database to store the data.

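First, put all the sentences in one table (each sentence should have an ID). You can call this table sentences.
Second, create a table holding every word that appears in the sentences (call it words, and give each word an ID as well), and store the relationship between sentences and words in a third table (call it sentences_words; it should have two columns, preferably word_id and sentence_id).
When you want to find the sentences that contain all of the specified words, the job becomes simple:
1. First, look up the words you are searching for in the words table. The query might look like this: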
```
SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3');
```
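2. Next, find the sentence_id values in the sentences_words table that have the word_id values you need (the ones matching the words found in the words table). The initial query might look like this: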
```
SELECT `sentence_id`, `word_id` FROM `sentences_words`
WHERE `word_id` IN ([here goes list of words' ids]);
```
  The two queries can be combined into one:
```
SELECT `sentence_id`, `word_id` FROM `sentences_words`
WHERE `word_id` IN (
    SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3')
);
```
3. In Python, filter the results and return only the sentence_id values that have all of the word_id values you need.
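
The three steps above can be sketched end to end with Python's built-in sqlite3 module. This is only an illustration under assumptions: the answer does not name a database engine, so an in-memory SQLite database stands in for one, and the helper names `build_index` and `sentences_with_all_words` are made up for this example.

```python
import sqlite3
from collections import defaultdict


def build_index(sentences):
    """Load sentences into the three tables (in-memory SQLite is an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT);
        CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE sentences_words (word_id INTEGER, sentence_id INTEGER);
    """)
    for sentence_id, text in enumerate(sentences):
        conn.execute("INSERT INTO sentences (id, text) VALUES (?, ?)",
                     (sentence_id, text))
        # One relation row per distinct word in the sentence.
        for word in set(text.split()):
            conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
            (word_id,) = conn.execute(
                "SELECT id FROM words WHERE word = ?", (word,)).fetchone()
            conn.execute(
                "INSERT INTO sentences_words (word_id, sentence_id) VALUES (?, ?)",
                (word_id, sentence_id))
    return conn


def sentences_with_all_words(conn, wanted):
    """Steps 1 and 2 as one query, then step 3 (the filtering) in Python."""
    wanted = set(wanted)
    if not wanted:
        return []
    placeholders = ", ".join("?" for _ in wanted)
    rows = conn.execute(
        "SELECT sentence_id, word_id FROM sentences_words "
        "WHERE word_id IN (SELECT id FROM words WHERE word IN (%s))" % placeholders,
        tuple(wanted))
    found = defaultdict(set)
    for sentence_id, word_id in rows:
        found[sentence_id].add(word_id)
    # Each word has exactly one id, so a sentence contains every wanted word
    # only if it matched as many distinct ids as there are wanted words.
    return sorted(s for s, ids in found.items() if len(ids) == len(wanted))


sample = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]
print(sentences_with_all_words(build_index(sample), ["sentence", "another"]))  # [1, 3]
```

Row-by-row inserts keep the sketch short; for a dataset of the size in the question, batching the inserts in a single transaction and indexing word_id would likely matter.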

This is essentially a solution built on storing a large amount of data in the form best suited to it: a database.

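EDIT:

If you are only searching for two words, you can do more (almost everything) on the DBMS side.
Since you also need the difference in word positions, store each word's position in a third column of the sentences_words table (call it position), and when searching for matching words, compute the difference between the position values of the two words.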
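
A minimal sketch of this positional variant, again assuming SQLite through Python's sqlite3 module (the answer does not specify an engine). The function names `build_positional_index` and `get_sentences`, the index name `idx_sw_word`, and the self-join query are illustrative choices, not part of the original answer.

```python
import sqlite3


def build_positional_index(sentences):
    """Store one row per word occurrence, including its position in the sentence."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE sentences_words (
            word_id INTEGER, sentence_id INTEGER, position INTEGER
        );
        CREATE INDEX idx_sw_word ON sentences_words (word_id);
    """)
    for sentence_id, text in enumerate(sentences):
        for position, word in enumerate(text.split()):
            conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
            (word_id,) = conn.execute(
                "SELECT id FROM words WHERE word = ?", (word,)).fetchone()
            conn.execute("INSERT INTO sentences_words VALUES (?, ?, ?)",
                         (word_id, sentence_id, position))
    return conn


def get_sentences(conn, word1, word2, position):
    """Return ids of sentences where word2 occurs `position` words after word1."""
    rows = conn.execute(
        """
        SELECT DISTINCT a.sentence_id
        FROM sentences_words AS a
        JOIN sentences_words AS b ON a.sentence_id = b.sentence_id
        JOIN words AS w1 ON w1.id = a.word_id
        JOIN words AS w2 ON w2.id = b.word_id
        WHERE w1.word = ? AND w2.word = ?
          AND b.position - a.position = ?
        ORDER BY a.sentence_id
        """,
        (word1, word2, position))
    return [sentence_id for (sentence_id,) in rows]


sample = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]
conn = build_positional_index(sample)
print(get_sentences(conn, "sentence", "another", 3))  # [1, 3]
```

Here the database computes the position difference in the join, so Python only collects the matching sentence ids.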

Answered 2025-04-17 by Python大师

