Handling memory errors when processing a large number of words (>100 million) for LDA analysis

1 answer

User

Reply #1 · Posted 2024-04-26 18:09:12

LDA requires you to tokenize the documents into words and then build a word-frequency dictionary.
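
To make that concrete, here is a minimal sketch of those two steps on a couple of in-memory strings. The regex tokenizer and the names `tokenize` and `docs` are my own assumptions for illustration; a real pipeline would probably use whatever tokenizer its LDA library expects.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase, then keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

# Made-up sample documents, only for illustration.
docs = ["The cat sat on the mat.", "The dog sat too."]

word_counts = Counter()
for doc in docs:
    # Counter.update adds one to the count of every token it is given.
    word_counts.update(tokenize(doc))

print(word_counts["the"])  # 3
print(word_counts["sat"])  # 2
```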

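If the only output you need is the dictionary with the word counts, I would do the following:

Process the files one at a time in a loop. That way you only hold one file in memory at any moment. Process it, then move on to the next one: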

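from collections import Counter
from pathlib import Path

word_counts = Counter()
# for all files in your directory/directories ("corpus" and "*.txt" are placeholders):
for current_file in Path("corpus").rglob("*.txt"):
    with open(current_file, 'r', encoding='utf-8') as f:
        for line in f:
            # your logic to update the dictionary with the word count;
            # whitespace splitting is only a stand-in for a real tokenizer
            word_counts.update(line.split())
    # here the file is closed and the loop moves to the next one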

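Edit: when it comes to holding a very large dictionary in memory, keep in mind that Python reserves a lot of extra memory to keep a dict sparse; that is the price of fast lookups. So you will have to look for another way to store the key-value pairs, for example a list of tuples, at the cost of much slower lookups. This question is about exactly that, and some nice alternatives are described there.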
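
As a rough illustration of that trade-off, the sketch below freezes the counts into a list of (word, count) tuples sorted by word and looks words up by binary search with the standard `bisect` module. This is just one possible alternative, not necessarily one of those described in the linked question, and `build_sorted_table` and `lookup` are hypothetical names.

```python
from bisect import bisect_left

def build_sorted_table(counts):
    # Freeze a dict of counts into a list of (word, count) tuples sorted by word.
    return sorted(counts.items())

def lookup(table, word):
    # Binary search: O(log n) per lookup instead of the dict's O(1) average.
    # The 1-tuple (word,) sorts just before any (word, count) pair.
    i = bisect_left(table, (word,))
    if i < len(table) and table[i][0] == word:
        return table[i][1]
    return 0  # word not present

# Made-up counts, only for illustration.
counts = {"topic": 4, "model": 7, "word": 2}
table = build_sorted_table(counts)

print(lookup(table, "model"))    # 7
print(lookup(table, "missing"))  # 0
```

Note that a table like this is effectively read-only: inserting a new word means shifting list elements, so it only fits once counting is finished. Whether it actually saves memory for your data is something to measure rather than assume.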
