Using NLTK to split words that OCR ran together

5 votes

1 answer

1635 views

Asked 2025-04-18 04:23

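I'm using NLTK to process text extracted from PDF files. I can recover most of the text, but in many places the space between words was not captured, so I end up with tokens like ifI instead of if I, thatposition instead of that position, and andhe's instead of and he's.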

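My question: how can I use NLTK to find the words it doesn't recognize and check whether a "nearby" combination of words is more likely? Is there a more elegant way to do this than walking through each unrecognized word character by character, splitting it, and checking whether the two halves are both known words?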
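The brute-force check described above can be sketched roughly as follows. This is only an illustration, not an NLTK feature: `KNOWN_WORDS` is a hypothetical toy vocabulary (in practice it might be built from a word list such as NLTK's `words` corpus), and `split_unknown` and `fix_text` are made-up helper names.

```python
import re

# Hypothetical toy vocabulary, used only for illustration. A real run would
# need a much larger word list (for example one built from an NLTK corpus).
KNOWN_WORDS = {"if", "i", "am", "in", "that", "position", "and", "he's"}


def split_unknown(token, vocab=KNOWN_WORDS):
    """Return (left, right) if an unknown token splits into two known words.

    Returns None when the token is already known or no split works.
    """
    if token.lower() in vocab:
        return None  # already a known word, nothing to fix
    # Try every split point from left to right and take the first that works.
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left.lower() in vocab and right.lower() in vocab:
            return left, right
    return None


def fix_text(text, vocab=KNOWN_WORDS):
    """Insert a space into every token that splits into two known words."""
    def repl(match):
        parts = split_unknown(match.group(0), vocab)
        return " ".join(parts) if parts else match.group(0)

    # Treat runs of letters and apostrophes as tokens, so "he's" stays whole.
    return re.sub(r"[A-Za-z']+", repl, text)


print(fix_text("ifI am in thatposition andhe's"))
```

Taking the first valid split is a simplification: it handles only one missing space per token and does not rank competing splits, which is exactly the part where something frequency-based or dictionary-backed would be more robust.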

text-processing natural-language-processing nltk ocr tokenization lexical-analysis

1 Answer

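I'd suggest looking at pyenchant, since it is a more reliable solution for this kind of problem. It has to be downloaded and installed first. Once it is installed, here is an example of how to get the result: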

>>> from enchant.checker import SpellChecker
>>> text = "IfI am inthat position, Idon't think I will."  # note the missing spaces
>>> checker = SpellChecker("en_US")
>>> checker.set_text(text)
>>> for error in checker:
...     for suggestion in error.suggest():
...         # Accept a suggestion only if it has exactly the same characters,
...         # in the same order, as the misspelled word once spaces are ignored.
...         if error.word.replace(' ', '') == suggestion.replace(' ', ''):
...             error.replace(suggestion)
...             break
...
>>> checker.get_text()  # the text is now fixed
"If I am in that position, I don't think I will."

Answered 2025-04-18 by Python大师
