Is there an easy way to generate a list of probable words from an unspaced sentence in Python?

Asked 2025-04-17 18:48

I have some text:

 s="Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"

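I want to split this text into individual words. I took a quick look at enchant and nltk but didn't find anything that looked useful. If I had time, I would look into writing a dynamic program that uses enchant to check whether a word is English. I would have thought a tool for this already exists online. Am I wrong?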

Tags: text processing, natural language processing, dynamic programming, spell checking, word splitting

2 Answers

This problem comes up all the time in Asian-language NLP. If you have a dictionary, you can use http://code.google.com/p/mini-segmenter/ (disclaimer: I wrote it, hope you don't mind).

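Note that the search space can get very large, because a sentence written in alphabetic English contains far more characters than the same sentence written in syllabic Chinese or Japanese.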
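To get a feel for how fast that search space grows: a string of n characters has n - 1 gaps, and each gap is either a split point or not, so there are 2^(n-1) candidate segmentations before any dictionary filtering. A tiny sketch (the function name is made up for illustration):

```python
def count_candidate_splits(text):
    """Number of ways to cut text into non-empty pieces, ignoring any dictionary."""
    n = len(text)
    # Each of the n - 1 gaps between characters is either a cut or not.
    return 2 ** (n - 1) if n else 1
```

For the 15-character string "expertsexchange" this already gives 16384 candidates, which is why a dictionary and some pruning are needed.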

Answered 2025-04-17 by Python大师

Greedy approach using a trie

Try this using Biopython (pip install biopython). The code is written for Python 2 and needs a Biopython release that still includes Bio.trie:

from Bio import trie
import string


def get_trie(dictfile='/usr/share/dict/american-english'):
    tr = trie.trie()
    with open(dictfile) as f:
        for line in f:
            word = line.rstrip()
            try:
                word = word.encode(encoding='ascii', errors='ignore')
                tr[word] = len(word)
                assert tr.has_key(word), "Missing %s" % word
            except UnicodeDecodeError:
                pass
    return tr


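def get_trie_word(tr, s):
    # Greedy step: try the longest prefix of s first, then shorter ones.
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    # No dictionary word starts at the beginning of s.
    return None, s

def main(s):
    tr = get_trie()
    while s:
        word, s = get_trie_word(tr, s)
        if word is None:
            # Stop instead of looping forever on an unmatched remainder.
            print "unmatched: %s" % s
            break
        print word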

if __name__ == '__main__':
    s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
    s = s.strip(string.punctuation)
    s = s.replace(" ", '')
    s = s.lower()
    main(s)

Result

>>> if __name__ == '__main__':
...     s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
...     s = s.strip(string.punctuation)
...     s = s.replace(" ", '')
...     s = s.lower()
...     main(s)
... 
image
classification
methods
can
be
roughly
divided
into
two
broad
families
of
approaches
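
If Bio.trie is not available, the same greedy longest-prefix idea can be sketched in plain Python 3 with a set standing in for the trie. This is an illustration, not the code above: the tiny WORDS set and the name greedy_split are made up here, and in practice you would load a real word list.

```python
# Hypothetical mini-dictionary, just enough for the example sentence.
WORDS = {"image", "classification", "methods", "can", "be", "roughly",
         "divided", "into", "two", "broad", "families", "of", "approaches"}


def greedy_split(text, words=WORDS):
    """Repeatedly peel off the longest dictionary word that prefixes text.

    Returns (list_of_words, unmatched_remainder).
    """
    result = []
    while text:
        # Try the longest prefix first, then progressively shorter ones.
        for end in range(len(text), 0, -1):
            if text[:end] in words:
                result.append(text[:end])
                text = text[end:]
                break
        else:
            # No prefix is a known word: stop and report what is left.
            return result, text
    return result, ""
```

A set lookup per prefix is slower than walking a trie on a large dictionary, but it keeps the sketch dependency-free.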

Caveat

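There are degenerate cases in English that this greedy approach will not handle, for example when the longest matching prefix leaves a remainder that cannot be split into words. You need backtracking to deal with those, but this should get you started.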
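
A minimal sketch of what such backtracking could look like in plain Python 3, with memoization so each start position is solved only once. The word set and the name split_with_backtracking are assumptions made for illustration, not part of the code above.

```python
from functools import lru_cache

# Hypothetical mini-dictionary chosen so that the greedy choice fails.
WORDS = frozenset({"the", "they", "youth", "event"})


def split_with_backtracking(text, words=WORDS):
    """Return one full segmentation of text into dictionary words, or None."""

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(text):
            return ()  # Reached the end: nothing left to split.
        # Prefer longer words first, but fall back to shorter ones.
        for end in range(len(text), start, -1):
            if text[start:end] in words:
                rest = solve(end)
                if rest is not None:
                    return (text[start:end],) + rest
        return None  # No word starting here leads to a full split.

    found = solve(0)
    return list(found) if found is not None else None
```

With this word set, a greedy pass over "theyouthevent" takes "they" and gets stuck on "outhevent", while the backtracking version falls back to "the" and finishes with "youth" and "event". The recursion is only suitable for short inputs; very long strings would need an iterative version.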

Obligatory test

>>> main("expertsexchange")
experts
exchange