如何将没有空格的文本拆分成单词列表？

1条回答

网友

1楼 · 发布于 2024-05-23 18:23:52

一个简单的算法在应用于实际数据时不会有好的效果。这里有一个20行的算法，它利用相对词频给出真实单词文本的准确结果。

（如果你想要一个不使用词频的原始问题的答案，你需要细化“最长单词”的确切含义：是最好有一个20个字母的单词和十个3个字母的单词，还是最好有五个10个字母的单词？一旦你确定了一个精确的定义，你只需要改变定义wordcost的行来反映预期的含义

这个想法

最好的方法是建立输出分布模型。一个好的第一近似是假设所有单词都是独立分布的。那么你只需要知道所有单词的相对频率。可以合理地假设它们遵循Zipf定律，即单词列表中的秩n的单词具有大约1/（nlogn）的概率，其中n是字典中的单词数。

一旦修复了模型，就可以使用动态编程来推断空间的位置。最有可能的句子是最大化每个单词的概率乘积的句子，用动态规划很容易计算出来。我们不直接使用概率，而是使用定义为概率倒数对数的成本来避免溢出。

守则

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

你可以用它

s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))

结果

我使用的this quick-and-dirty 125k-word dictionary I put together来自维基百科的一小部分。

Before: thumbgreenappleactiveassignmentweeklymetaphor.
After: thumb green apple active assignment weekly metaphor.

Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearen odelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetapho rapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquery whetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.
After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.

Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.
After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.

正如你所看到的，它本质上是完美的。最重要的是要确保你的单词列表被训练成与你实际遇到的相似的语料库，否则结果会非常糟糕。

优化

该实现消耗了线性数量的时间和内存，因此效率相当高。如果需要进一步加速，可以从单词列表中构建后缀树，以减小候选集的大小。

如果您需要处理一个非常大的连续字符串，可以合理地分割该字符串，以避免过度使用内存。例如，您可以处理10000个字符加上两边1000个字符的文本块，以避免边界效应。这将使内存使用率保持在最低水平，并且几乎肯定不会影响质量。

网友
2楼 · 发布于 2024-05-23 18:23:52

下面是使用递归搜索的解决方案：
def find_words(instring, prefix = '', words = None): if not instring: return [] if words is None: words = set() with open('/usr/share/dict/words') as f: for line in f: words.add(line.strip()) if (not prefix) and (instring in words): return [instring] prefix, suffix = prefix + instring[0], instring[1:] solutions = [] # Case 1: prefix in solution if prefix in words: try: solutions.append([prefix] + find_words(suffix, '', words)) except ValueError: pass # Case 2: prefix not in solution try: solutions.append(find_words(suffix, prefix, words)) except ValueError: pass if solutions: return sorted(solutions, key = lambda solution: [len(word) for word in solution], reverse = True)[0] else: raise ValueError('no solution') print(find_words('tableapplechairtablecupboard')) print(find_words('tableprechaun', words = set(['tab', 'table', 'leprechaun'])))
收益率
['table', 'apple', 'chair', 'table', 'cupboard'] ['tab', 'leprechaun']

网友
3楼 · 发布于 2024-05-23 18:23:52

基于top answer中的出色工作，我创建了一个pip包，以便于使用。

>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']

要安装，请运行pip install wordninja。

唯一的区别很小。这将返回一个list，而不是str，它在python3中工作，它包括单词列表并正确拆分，即使存在非alpha字符（如下划线、破折号等）。

再次感谢普通人！

https://github.com/keredson/wordninja

这个想法

守则

结果

优化

相关问题更多 >

编程相关推荐

热门问题

热门文章