An autocomplete model that is easy to integrate with Flask applications
Markov Autocomplete
A Hidden Markov Model to generate autocomplete suggestions.
How to use
The model can be trained on your own list of sentences.
For example, to train it on the first two paragraphs of Robinson Crusoe:
from markov_autocomplete.autocomplete import Autocomplete
sentences = ["I WAS born in the year 1632, in the city of York, of a good family, though not of that country, my father being a foreigner of Bremen, who settled first at Hull. He got a good estate by merchandise, and leaving off his trade, lived afterwards at York, from whence he had married my mother, whose relations were named Robinson, a very good family in that country, and from whom I was called Robinson Kreutznaer; but, by the usual corruption of words in England, we are now called - nay we call ourselves and write our name - Crusoe; and so my companions always called me.", "I had two elder brothers, one of whom was lieutenant-colonel to an English regiment of foot in Flanders, formerly commanded by the famous Colonel Lockhart, and was killed at the battle near Dunkirk against the Spaniards. What became of my second brother I never knew, any more than my father or mother knew what became of me."]
ac = Autocomplete(model_path="ngram", sentences=sentences, n_model=3, n_candidates=10, match_model="middle", min_freq=0, punctuations="", lowercase=True)
ac.predictions("country")
How it works
Given an input string composed of n words w_1, …, w_n, the model predicts the following word w_{n+1} from the language model.
The most likely candidate for w_{n+1} is computed by maximizing P(w_{n+1} | w_n, …, w_{n-o+2}),
where o is the order of the model.
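The maximization above can be sketched with plain n-gram counts. This is a minimal illustration, not the package's actual implementation; the helper names `train_ngrams` and `predict_next` are hypothetical:

```python
from collections import Counter, defaultdict

def train_ngrams(sentences, order):
    """Count n-grams of the given order; keys are (order-1)-word contexts."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - order + 1):
            context = tuple(words[i:i + order - 1])
            counts[context][words[i + order - 1]] += 1
    return counts

def predict_next(counts, context_words, order, k=3):
    """Return the k most likely next words given the last (order-1) input words."""
    context = tuple(w.lower() for w in context_words[-(order - 1):])
    return [word for word, _ in counts[context].most_common(k)]
```

For a 3-gram model (o = 3), only the last two input words matter: `predict_next` ranks candidates by the count of the context followed by each candidate word, which is proportional to P(w_{n+1} | w_n, w_{n-1}).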
Once the best candidates are computed, the probability of the whole sentence is approximated with the n-gram model:
P(w_1, …, w_n, w_{n+1}) = Prod_i P(w_i | w_{i-o+1}, …, w_{i-1})
For example, for a 2-gram model we have
P(w_1, w_2, w_3, w_4) = P(w_1) P(w_2 | w_1) P(w_3 | w_2) P(w_4 | w_3)
whereas for a 3-gram model we have
P(w_1, w_2, w_3, w_4) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) P(w_4 | w_2, w_3)
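The 2-gram factorization can be checked numerically with maximum-likelihood estimates P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}). A minimal sketch on a toy corpus (illustrative only; `bigram_logprob` is a hypothetical helper, and no smoothing is applied, so unseen bigrams would fail):

```python
import math
from collections import Counter

corpus = ["the cat sat", "the cat ran"]
tokens = [s.split() for s in corpus]
unigrams = Counter(w for ws in tokens for w in ws)
bigrams = Counter(pair for ws in tokens for pair in zip(ws, ws[1:]))
total = sum(unigrams.values())

def bigram_logprob(sentence):
    """log P(w_1, ..., w_n) = log P(w_1) + sum_i log P(w_i | w_{i-1})."""
    words = sentence.split()
    logp = math.log(unigrams[words[0]] / total)
    for prev, cur in zip(words, words[1:]):
        logp += math.log(bigrams[(prev, cur)] / unigrams[prev])
    return logp
```

On this corpus, P("the cat sat") = P(the) P(cat | the) P(sat | cat) = (2/6)(2/2)(1/2) = 1/6. Working in log space avoids underflow when multiplying many small probabilities.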
Higher-order models are more accurate, but at the cost of generating a large number of n-grams, which can negatively affect both storage space and computation time.
If the input string contains fewer words than the order of the model, the autocomplete computes the most likely n-gram of the same order as the model.