An autocomplete model that is easy to integrate with Flask applications

Detailed description of the Python project markov_autocomplete


Markov Autocomplete

A hidden Markov model that generates autocomplete suggestions.

How to use

The model can be trained on your own list of sentences.

For example, to train it on the first two paragraphs of Robinson Crusoe:

from markov_autocomplete.autocomplete import Autocomplete

sentences = ["I WAS born in the year 1632, in the city of York, of a good family, though not of that country, my father being a foreigner of Bremen, who settled first at Hull. He got a good estate by merchandise, and leaving off his trade, lived afterwards at York, from whence he had married my mother, whose relations were named Robinson, a very good family in that country, and from whom I was called Robinson Kreutznaer; but, by the usual corruption of words in England, we are now called - nay we call ourselves and write our name - Crusoe; and so my companions always called me.", "I had two elder brothers, one of whom was lieutenant-colonel to an English regiment of foot in Flanders, formerly commanded by the famous Colonel Lockhart, and was killed at the battle near Dunkirk against the Spaniards. What became of my second brother I never knew, any more than my father or mother knew what became of me."]

ac = Autocomplete(model_path="ngram", sentences=sentences, n_model=3, n_candidates=10, match_model="middle", min_freq=0, punctuations="", lowercase=True)

ac.predictions("country")

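How it works

Given an input string consisting of $n$ words $w_1, \dots, w_n$, the model predicts the next word $w_{n+1}$ from the language model.

The most likely candidates for $w_{n+1}$ are those that maximize the conditional probability

$$P(w_{n+1} \mid w_{n-o+2}, \dots, w_n)$$

where $o$ is the order of the model.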
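The prediction step above can be sketched in a few lines of plain Python. This is a minimal illustration and not the internals of markov_autocomplete: the helper names `train_ngrams` and `predict_next`, the whitespace tokenisation, and the toy sentences are all assumptions made for the example. The conditional probability is estimated from raw counts, with no smoothing.

```python
from collections import Counter, defaultdict


def train_ngrams(sentences, order):
    """Count how often each word follows each (order - 1)-word context."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - order + 1):
            context = tuple(words[i:i + order - 1])
            counts[context][words[i + order - 1]] += 1
    return counts


def predict_next(counts, text, order, n_candidates=3):
    """Rank candidates for w_{n+1} by the count-based estimate of
    P(w_{n+1} | last order-1 words of the input)."""
    words = text.lower().split()
    context = tuple(words[-(order - 1):]) if order > 1 else ()
    followers = counts.get(context, Counter())
    total = sum(followers.values())
    return [(word, c / total) for word, c in followers.most_common(n_candidates)]


# Hypothetical toy corpus, used only for this illustration.
toy_sentences = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat ran away",
]
counts = train_ngrams(toy_sentences, order=3)
print(predict_next(counts, "the cat", order=3))
```

With this toy corpus, "sat" follows the context "the cat" twice and "ran" once, so "sat" is ranked first with an estimated probability of 2/3.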

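Once the best candidates are computed, the probability of the whole sentence is approximated with the N-gram model

$$P(w_1, \dots, w_n, w_{n+1}) = \prod_{i=1}^{n+1} P(w_i \mid w_{i-o+1}, \dots, w_{i-1})$$

with the context truncated to the words actually available when $i < o$.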

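For example, for a 2-gram model we have

$$P(w_1, w_2, w_3, w_4) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2)\,P(w_4 \mid w_3)$$

For a 3-gram model, on the other hand, we have

$$P(w_1, w_2, w_3, w_4) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2)\,P(w_4 \mid w_2, w_3)$$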
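The two factorizations can be checked numerically with a small count-based sketch. The function names `ngram_counts` and `sentence_probability` and the toy corpus are assumptions for illustration only, not part of markov_autocomplete. Each factor is estimated as count(context + word) / count(context), and unseen n-grams simply yield probability zero because no smoothing is applied.

```python
from collections import Counter


def ngram_counts(sentences, max_order):
    """Count every n-gram of length 1..max_order (whitespace tokenisation)."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(1, max_order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def sentence_probability(counts, words, order):
    """Multiply P(w_i | previous order-1 words); the context is shorter
    for the first words of the sentence."""
    total_words = sum(c for gram, c in counts.items() if len(gram) == 1)
    prob = 1.0
    for i, word in enumerate(words):
        context = tuple(words[max(0, i - order + 1):i])
        numerator = counts[context + (word,)]
        denominator = counts[context] if context else total_words
        if numerator == 0 or denominator == 0:
            return 0.0  # unseen n-gram: this sketch applies no smoothing
        prob *= numerator / denominator
    return prob


# Hypothetical toy corpus, used only for this illustration.
toy_sentences = ["the cat sat", "the cat ran", "a cat sat", "a cat sat"]
counts = ngram_counts(toy_sentences, max_order=3)

sentence = ["the", "cat", "sat"]
print(sentence_probability(counts, sentence, order=2))  # P(the) P(cat|the) P(sat|cat)
print(sentence_probability(counts, sentence, order=3))  # P(the) P(cat|the) P(sat|the,cat)
```

On this corpus the 2-gram estimate is 2/12 · 2/2 · 3/4 = 0.125, while the 3-gram estimate is 2/12 · 2/2 · 1/2 ≈ 0.083, because the longer context "the cat" is followed by "sat" only half of the time.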

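Higher-order models are more accurate, but at the cost of generating a large number of n-grams, which can negatively affect storage space and computation time.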
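To get a feel for this growth, the sketch below counts the distinct n-grams in a short text for increasing n and compares them with the number of possible n-grams over the same vocabulary. The helper `distinct_ngrams` and the sample text are assumptions chosen for illustration.

```python
def distinct_ngrams(words, n):
    """Return the set of distinct n-grams in a list of tokens."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


# Hypothetical sample text, used only for this illustration.
tokens = "the cat saw the dog and the dog saw the cat and the cat ran".split()
vocabulary_size = len(set(tokens))

observed = {}
for n in (1, 2, 3):
    observed[n] = len(distinct_ngrams(tokens, n))
    # vocabulary_size ** n is the number of n-grams that could occur in principle
    print(n, observed[n], vocabulary_size ** n)
```

For this 15-token text the number of distinct n-grams rises from 6 to 9 to 13, while the number of possible n-grams rises from 6 to 36 to 216. On a real corpus the observed counts grow much faster, which is where the storage and lookup cost comes from.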

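If the input string contains fewer words than the order of the model, the autocompleter computes the most likely n-grams of the same order as the model.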
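One plausible reading of this fallback is that the typed words are treated as a prefix and the most frequent full n-grams beginning with that prefix are returned. The sketch below implements that reading only; it is an assumption about the behaviour, and `count_ngrams` and `complete_short_input` are hypothetical names, not functions of markov_autocomplete.

```python
from collections import Counter


def count_ngrams(sentences, order):
    """Count all n-grams of exactly `order` words."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - order + 1):
            counts[tuple(words[i:i + order])] += 1
    return counts


def complete_short_input(counts, text, n_candidates=3):
    """Rank full n-grams that begin with the typed words, most frequent first."""
    prefix = tuple(text.lower().split())
    matches = Counter(
        {gram: c for gram, c in counts.items() if gram[:len(prefix)] == prefix}
    )
    return [" ".join(gram) for gram, _ in matches.most_common(n_candidates)]


# Hypothetical toy corpus, used only for this illustration.
toy_sentences = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat ran away",
]
counts = count_ngrams(toy_sentences, order=3)
print(complete_short_input(counts, "the"))
```

Here the single word "the" is shorter than the 3-gram order, so the sketch returns the complete 3-grams "the cat sat" (seen twice) and "the cat ran" (seen once).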
