Simple Thai wordcut in Python using maximum matching
pythaiwordcut - Thai wordcut in Python
A simple Thai word segmenter written in Python, based on the maximum matching algorithm. Uses the Lexitron dictionary by default.
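To make the idea concrete, here is a minimal sketch of greedy forward longest-match segmentation, which is one common reading of "maximum matching". It is not taken from pythaiwordcut: the function name `max_match`, the `max_word_len` parameter, and the toy dictionary are all hypothetical, and the library's actual algorithm and data structures may differ. The sketch assumes Python 3.

```python
def max_match(text, dictionary, max_word_len=None):
    """Greedy forward longest-match segmentation (illustrative sketch only)."""
    if max_word_len is None:
        # Longest entry bounds how far ahead we ever need to look.
        max_word_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary word matches.
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy dictionary, not the Lexitron word list.
toy_dictionary = {"ทดสอบ", "การ", "ตัด", "คำ", "ตัดคำ"}
print("|".join(max_match("ทดสอบการตัดคำ", toy_dictionary)))
# prints: ทดสอบ|การ|ตัดคำ
```

Because the longest candidate wins at each position, "ตัดคำ" is kept as one token even though "ตัด" and "คำ" are also in the toy dictionary.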
Please note: This project is under development and should not be used in production. All functions and interfaces are subject to change. If you have an issue or a suggestion, please feel free to ask. Contributions are also very welcome :)
Installation
pip install pythaiwordcut
or
git clone https://github.com/zenyai/pythaiwordcut.git
cd pythaiwordcut
python setup.py install
Usage
import pythaiwordcut as pwt
pt = pwt.wordcut(
    removeRepeat=True,
    stopDictionary="<full path to txt file>",
    removeSpaces=True,
    minLength=1,
    stopNumber=False,
    removeNonCharacter=False,
    caseSensitive=True,
    ngram=(1,2),
    negation=False,
)
print("|".join(pt.segment(u'ทดสอบการตัดคำ')))
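- removeRepeat: remove intentionally inserted spelling errors, e.g. (_____)
- removeSpaces: remove whitespace
- minLength: minimum length of each word
- removeNonCharacter: remove characters that are neither Thai nor English
- caseSensitive: if set to False, stop words are removed regardless of case
- ngram: add word n-grams for the given range, e.g. (1,2)
- negation: if set to True, adds the prefix not_ to every word that follows a negation word and a space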
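As an illustration of what an ngram range such as (1,2) could mean, the sketch below builds word n-grams from a list of tokens that has already been segmented. It is not the library's implementation: the helper name `add_word_ngrams` is hypothetical, and representing each n-gram as a tuple is an assumption made here to avoid guessing how pythaiwordcut joins the words.

```python
def add_word_ngrams(tokens, ngram=(1, 2)):
    """Return all word n-grams with n in the inclusive range ngram (sketch only)."""
    low, high = ngram
    result = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the list.
        for i in range(len(tokens) - n + 1):
            result.append(tuple(tokens[i:i + n]))
    return result


print(add_word_ngrams(["a", "b", "c"], ngram=(1, 2)))
# prints: [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c')]
```

With the range (1,2) the result contains every single word followed by every pair of adjacent words.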