The nltk.stem module currently contains 3 stemmers: the Porter
stemmer, the Lancaster stemmer, and a Regular-Expression based
stemmer. The Porter stemmer and Lancaster stemmer are both English-
specific. The regular-expression based stemmer can be customized to
use any regular expression you wish. So you should be able to write a
simple stemmer for non-English languages using the regexp stemmer.
For example, for French:
from nltk import stem
stemmer = stem.Regexp('s$|es$|era$|erez$|ions$| <etc> ')
But you'd need to come up with the language-specific regular
expression yourself. For a more advanced stemmer, it would probably
be necessary to add a new module. (This might be a good student
project.)
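The passage above is an old but relevant comment from an NLTK developer. It looks like most of the advanced stemmers in NLTK are English-specific.
Note: the link he gave is dead; see the current RegexpStemmer documentation instead.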
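In current NLTK releases the class is called `RegexpStemmer` rather than the `stem.Regexp` that appears in the old comment. Here is a minimal sketch of the same idea, assuming only the suffixes that comment lists (the `<etc>` part is left unfilled) and a `min=4` cutoff that I picked for illustration:

```python
from nltk.stem import RegexpStemmer

# Suffix list copied from the quoted comment; it is deliberately incomplete.
# min=4 is an assumption: words shorter than 4 characters are returned unchanged.
french_suffixes = r"s$|es$|era$|erez$|ions$"
stemmer = RegexpStemmer(french_suffixes, min=4)

# Sample words chosen for illustration only.
for word in ["chats", "parlions", "mangerez", "les"]:
    print(word, "->", stemmer.stem(word))
```

This only strips whichever listed suffix matches at the end of the word, so it is far cruder than a real French stemmer.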
However, the more recently added Snowball stemmer appears to be able to stem French. Let's test it:
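The original test code is not preserved here, so what follows is only a sketch of such a check. The word list is my own assumption, not the one used originally:

```python
from nltk.stem.snowball import SnowballStemmer

# "french" is one of the language names listed in SnowballStemmer.languages.
stemmer = SnowballStemmer("french")

# Sample words chosen for illustration; they are not from the original answer.
words = ["voudrais", "animaux", "yeux", "dors", "couvre"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```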
As you can see, some of the results are a bit dubious.
Not quite what you were hoping for, but I guess it's a start.
The best solution I've found is spaCy; it seems to do the job.
To install:
Usage:
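The original snippets are missing from this copy, so this is a rough sketch rather than the exact code. It assumes the small French pipeline `fr_core_news_sm` (the original may have used a different model), and `lemmatize` and `main` are helper names I made up. Note that spaCy gives you lemmas rather than stems.

```python
# Shell setup (assumed, not taken from the original answer):
#   pip install spacy
#   python -m spacy download fr_core_news_sm


def lemmatize(nlp, text):
    """Return (token text, lemma) pairs for every token in `text`."""
    return [(token.text, token.lemma_) for token in nlp(text)]


def main():
    # Imported here so the helper above does not require spaCy at import time.
    import spacy

    nlp = spacy.load("fr_core_news_sm")  # assumed model name, see lead-in
    # Sample words chosen for illustration only.
    for text, lemma in lemmatize(nlp, "voudrais non animaux yeux dors couvre"):
        print(text, lemma)
```

Call `main()` once the model has been downloaded.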
Result:
See the documentation for more details: https://spacy.io/models/fr and https://spacy.io/usage
Maybe with TreeTagger? I haven't tried it, but this tool can work with French:
http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
http://txm.sourceforge.net/installtreetagger_fr.html