<p><a href="http://osdir.com/ml/python.nltk.devel/2007-06/msg00018.html" rel="noreferrer"><strong>Here</strong></a> is an old but relevant comment from an NLTK developer. It looks like most of the advanced stemmers in NLTK are English-specific:</p>
<blockquote>
<p>The nltk.stem module currently contains 3 stemmers: the Porter
stemmer, the Lancaster stemmer, and a Regular-Expression based
stemmer. The Porter stemmer and Lancaster stemmer are both
English-specific. The regular-expression based stemmer can be customized to
use any regular expression you wish. So you should be able to write a
simple stemmer for non-English languages using the regexp stemmer.
For example, for french:</p>
<pre><code>from nltk import stem
stemmer = stem.Regexp('s$|es$|era$|erez$|ions$| <etc> ')
</code></pre>
<p>But you'd need to come up with the language-specific regular
expression yourself. For a more advanced stemmer, it would probably
be necessary to add a new module. (This might be a good student
project.)</p>
<p>For more information on the regexp stemmer:</p>
<p><a href="http://nltk.org/doc/api/nltk.stem.regexp.Regexp-class.html" rel="noreferrer">http://nltk.org/doc/api/nltk.stem.regexp.Regexp-class.html</a></p>
<p>-Edward</p>
</blockquote>
<p>Note: the link he gives is dead; see <a href="http://www.nltk.org/api/nltk.stem.html#module-nltk.stem.regexp" rel="noreferrer"><strong>here</strong></a> for the current RegexpStemmer documentation.</p>
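<p>To make the suffix-stripping idea from the quote concrete, here is a minimal sketch that uses only Python's standard <code>re</code> module rather than NLTK itself. The function name <code>regexp_stem</code>, the suffix list, and the minimum-length cutoff are all assumptions made for illustration; a usable French stemmer would need a far more complete set of rules.</p>

```python
import re

# Hypothetical and deliberately incomplete list of French suffixes,
# taken from the example in the quoted message.
FRENCH_SUFFIXES = re.compile(r"(?:erez|ions|era|es|s)$")

# Assumption: words shorter than this are returned unchanged,
# so that short function words such as "les" are not mangled.
MIN_LENGTH = 4


def regexp_stem(word, pattern=FRENCH_SUFFIXES, min_length=MIN_LENGTH):
    """Strip one matching suffix from the end of the word, if any."""
    if len(word) >= min_length:
        return pattern.sub("", word)
    return word


for word in ["chantions", "parlera", "maisons", "les"]:
    print(word, "->", regexp_stem(word))
```

<p>Because the pattern is anchored at the end of the word and the regex engine reports the leftmost match, a longer suffix such as <code>ions</code> is stripped in preference to the bare <code>s</code>.</p>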
<p>However, the more recently added <a href="http://www.nltk.org/api/nltk.stem.html#module-nltk.stem.snowball" rel="noreferrer"><strong>snowball stemmer</strong></a> appears to be able to stem French. Let's put it to the test:</p>
<pre><code>>>> from nltk.stem.snowball import FrenchStemmer
>>> stemmer = FrenchStemmer()
>>> stemmer.stem('voudrais')
u'voudr'
>>> stemmer.stem('animaux')
u'animal'
>>> stemmer.stem('yeux')
u'yeux'
>>> stemmer.stem('dors')
u'dor'
>>> stemmer.stem('couvre')
u'couvr'
</code></pre>
<p>As you can see, some of the results are a bit dubious.</p>
<p>It's not quite what you were hoping for, but I guess it's a start.</p>