French text

User

Answer 1 · edited 2024-05-13 10:36:10

Here is an old but relevant note from an nltk developer. It looks like most of the advanced stemmers in nltk are English-specific:

The nltk.stem module currently contains 3 stemmers: the Porter stemmer, the Lancaster stemmer, and a Regular-Expression based stemmer. The Porter stemmer and Lancaster stemmer are both English-specific. The regular-expression based stemmer can be customized to use any regular expression you wish. So you should be able to write a simple stemmer for non-English languages using the regexp stemmer. For example, for French:
from nltk import stem
stemmer = stem.Regexp('s$|es$|era$|erez$|ions$| <etc> ')
But you'd need to come up with the language-specific regular expression yourself. For a more advanced stemmer, it would probably be necessary to add a new module. (This might be a good student project.)
For more information on the regexp stemmer:
http://nltk.org/doc/api/nltk.stem.regexp.Regexp-class.html
-Edward

Note: the link he gives is dead; see here for the current RegexpStemmer documentation.
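To make the idea in the quoted note concrete, here is a minimal sketch of regex-based suffix stripping using only Python's standard re module. The function name regexp_stem and the min_length parameter are hypothetical names chosen for illustration, and the suffix list is just the incomplete one from the note, not a real French rule set.

```python
import re

# Hypothetical, deliberately incomplete list of French suffixes, taken from the
# pattern in the quoted note; a usable stemmer would need many more.
FRENCH_SUFFIXES = re.compile(r"(?:s|es|era|erez|ions)$")


def regexp_stem(word, min_length=4):
    """Strip one matching suffix; leave words shorter than min_length untouched."""
    if len(word) < min_length:
        return word
    return FRENCH_SUFFIXES.sub("", word)


for word in ["parlerez", "chantions", "tables", "les", "animaux"]:
    print(word, "->", regexp_stem(word))
```

With nltk installed, nltk.stem.RegexpStemmer(pattern, min=4) followed by .stem(word) should behave comparably. Note that irregular plurals such as "animaux" pass through unchanged, which is the main limit of this approach.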

However, the more recently added snowball stemmer appears to be able to stem French. Let's put it to the test:

>>> from nltk.stem.snowball import FrenchStemmer
>>> stemmer = FrenchStemmer()
>>> stemmer.stem('voudrais')
u'voudr'
>>> stemmer.stem('animaux')
u'animal'
>>> stemmer.stem('yeux')
u'yeux'
>>> stemmer.stem('dors')
u'dor'
>>> stemmer.stem('couvre')
u'couvr'

As you can see, some of the results are a bit dubious.

Not quite what you were hoping for, but I guess it's a start.

User

Answer 2 · edited 2024-05-13 10:36:10

The best solution I found is spaCy; it seems to do the job.

To install:

pip3 install spacy
python3 -m spacy download fr_core_news_md

Usage:

import spacy
nlp = spacy.load('fr_core_news_md')

doc = nlp(u"voudrais non animaux yeux dors couvre.")
for token in doc:
    print(token, token.lemma_)

Results:

voudrais vouloir
non non
animaux animal
yeux oeil
dors dor
couvre couvrir

See the documentation for more details: https://spacy.io/models/fr and https://spacy.io/usage

User

Answer 3 · edited 2024-05-13 10:36:10

Maybe with TreeTagger? I haven't tried it, but this application can work with French.

http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
http://txm.sourceforge.net/installtreetagger_fr.html
