Why don't I get consistent results when using spaCy for stemming/lemmatization?

2 Answers

User

#1 · edited 2024-04-24 22:00:11

I think syllogism explained it better. But here is another way:

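from nltk.stem import WordNetLemmatizer  # needs the WordNet corpus: nltk.download('wordnet')

lemma = WordNetLemmatizer()
line = 'Algorithms; Deterministic algorithms; Adaptive algorithms; Something...'.lower().split(';')
# Split each phrase into its words
line = [a.strip().split(' ') for a in line]
# Lemmatize every word (WordNetLemmatizer treats words as nouns by default)
line = [[lemma.lemmatize(x) for x in l1] for l1 in line]
print(line)

Output:

[['algorithm'], ['deterministic', 'algorithm'], ['adaptive', 'algorithm'], ['something...']]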

User

#2 · edited 2024-04-24 22:00:11

Which version are you using? With lower it works correctly for me:

>>> doc = nlp(u'Algorithms; Deterministic algorithms; Adaptive algorithms; Something...'.lower())
>>> for word in doc:
...   print(word.text, word.lemma_, word.tag_)
... 
(u'algorithms', u'algorithm', u'NNS')
(u';', u';', u':')
(u'deterministic', u'deterministic', u'JJ')
(u'algorithms', u'algorithm', u'NNS')
(u';', u';', u':')
(u'adaptive', u'adaptive', u'JJ')
(u'algorithms', u'algorithm', u'NN')
(u';', u';', u':')
(u'something', u'something', u'NN')
(u'...', u'...', u'.')

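Without lower, the tagger assigns Algorithms the tag NNP, i.e. a proper noun. That prevents lemmatization, because the model has statistically guessed that the word is a proper noun.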
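To make that mechanism concrete, here is a toy sketch of a tag-aware lemmatizer. It is not spaCy's implementation: toy_lemmatize and its single plural rule are hypothetical, and the tags are passed in by hand rather than predicted by a model. It only illustrates why the same word can come back unchanged under NNP but reduced under NNS.

```python
def toy_lemmatize(word, tag):
    """Hypothetical stand-in for a tag-aware lemmatizer (not spaCy's code)."""
    if tag == "NNP":
        # Proper nouns are left exactly as written.
        return word
    if tag == "NNS" and word.endswith("s"):
        # Naive plural rule, enough for this illustration only.
        return word[:-1]
    return word


as_proper_noun = toy_lemmatize("Algorithms", "NNP")
as_plural_noun = toy_lemmatize("algorithms", "NNS")
print(as_proper_noun)  # Algorithms
print(as_plural_noun)  # algorithm
```

So lowercasing does not change the lemmatizer itself; it changes the tag the word receives, and the tag decides whether any rule is applied.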

If you like, you can set a special-case rule in the tokenizer to tell spaCy that Algorithms is never a proper noun.


The tokenizer.add_special_case function lets you specify how a string is tokenized and set attributes on each subtoken.
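As a rough sketch of what such a rule looks like, assuming spaCy 3.x is installed: the string "gimme" and its split into "gim" + "me" are illustrative choices, not taken from the question. As far as I know, in 3.x a special case may only set ORTH and NORM, so overriding a tag or lemma for Algorithms would have to be done elsewhere in the pipeline. A blank pipeline is used so that no trained model is needed.

```python
import spacy
from spacy.symbols import ORTH, NORM

# Blank English pipeline: tokenizer only, no tagger, no model download.
nlp = spacy.blank("en")

# Illustrative rule: the ORTH values must join back into the original string.
nlp.tokenizer.add_special_case(
    "gimme",
    [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}],
)

doc = nlp("gimme that")
tokens = [t.text for t in doc]
first_norm = doc[0].norm_
print(tokens)      # ['gim', 'me', 'that']
print(first_norm)  # give
```

The rule matches the exact string only, so a differently cased form such as "Gimme" would need its own special case.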
