Lemmatizer in R or Python (am, are, is -> be?)
1 Answer
Here is one way to do it in R, using MorphAdorner, a lemmatizer from Northwestern University.
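library(httr)
library(XML)

# Look up the lemma of each word via the MorphAdorner lemmatizer web service.
lemmatize <- function(wordlist) {
  url <- "http://devadorner.northwestern.edu/maserver/lemmatizer"

  # Query the service for a single word and return the first <lemma> value.
  get.lemma <- function(word, url) {
    response <- GET(url, query = list(spelling = word, standardize = "",
                                      wordClass = "", wordClass2 = "",
                                      corpusConfig = "ncf",  # Nineteenth Century Fiction
                                      media = "xml"))
    # Fetch the body as a string; `as` selects the return form in httr.
    body <- content(response, as = "text", encoding = "UTF-8")
    xml <- xmlInternalTreeParse(body, asText = TRUE)
    xmlValue(xml["//lemma"][[1]])
  }

  sapply(wordlist, get.lemma, url = url)
}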
words <- c("is", "am", "was", "are")
lemmatize(words)
#   is   am  was  are
# "be" "be" "be" "be"
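As you probably know, correct lemmatization requires knowing the word class (part of speech) and the contextually correct spelling, and the result also depends on which corpus is used.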
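To make the part-of-speech point concrete, here is a minimal offline sketch. It does not call MorphAdorner. The `lexicon` table, its part-of-speech labels, and the `lemma_for` helper are hypothetical names made up for illustration; they only show that the same spelling can map to different lemmas depending on its word class.

```r
# Hypothetical toy lexicon: the lemma depends on the (word, part of speech) pair.
# Entries and POS labels are illustrative assumptions, not MorphAdorner data.
lexicon <- data.frame(
  word  = c("saw", "saw", "leaves", "leaves"),
  pos   = c("noun", "verb", "noun", "verb"),
  lemma = c("saw", "see", "leaf", "leave"),
  stringsAsFactors = FALSE
)

# Return the lemma for a word given its part of speech.
# Falls back to the word itself when the pair is not in the lexicon.
lemma_for <- function(word, pos) {
  hit <- lexicon$lemma[lexicon$word == word & lexicon$pos == pos]
  if (length(hit) == 0) word else hit[1]
}

lemma_for("saw", "noun")     # the tool
lemma_for("saw", "verb")     # past tense of "see"
lemma_for("leaves", "noun")  # plural of "leaf"
```

With the word class left empty, as in the request above, a lemmatizer has to guess between such readings, so supplying the word class is worth doing whenever it is known.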