NLP stemming and lemmatization with regular-expression tokenization

Posted 2024-05-16 03:32:35


Define a function called performStemAndLemma that takes one parameter. The first parameter, textcontent, is a string. A stub of the function definition is given in the editor. Perform the following tasks:

1. Tokenize all the words given in textcontent. A word should consist of letters, digits, or underscores. Store the list of tokenized words in tokenizedwords. (Hint: use regexp_tokenize.)

2. Convert all the words to lowercase. Store the result in the variable tokenizedwords.

3. Remove all stopwords from the unique set of tokenizedwords. Store the result in the variable filteredwords. (Hint: use the stopwords corpus.)

4. Stem each word appearing in filteredwords with PorterStemmer and store the result in the list porterstemmedwords.

5. Stem each word appearing in filteredwords with LancasterStemmer and store the result in the list lancasterstemmedwords.

6. Lemmatize each word appearing in filteredwords with WordNetLemmatizer and store the result in the list lemmatizedwords.

Return the porterstemmedwords, lancasterstemmedwords, and lemmatizedwords variables from the function. (A short illustration of the NLTK calls involved follows below.)
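For reference, here is a minimal, self-contained sketch of what each of these NLTK calls does on a made-up sample sentence (the nltk.download lines are only needed once per environment):

import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords'); nltk.download('wordnet')  # first-time setup

sample = "The cats are running"
# r'\w+' matches maximal runs of letters, digits, or underscores
tokens = nltk.tokenize.regexp_tokenize(sample, r'\w+')
print(tokens)                                              # ['The', 'cats', 'are', 'running']

stop_words = set(stopwords.words('english'))
print([t for t in tokens if t.lower() not in stop_words])  # ['cats', 'running']

print(nltk.stem.PorterStemmer().stem('running'))           # 'run'
print(nltk.stem.LancasterStemmer().stem('running'))        # 'run'
print(nltk.stem.WordNetLemmatizer().lemmatize('cats'))     # 'cat'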

My code:

import nltk
from nltk.corpus import stopwords

def performStemAndLemma(textcontent):
    # Write your code here
    # Step 1: tokenize on word characters
    tokenizedword = nltk.tokenize.regexp_tokenize(textcontent, pattern=r'\w*', gaps=False)
    # Step 2: lowercase, dropping the empty matches produced by \w*
    tokenizedwords = [x.lower() for x in tokenizedword if x != '']
    # Step 3: remove stopwords from the unique tokens
    unique_tokenizedwords = set(tokenizedwords)
    stop_words = set(stopwords.words('english'))
    filteredwords = []
    for x in unique_tokenizedwords:
        if x not in stop_words:
            filteredwords.append(x)
    # Steps 4, 5, 6: stem and lemmatize
    ps = nltk.stem.PorterStemmer()
    ls = nltk.stem.LancasterStemmer()
    wnl = nltk.stem.WordNetLemmatizer()
    porterstemmedwords = []
    lancasterstemmedwords = []
    lemmatizedwords = []
    for x in filteredwords:
        porterstemmedwords.append(ps.stem(x))
        lancasterstemmedwords.append(ls.stem(x))
        lemmatizedwords.append(wnl.lemmatize(x))
    return porterstemmedwords, lancasterstemmedwords, lemmatizedwords

However, the program still does not work correctly: it fails 2 test cases. Please point out the mistake in the code above and suggest an alternative solution.


3 Answers
def performStemAndLemma(textcontent):
    from nltk.corpus import stopwords

Just import stopwords after defining the function, as shown above; the rest of the code stays the same. (Presumably the grading environment executes only the function stub, so module-level imports are not picked up.)
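Applied to the question's code, the top of the function would then look like this (a minimal sketch; everything after the imports is unchanged from the question):

def performStemAndLemma(textcontent):
    # Keeping the imports inside the function means the entire solution
    # lives within the graded stub.
    import nltk
    from nltk.corpus import stopwords
    tokenizedword = nltk.tokenize.regexp_tokenize(textcontent, pattern=r'\w*', gaps=False)
    # ... remaining steps exactly as in the question ...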

Actually, the expected output treats uppercase and lowercase words as separate tokens, so you should collect the unique words first, before converting them all to lowercase. I hope the code below works:


import nltk
from nltk.corpus import stopwords

def performStemAndLemma(textcontent):
    # Write your code here
    # Step 1: tokenize on word characters
    tokenizedword = nltk.regexp_tokenize(textcontent, pattern=r'\w*', gaps=False)
    # Step 2: take the unique words *before* lowercasing
    tokenizedwords = [y for y in tokenizedword if y != '']
    unique_tokenizedwords = set(tokenizedwords)
    tokenizedwords = [x.lower() for x in unique_tokenizedwords]
    # Step 3: remove stopwords
    stop_words = set(stopwords.words('english'))
    filteredwords = []
    for x in tokenizedwords:
        if x not in stop_words:
            filteredwords.append(x)
    # Steps 4, 5, 6: stem and lemmatize
    ps = nltk.stem.PorterStemmer()
    ls = nltk.stem.LancasterStemmer()
    wnl = nltk.stem.WordNetLemmatizer()
    porterstemmedwords = []
    lancasterstemmedwords = []
    lemmatizedwords = []
    for x in filteredwords:
        porterstemmedwords.append(ps.stem(x))
        lancasterstemmedwords.append(ls.stem(x))
        lemmatizedwords.append(wnl.lemmatize(x))
    return porterstemmedwords, lancasterstemmedwords, lemmatizedwords
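To see why the ordering matters, here is a hypothetical call (the input string is made up, the NLTK data must already be downloaded, and the output is sorted because iterating a set gives no guaranteed order):

porterstemmedwords, lancasterstemmedwords, lemmatizedwords = performStemAndLemma("Dogs dogs barked")
# 'Dogs' and 'dogs' survive as two distinct unique tokens, then both
# lowercase to 'dogs', so the lemma 'dog' appears twice.
print(sorted(lemmatizedwords))   # ['barked', 'dog', 'dog']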

The following approach cleared all the test cases for me:

import nltk
from nltk.corpus import stopwords

def performStemAndLemma(textcontent):
    # Write your code here
    lancaster = nltk.LancasterStemmer()
    porter = nltk.PorterStemmer()
    wnl = nltk.WordNetLemmatizer()
    # \w+ avoids the empty-string matches that \w* produces
    tokens2_3 = nltk.regexp_tokenize(textcontent, r'\w+')
    stop_words = set(stopwords.words('english'))
    # Deduplicate before lowercasing, then drop stopwords
    tokenisedwords = [word for word in set(tokens2_3) if word.lower() not in stop_words]
    return ([porter.stem(word.lower()) for word in tokenisedwords],
            [lancaster.stem(word.lower()) for word in tokenisedwords],
            [wnl.lemmatize(word.lower()) for word in tokenisedwords])
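A hypothetical usage example (input made up; exact stems depend on the NLTK version, and the lists are sorted because the tokens pass through a set):

porter, lancaster, lemmas = performStemAndLemma("The runners were running quickly")
print(sorted(porter))    # e.g. ['quickli', 'run', 'runner']
print(sorted(lemmas))    # e.g. ['quickly', 'runner', 'running']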
