Removing stop words with spaCy

Posted 2024-05-14 23:37:02


I am cleaning one of the columns in my data frame, Sumcription, and trying to do three things:

  1. Tokenize
  2. Lemmatize
  3. Remove stop words

    import spacy
    # spaCy v2+ takes a `disable` list; the parser=False/entity=False
    # keywords are from the old spaCy 1.x API
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    df['Tokens'] = df.Sumcription.apply(lambda x: nlp(x))
    spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
    spacy_stopwords.add('attach')
    df['Lema_Token'] = df.Tokens.apply(
        lambda x: " ".join([token.lemma_ for token in x if token not in spacy_stopwords]))
    

However, when I print the result, the word attach is still in the output:

    attach poster on the wall because it is cool

Why isn't the stop word removed?

I also tried:

    df['Lema_Token_Test'] = df.Tokens.apply(lambda x: [token.lemma_ for token in x if token not in spacy_stopwords])

But the string attach still appears.


1 Answer
import spacy
import pandas as pd

# Load the spaCy model; disable the parser and NER since only tokens and
# lemmas are needed (the bare 'en' shorthand is deprecated)
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# New stop words list
customize_stop_words = [
    'attach'
]

# Mark them as stop words
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True


# Test data
df = pd.DataFrame({'Sumcription': ["attach poster on the wall because it is cool",
                                   "eating and sleeping"]})

# Convert each row into a spaCy document and keep the lemma of each token
# that is not a stop word. Finally join the lemmas into a string.
df['Sumcription_lema'] = df.Sumcription.apply(lambda text:
                                          " ".join(token.lemma_ for token in nlp(text)
                                                   if not token.is_stop))

print(df)

Output:

^{pr2}$
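For reference, the reason the question's version fails: `token not in spacy_stopwords` compares a spaCy `Token` object against a set of plain strings, so the membership test never matches and nothing is filtered; comparing `token.text`, or using `token.is_stop` as in the answer, works. A minimal pure-Python sketch of the same pitfall (a stand-in `Token` class, no spaCy needed):

```python
# Membership tests compare the object itself, not its text: a Token
# instance is never equal to a str, so `t not in stop_words` is always True.
class Token:
    """Stand-in for a spaCy Token: wraps a string but is a distinct object."""
    def __init__(self, text):
        self.text = text

stop_words = {"attach", "the", "is"}  # set of plain strings
tokens = [Token(w) for w in "attach poster on the wall".split()]

# Buggy: compares Token objects to strings, so nothing is removed.
buggy = [t.text for t in tokens if t not in stop_words]

# Fixed: compares the token's text to the string set.
fixed = [t.text for t in tokens if t.text not in stop_words]

print(buggy)  # every word survives, including 'attach'
print(fixed)  # 'attach' and 'the' are removed
```

The same reasoning is why the answer flags words with `nlp.vocab[w].is_stop = True` and filters on `token.is_stop`: the check then stays inside spaCy's own vocabulary instead of mixing `Token` objects with Python strings.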
