How do I preprocess text to remove stop words?

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import remove_stopwords

def read_text(text_path):
  text = []
  with open(text_path) as file:
    lines = file.readlines()
    for index, line in enumerate(lines):
      text.append(simple_preprocess(remove_stopwords(line)))
  return text

text = read_text('/content/text.txt')
text = [x for x in text if x]
text[:3]

[['clinical', 'guidelines', 'management', 'ibd'], ['polygenetic', 'risk', 'scores', 'add', 'predictive', 'power', 'clinical', 'models', 'response', 'anti', 'tnfα', 'therapy', 'inflammatory', 'bowel', 'disease'], ['anti', 'tumour', 'necrosis', 'factor', 'alpha', 'tnfα', 'therapy', 'widely', 'management', 'crohn', 'disease', 'cd', 'ulcerative', 'colitis', 'uc', 'however', 'patients', 'respond', 'induction', 'therapy', 'patients', 'lose', 'response', 'time', 'to', 'aid', 'patient', 'stratification', 'polygenetic', 'risk', 'scores', 'identified', 'predictors', 'response', 'anti', 'tnfα', 'therapy', 'we', 'aimed', 'replicate', 'association', 'polygenetic', 'risk', 'scores', 'response', 'anti', 'tnfα', 'therapy', 'independent', 'cohort', 'patients', 'establish', 'clinical', 'validity']]

1 answer

User

#1 · Posted on 2024-06-12 07:13:15

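The remove_stopwords() function is case-sensitive and does not strip punctuation. For example, "However" is not in STOPWORDS, but "however" is. That is why words such as 'however', 'to' and 'we' survive in your output: they start a sentence, so they are capitalized or followed by punctuation when the stop-word filter sees them, and simple_preprocess() only lowercases them afterwards. You should call simple_preprocess() first and filter the resulting tokens. This should work: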

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.parsing.preprocessing import remove_stopword_tokens

def read_text(text_path):
  text = []
  with open(text_path) as file:
    for line in file:
      # Lowercase and tokenize first, so "However," becomes "however"
      tokens = simple_preprocess(line)
      # Then drop stop words from the normalized tokens
      text.append(remove_stopword_tokens(tokens, stopwords=STOPWORDS))
  return text
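
To see why the order matters without installing anything, here is a minimal sketch that mimics the two pipelines in plain Python. The tiny STOP set and the helper names are made up for illustration, and tokenize() is only a rough stand-in for simple_preprocess() (it ignores the token length limits, for instance), so treat it as a demonstration of the idea rather than gensim's actual behavior.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# gensim's STOPWORDS is much larger.
STOP = frozenset({"however", "to", "we", "the", "of"})

def remove_stopwords_case_sensitive(line):
    # Split on whitespace and compare each word as-is, so "However,"
    # and "To" do not match the lowercase entries in STOP.
    return " ".join(w for w in line.split() if w not in STOP)

def tokenize(line):
    # Lowercase and keep alphabetic runs only (rough stand-in for
    # simple_preprocess, not a reimplementation of it).
    return re.findall(r"[a-z]+", line.lower())

def filter_then_tokenize(line):
    # Order used in the question: stop words leak through.
    return tokenize(remove_stopwords_case_sensitive(line))

def tokenize_then_filter(line):
    # Order used in the answer: tokens are normalized before filtering.
    return [t for t in tokenize(line) if t not in STOP]

line = "However, patients lose response. To aid stratification, we tested the scores."
print(filter_then_tokenize(line))
print(tokenize_then_filter(line))
```

With the first order, "However," and "To" slip past the filter and come out as 'however' and 'to'; with the second order they are removed along with 'we' and 'the'.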
