Vectorized form of an NLP cleaning function

    import spacy

    nlp = spacy.load("en")

    def clean(text):
        """Text preprocessing for English text."""
        # Apply spaCy to the text
        doc = nlp(text)
        # Lemmatize and remove noise (stop words, digits, punctuation)
        tokens = [
            token.lemma_.strip()
            for token in doc
            if not token.is_stop
            and not nlp.vocab[token.lemma_].is_stop  # remove stop words
            and not token.is_punct  # remove punctuation
            and not token.is_digit  # remove digits
        ]
        # Rebuild the text from the remaining tokens
        text = " ".join(tokens)
        return text.lower()

1 answer

User

#1 · Posted 2024-04-19 02:32:17

Short answer

This kind of problem inherently takes time.

Long answer

Use regular expressions
Change the spaCy pipeline

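The more information about the string you need in order to decide what to keep, the longer the cleaning takes.

The good news is that if your text cleaning is relatively simple, a few regular expressions may do the trick.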
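As a rough illustration of the regex route, here is a minimal sketch that lowercases the text, strips digits and punctuation, and drops stop words. The name `clean_regex` and the tiny `STOP_WORDS` set are assumptions made for this example, not part of the question, and unlike the spaCy version it does no lemmatization.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}

# Compile once so the patterns are reused across calls.
_NON_LETTERS = re.compile(r"[^a-z\s]+")  # digits, punctuation, symbols
_WHITESPACE = re.compile(r"\s+")


def clean_regex(text):
    """Lowercase, strip digits/punctuation, drop stop words (no lemmatization)."""
    text = text.lower()
    # Replace every run of non-letter characters with a space.
    text = _NON_LETTERS.sub(" ", text)
    words = _WHITESPACE.split(text.strip())
    return " ".join(w for w in words if w and w not in STOP_WORDS)
```

For example, `clean_regex("The 3 cats, and a dog!")` gives `"cats dog"`. Whether losing lemmatization is acceptable depends on what the cleaned text is used for.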

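Otherwise, you are relying on the spaCy pipeline to help remove bits of text, which is expensive because by default it performs many operations: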

Tokenisation
Lemmatisation
Dependency parsing
NER
Chunking

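Alternatively, you can try your task again with the parts of the spaCy pipeline you don't need switched off, which may speed it up quite a bit.

For example, you could switch off named entity recognition, tagging and dependency parsing...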

nlp = spacy.load("en", disable=["parser", "tagger", "ner"])

Then try again; it should run faster.
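Since the question asks for a vectorized form, one further option is to push all texts through the pipeline in batches with `nlp.pipe` instead of calling `nlp(text)` once per text. The sketch below assumes an already-loaded pipeline is passed in as `nlp`; the function name `clean_batch` and the `batch_size` default are made up for this example, and it simplifies the original filter by skipping the extra `nlp.vocab[token.lemma_].is_stop` check.

```python
def clean_batch(nlp, texts, batch_size=256):
    """Clean many texts in one pass through an already-loaded spaCy pipeline."""
    cleaned = []
    # nlp.pipe streams the texts through the pipeline in batches,
    # which is usually faster than calling nlp(text) once per text.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        tokens = [
            token.lemma_.strip().lower()
            for token in doc
            # Drop stop words, punctuation and digits, as in clean().
            if not (token.is_stop or token.is_punct or token.is_digit)
        ]
        cleaned.append(" ".join(tokens))
    return cleaned
```

With a pandas column this could be used as `df["clean"] = clean_batch(nlp, df["text"])`, where the column names are placeholders. Depending on the spaCy version, lemmas may be less accurate or missing when the tagger is disabled, so it is worth checking a few outputs before relying on them.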
