A faster way to remove stop words in Python

User

#1 · edited 2024-04-19 21:06:16

First, you are rebuilding the stop word list for every string. Build it once. A set is a good fit here.

forbidden_words = set(stopwords.words('english'))
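A minimal sketch of that idea, assuming a small hand-written stand-in for the NLTK list so it runs without downloading the corpus. The names `STOP_WORDS` and `remove_stop_words` are illustrative and not from the original post:

```python
# Hypothetical stand-in for set(stopwords.words('english')); built once at import time.
STOP_WORDS = frozenset({"the", "a", "an", "is", "of", "and"})


def remove_stop_words(text):
    """Drop stop words from a whitespace-separated string."""
    # Set membership is a hash lookup, so each word is checked in O(1) on
    # average instead of being compared against every entry of a list.
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


print(remove_stop_words("the cat is on a mat"))  # cat on mat
```

The comparison is case-sensitive, so lowercasing the input first may be needed depending on the data.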

Next, get rid of the [] inside join. Use a generator instead.

' '.join([x for x in ['a', 'b', 'c']])

becomes

' '.join(x for x in ['a', 'b', 'c'])

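The next thing to tackle would be making .split() yield values instead of returning a list. ~~I believe regex would be a good replacement here.~~ See this thread for why s.split() is actually fast.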
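For reference, a lazy splitter could look roughly like the sketch below. `iter_words` is a hypothetical helper, not something from the post, and since s.split() is already fast this is mainly of interest for very long strings where holding the whole word list in memory matters. It is worth timing on real data before adopting it.

```python
import re

# Runs of non-whitespace characters; for ordinary text these are the same
# tokens that str.split() with no arguments returns.
_TOKEN = re.compile(r"\S+")


def iter_words(text):
    """Yield whitespace-separated words one at a time instead of building a list."""
    for match in _TOKEN.finditer(text):
        yield match.group()


for word in iter_words("hello   bye the  hi"):
    print(word)
```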

Finally, do this kind of work (removing stop words from 6 million strings) in parallel. That is a whole different topic.
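As a rough starting point only, one way to parallelise this with the standard library is sketched here. The stop word set is a hypothetical stand-in, and `clean`, `clean_all`, the worker count and the chunk size are all assumptions to tune, not recommendations from the post:

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical stand-in for the real stop word set.
STOP_WORDS = frozenset({"the", "a", "an", "is", "of", "and"})


def clean(text):
    """Remove stop words from one string."""
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def clean_all(texts, workers=4, chunksize=10_000):
    """Clean many strings across worker processes, preserving input order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Sending strings in large chunks keeps inter-process overhead low
        # when each individual string is short.
        return list(pool.map(clean, texts, chunksize=chunksize))
```

Call `clean_all(texts)` from under an `if __name__ == "__main__":` guard, since worker processes may re-import the module on some platforms. Whether this beats a single process depends on how long the strings are, because every string has to be pickled to and from the workers.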

User

#2 · edited 2024-04-19 21:06:16

Use a regexp to remove every word that matches a stop word:

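import re
from nltk.corpus import stopwords

# One alternation of all stop words; \b keeps matches to whole words only.
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
text = pattern.sub('', text)  # text is the input string to clean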

This will probably be much faster than looping yourself, especially for large input strings.

If this removes the last word in the text, you may be left with trailing whitespace. I suggest handling that separately.
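A self-contained sketch of the regexp approach with the whitespace cleanup included. The short `STOP_WORDS` list is a hypothetical stand-in for stopwords.words('english'), and `re.escape` plus the final `.strip()` are additions not present in the snippet above:

```python
import re

# Hypothetical short list standing in for stopwords.words('english').
STOP_WORDS = ["the", "a", "is", "of"]

# re.escape guards against stop words that contain regex metacharacters.
PATTERN = re.compile(r"\b(?:" + "|".join(map(re.escape, STOP_WORDS)) + r")\b\s*")


def strip_stop_words(text):
    """Delete stop words, then trim whitespace left at either end."""
    return PATTERN.sub("", text).strip()


print(strip_stop_words("the end of the"))  # end
```

Compiling the pattern once at module level matters for the same reason as caching the stop word list: rebuilding it per call would dominate the runtime.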

User

#3 · edited 2024-04-19 21:06:16

Try caching the stopwords object, as shown below. Constructing it every time the function is called seems to be the bottleneck.

    from nltk.corpus import stopwords

    cachedStopWords = stopwords.words("english")

    def testFuncOld():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

    def testFuncNew():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in cachedStopWords])

    if __name__ == "__main__":
        for i in range(10000):
            testFuncOld()
            testFuncNew()

I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.

nCalls Cumulative Time

10000 7.723 words.py:7(testFuncOld)

10000 0.140 words.py:11(testFuncNew)

So caching the stopwords instance gives a speedup of roughly 55x going by these timings (7.723 s vs. 0.140 s).
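The cached list can also be turned into a set, as the first reply suggests. A sketch for timing both variants with `timeit`, assuming a short hypothetical stop word list so it runs without NLTK (the function names are illustrative):

```python
import timeit

# Hypothetical stand-in for stopwords.words("english"); the real list is
# longer, which makes list membership tests slower still.
STOP_LIST = ["the", "a", "an", "is", "of", "and", "to", "in", "it", "that"]
STOP_SET = frozenset(STOP_LIST)

TEXT = "hello bye the the hi"


def filter_with_list():
    return " ".join([w for w in TEXT.split() if w not in STOP_LIST])


def filter_with_set():
    return " ".join([w for w in TEXT.split() if w not in STOP_SET])


if __name__ == "__main__":
    for func in (filter_with_list, filter_with_set):
        seconds = timeit.timeit(func, number=10_000)
        print(f"{func.__name__}: {seconds:.4f} s")
```

Actual numbers will vary by machine and by the size of the stop word list, so none are quoted here.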
