Python: how do I find n-gram patterns in text?

2 votes

3 answers

2013 views

Asked 2025-04-17 13:03

I have a string that can be very long, for example:

s = 'Choose from millions of possibilities on Shaadi.com. Create your profile, search&contact; your special one.RegisterFree\xa0\xa0\xa0unsubscribing reply to this mail\xa0\n and 09times and this is limited time offer! and this is For free so you are saving cash'

I also have a list of spam words, which might look like this:

p_words = ['cash', 'for free', 'limited time offer']

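I want to know whether any of these words occur in the input text, and how many times each one occurs.

If every entry were a single word, this would be easy: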

import re
p = re.compile(''.join[p_words])  # correct me if I am wrong here
m = p.match(s)

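But an entry can also be a bigram, a trigram, or any n-gram.

How should I handle this?

Tags: text processing, string matching, word frequency, n-gram, spam word filtering

3 answers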

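Regular expressions use '|' to separate alternatives. Within each alternative you can replace the spaces with something like '\W+', which matches one or more non-word characters (anything other than letters, digits, and underscore). That should be enough.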
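The idea above could be sketched roughly as follows. This is only an illustration: `build_pattern` is a hypothetical helper name, the sample text is a shortened variant of the question's string, and `re.IGNORECASE` is an added assumption so that 'For free' is counted as well.

```python
import re
from collections import Counter

def build_pattern(phrases):
    # Join the phrases with '|'; inside each phrase, any run of
    # whitespace is replaced by \W+ (one or more non-word characters).
    alternatives = []
    for phrase in phrases:
        words = [re.escape(word) for word in phrase.split()]
        alternatives.append(r"\W+".join(words))
    return re.compile("|".join(alternatives), re.IGNORECASE)

p_words = ['cash', 'for free', 'limited time offer']
# Assumed sample text with irregular separators between the words
s = 'this is limited   time-offer! and this is For free so you are saving cash'

pattern = build_pattern(p_words)

# Normalise each hit to lower-case words joined by single spaces,
# so differently spaced hits are counted under the same key.
counts = Counter(
    " ".join(re.split(r"\W+", m.group().lower()))
    for m in pattern.finditer(s)
)
print(counts)
```

If one phrase is a prefix of another, it may help to sort the phrases longest first before joining them, because alternation tries the options from left to right.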

Answered 2025-04-17 by Python大师


p = re.compile('|'.join(re.escape(w) for w in p_words))

p will match any one of the strings in p_words.
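As a usage sketch with the question's `p_words` (the sample text is shortened here, which is an assumption): `search` tells you whether any phrase occurs, and `findall` lists every hit. Note that `match`, as used in the question, only tests at the start of the string.

```python
import re

p_words = ['cash', 'for free', 'limited time offer']
# Shortened stand-in for the question's text
s = 'this is limited time offer! and this is For free so you are saving cash'

p = re.compile('|'.join(re.escape(w) for w in p_words))

print(p.match(s))         # None: match() only looks at position 0
print(bool(p.search(s)))  # True: search() scans the whole string

hits = p.findall(s)       # every non-overlapping occurrence, in order
print(hits)               # ['limited time offer', 'cash']
```

'For free' is not reported because the pattern is case-sensitive; passing `re.IGNORECASE` to `re.compile` would change that.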

Answered 2025-04-17 by Python大师


If the text is not very long and there are not many words, you could start with this:

d = {w: s.count(w) for w in p_words if w in s}
# -> {'cash': 1, 'limited time offer': 1}

You can compare its performance with this version:

import re
from collections import Counter

p = re.compile('|'.join(map(re.escape, p_words)))
d = Counter(p.findall(s))
# -> Counter({'limited time offer': 1, 'cash': 1})  # for the s in the question
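One way to run that comparison is with `timeit`. This is a rough sketch, not a benchmark: the input is a hypothetical sample sentence repeated 1000 times, the function names are made up for illustration, and the timings will vary by machine and input.

```python
import re
import timeit
from collections import Counter

p_words = ['cash', 'for free', 'limited time offer']
# Hypothetical test input: a short sample repeated to make it long
s = 'this is limited time offer! and this is For free so you are saving cash. ' * 1000

p = re.compile('|'.join(map(re.escape, p_words)))

def count_with_str():
    # One str.count() pass over the text per phrase
    return {w: s.count(w) for w in p_words if w in s}

def count_with_regex():
    # A single regex pass over the text for all phrases
    return Counter(p.findall(s))

# Both approaches agree on this input
assert count_with_str() == dict(count_with_regex())

for fn in (count_with_str, count_with_regex):
    seconds = timeit.timeit(fn, number=100)
    print(f'{fn.__name__}: {seconds:.4f} s for 100 runs')
```

The two counts can differ when phrases overlap each other: `str.count` counts every phrase independently, while `findall` returns non-overlapping matches only.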

For reference, you can also compare its speed with fgrep, which should be fast at matching multiple strings in an input stream:

$ grep -F -o -f  patternlist.txt largetextfile.txt  | sort | uniq -c

Output

  2 cash
  2 limited time offer

Answered 2025-04-17 by Python大师

