Python - Word occurrence count

0 votes

5 answers

6779 views

Asked 2025-04-17 09:36

I am trying to write a function that counts how many times a (whole) word occurs in a text, ignoring case.

For example:

>>> text = """Antoine is my name and I like python.
Oh ! your name is antoine? And you like Python!
Yes is is true, I like PYTHON
and his name__ is John O'connor"""

assert( 2 == Occs("Antoine", text) )
assert( 2 == Occs("ANTOINE", text) )
assert( 0 == Occs("antoin", text) )
assert( 1 == Occs("true", text) )    
assert( 0 == Occs("connor", text) )
assert( 1 == Occs("you like Python", text) )
assert( 1 == Occs("Name", text) )

Here is my naive attempt:

def Occs(word,text):
    return text.lower().count(word.lower())

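This does not work because it matches substrings rather than whole words.
The function also has to be fast, because the text can be very large.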
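To make the problem concrete, here is a minimal sketch of how substring counting over-counts. `occs_substring` is a hypothetical name for the naive attempt, used only for illustration.

```python
def occs_substring(word, text):
    # Same logic as the naive attempt: counts substrings, not whole words.
    return text.lower().count(word.lower())

# "antoin" is not a whole word here, yet it is found inside "Antoine".
print(occs_substring("antoin", "Antoine is my name"))  # 1, but 0 is wanted

# "is" is counted once inside "this" and once as a real word.
print(occs_substring("is", "this is it"))  # 2, but 1 is wanted
```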

Should I split the text into an array?
Is there a simple way to implement this function?

Edit (Python 2.3.4)

Tags: text processing, string splitting, function optimization, text analysis, case-insensitive, word count

5 Answers

Thanks for your help.
Here is my solution:

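import re

# Lookarounds that reject an adjacent letter, and an adjacent single apostrophe
# (so "connor" is not found inside "O'connor"); a doubled '' is allowed.
starte = "(?<![a-z])((?<!')|(?<=''))"
ende = "(?![a-z])((?!')|(?=''))"

def NumberOfOccurencesOfWordInText(word, text):
    """Returns the nb. of occurrences of whole word(s) (case insensitive) in a text"""
    # Add a boundary check only on a side where the word starts/ends with a letter.
    # re.escape keeps punctuation in the word from being read as regex syntax.
    pattern = (re.match('[a-z]', word, re.I) is not None) * starte \
              + re.escape(word) \
              + (re.match('[a-z]', word[-1], re.I) is not None) * ende
    return len(re.findall(pattern, text, re.IGNORECASE))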
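As a quick check of the lookaround idea above, here is a self-contained sketch run against the sample text from the question. `count_whole_word` is a hypothetical name, and its simplified lookarounds (reject any adjacent ASCII letter or apostrophe) are an assumption, not the exact pattern from this answer.

```python
import re

def count_whole_word(word, text):
    # Escape the search term so punctuation in it is matched literally,
    # then require that no letter or apostrophe touches it on either side.
    pattern = r"(?<![a-z'])" + re.escape(word) + r"(?![a-z'])"
    return len(re.findall(pattern, text, re.IGNORECASE))

text = """Antoine is my name and I like python.
Oh ! your name is antoine? And you like Python!
Yes is is true, I like PYTHON
and his name__ is John O'connor"""

print(count_whole_word("Antoine", text))          # 2
print(count_whole_word("antoin", text))           # 0
print(count_whole_word("connor", text))           # 0
print(count_whole_word("you like Python", text))  # 1
```

If the same word is searched many times, compiling the pattern once with `re.compile` and reusing it may save some work.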

Answered 2025-04-17 by Python大师


Here is a not very Pythonic way to do it. I am guessing this is a homework question...

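def count(word, text):
    """Count case-insensitive occurrences of word as a substring of text."""
    result = 0
    text = text.lower()
    word = word.lower()
    index = text.find(word, 0)
    while index >= 0:
        result += 1
        # Resume one character past the last hit; searching from `index`
        # again would find the same hit forever.
        index = text.find(word, index + 1)
    return result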

Of course, for very large files this will be slow, mainly because of the call to text.lower(). But you can always come up with a case-insensitive find to get around that!
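One possible way to get a case-insensitive search without lowercasing a copy of the whole text is the `re` module. This is only a sketch: `count_ignore_case` is a hypothetical name, and like the loop above it counts substrings, not whole words (it also counts non-overlapping matches only).

```python
import re

def count_ignore_case(word, text):
    # re.escape treats the search term literally; IGNORECASE replaces lower().
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    # finditer yields matches lazily, so no list of matches is built.
    return sum(1 for _ in pattern.finditer(text))

print(count_ignore_case("python", "I like Python, python and PYTHON"))  # 3
print(count_ignore_case("antoin", "Antoine"))  # 1, still a substring match
```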

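Why did I do it this way? Because I think it most directly expresses what you are trying to do: walk through text and count how many times word occurs in it.

This approach also sidesteps some annoying punctuation problems: split leaves the punctuation attached to the words, so you would not be able to match them, right?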
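The punctuation issue with split can be seen in a small sketch, using one line of the sample text from the question:

```python
text = "Oh ! your name is antoine? And you like Python!"

# Splitting on whitespace keeps punctuation glued to the neighbouring word.
tokens = text.lower().split()
print(tokens)

print(tokens.count("antoine"))   # 0, because the token is "antoine?"
print(tokens.count("antoine?"))  # 1
```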

Answered 2025-04-17 by Python大师


from collections import Counter
import re

Counter(re.findall(r"\w+", text))

Or, for the case-insensitive version:

Counter(w.lower() for w in re.findall(r"\w+", text))

Before Python 2.7, use defaultdict instead of Counter:

from collections import defaultdict

freq = defaultdict(int)
for w in re.findall(r"\w+", text):
    freq[w.lower()] += 1
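As a rough illustration of how the resulting table could be queried, here is a sketch using the sample text from the question. It also shows two limits of `\w+` tokenizing under the question's requirements: "O'connor" is split into "O" and "connor", and multi-word phrases are never keys.

```python
import re
from collections import Counter

text = """Antoine is my name and I like python.
Oh ! your name is antoine? And you like Python!
Yes is is true, I like PYTHON
and his name__ is John O'connor"""

# Build the frequency table once; every later lookup is a dictionary access.
freq = Counter(w.lower() for w in re.findall(r"\w+", text))

print(freq["antoine"])          # 2
print(freq["python"])           # 3
print(freq["connor"])           # 1, although the question expects 0
print(freq["you like python"])  # 0, phrases are not tokens
```

Building the table costs one pass over the text, which may pay off when many different words are looked up in the same text.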

Answered 2025-04-17 by Python大师
