How to count occurrences of a word without being limited to exact matches

Published 2024-06-02 09:09:43


I have a file with the following contents.

Someone says; Hello; Someone responded Hello back
Someone again said; Hello; No response
Someone again said; Hello waiting for response

I have a Python script that counts how many times a particular word occurs in the file. Here is the script.

#!/usr/bin/env python

filename = "/path/to/file.txt"

number_of_words = 0
search_string = "Hello"

with open(filename, 'r') as file:
    for line in file:
        words = line.split()
        for i in words:
            if (i == search_string):
                number_of_words += 1

print("Number of words in " + filename + " is: " + str(number_of_words))

I expected the output to be 4, because Hello appears 4 times, but I get 2. Here is the script's output:

Number of words in /path/to/file.txt is: 2

I more or less understand that Hello; is not treated as Hello, because it is not exactly the word being searched for.
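Printing the tokens that split() produces makes the cause visible. This is a minimal sketch using only the first sample line; the variable names are illustrative and not part of the original script.

```python
line = "Someone says; Hello; Someone responded Hello back"

# split() with no arguments splits on whitespace only,
# so punctuation stays attached to the preceding word
tokens = line.split()

# An exact comparison therefore skips the token "Hello;"
matches = [t for t in tokens if t == "Hello"]

print(tokens)
print(len(matches))
```

Only the bare Hello token passes the comparison; Hello; is a different string.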

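Question:
Is there a way to make my script count Hello even when it is followed by a comma, semicolon, or period? Ideally some simple technique that does not require searching for a substring again within each word found.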


3 answers

You can use a regex together with Counter from the collections module:

txt = '''Someone says; Hello; Someone responded Hello back
Someone again said; Hello; No response
Someone again said; Hello waiting for response'''

import re
from collections import Counter
from pprint import pprint

# \b\w+\b matches whole words only, so trailing ';' is never part of a word
c = Counter(re.findall(r'\b\w+\b', txt))
pprint(c)

Prints:

Counter({'Someone': 4,
         'Hello': 4,
         'again': 2,
         'said': 2,
         'response': 2,
         'says': 1,
         'responded': 1,
         'back': 1,
         'No': 1,
         'waiting': 1,
         'for': 1})
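If only one word matters, its count can be read from the Counter by key. The following is a sketch under the assumption that the text is the same sample as in the question; the names counts, hello_count, and missing are illustrative.

```python
import re
from collections import Counter

txt = '''Someone says; Hello; Someone responded Hello back
Someone again said; Hello; No response
Someone again said; Hello waiting for response'''

# Build the word counts in one pass over the text
counts = Counter(re.findall(r'\b\w+\b', txt))

hello_count = counts['Hello']
missing = counts['Goodbye']  # a Counter returns 0 for keys it has not seen

print(hello_count)
```

Counting every word costs a little more than counting one, but the same Counter can then answer lookups for any other word without rereading the text.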

You can use a regular expression to solve this.

import re
filename = "/path/to/file.txt"

number_of_words = 0
search_string = "Hello"


with open(filename, 'r') as file:
    for line in file:
        words = line.split()
        for i in words:
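            # \b already stops before ';', so no optional ';?' is needed;
            # re.escape keeps any special characters in search_string literal
            b = re.search(r'\b' + re.escape(search_string) + r'\b', i)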
            if b:
                number_of_words += 1

print("Number of words in " + filename + " is: " + str(number_of_words))

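This counts any token that contains Hello as a whole word, so both "Hello" and "Hello;" match. You can extend the regex to cover other needs (for example, lowercase).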

It ignores words such as "Helloing", which some of the other examples here would count.

If you don't want to use a regex, you can check whether the word matches once its last character is removed, like this:

filename = "/path/to/file.txt"

number_of_words = 0
search_string = "Hello"

with open(filename, 'r') as file:
    for line in file:
        words = line.split()
        for i in words:
            if (i == search_string) or (i[:-1] == search_string and i[-1] == ';'):
                number_of_words += 1

print("Number of words in " + filename + " is: " + str(number_of_words))
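The check above only handles a trailing semicolon. One possible way to also cover commas and periods without a regex is to trim punctuation from both ends of each token with str.strip and string.punctuation. This is a sketch, not the answerer's method: the function name count_word and the inline sample list are assumptions made for illustration.

```python
import string

def count_word(lines, search_string):
    """Count tokens equal to search_string after trimming punctuation."""
    total = 0
    for line in lines:
        for token in line.split():
            # strip() removes leading and trailing punctuation only;
            # characters inside the token are left alone
            if token.strip(string.punctuation) == search_string:
                total += 1
    return total

# Sample lines standing in for the file from the question
sample = [
    "Someone says; Hello; Someone responded Hello back",
    "Someone again said; Hello; No response",
    "Someone again said; Hello waiting for response",
]
print(count_word(sample, "Hello"))
```

Because an open file yields lines when iterated, the same function should also work when passed the file object from the with block instead of a list.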

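A regular expression is the better tool here, since you want to ignore punctuation. It could be done with clever filtering and the .count() method, but this is simpler: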

import re
...
search_string = "Hello"
with open(filename, 'r') as file:
    filetext = file.read()
occurrences = len(re.findall(search_string, filetext))

print("Number of words in " + filename + " is: " + str(occurrences))

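To also match a lowercase first letter, change search_string accordingly (for fully case-insensitive matching, pass flags=re.IGNORECASE to re.findall instead):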

search_string = r"[Hh]ello"

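Or, if you want to match Hello explicitly as a word rather than as part of a longer string such as aHelloHellon, put the \b word-boundary assertion before and after it (see the documentation for more interesting tricks):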

search_string = r"\bHello\b"
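The word-boundary and case-insensitive variants can be combined in one small helper. This is a sketch under stated assumptions: count_occurrences and its ignore_case parameter are hypothetical names, and the sample string is made up to exercise each case.

```python
import re

def count_occurrences(text, word, ignore_case=False):
    """Count whole-word occurrences of word in text."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape treats the word literally; \b on each side rejects
    # matches embedded in longer words
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags))

# Made-up sample covering punctuation, case, and embedded matches
text = "Hello; hello, aHelloHellon Hello."
exact = count_occurrences(text, "Hello")
any_case = count_occurrences(text, "Hello", ignore_case=True)
print(exact, any_case)
```

Using re.escape matters when the search word may contain characters such as "." or "+", which would otherwise be interpreted as regex syntax.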
