Read words from a .txt file and count how often each word occurs

Asked 2025-04-16 14:32

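I am wondering how to read a string of characters the way fscanf does. I need to read the words in all of my .txt files and count how many times each word occurs.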

import collections

collectwords = collections.defaultdict(int)

with open('DatoSO.txt', 'r') as filetxt:
    for line in filetxt:
        v = ""
        for char in line:
            if char != " ":
                v = v + char
            else:
                # a space ends the current word
                collectwords[v] += 1
                v = ""

Written this way, the last word is never read.

3 Answers

Python makes this easy:

collectwords = []
with open('DatoSO.txt', 'r') as filetxt:
    for line in filetxt:
        # split() with no arguments splits on any whitespace, newlines included
        collectwords.extend(line.split())

Answered 2025-04-16 by Python大师

Hmm, something like this?

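import collections

collectwords = collections.defaultdict(int)
with open('DatoSO.txt', 'r') as filetxt:
    for line in filetxt:
        # split() splits on any whitespace, so the last word on a line is kept
        for word in line.split():
            collectwords[word] += 1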
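To see why this handles the last word, here is a minimal sketch that runs the same loop over an in-memory stand-in for the file. The helper name `count_words` and the sample text are assumptions made for illustration, not part of the original answer. The character-by-character loop in the question only counts a word when it meets a space, so whatever follows the final space on a line is dropped, whereas `split()` treats the newline and the end of the string as word boundaries too.

```python
import collections
import io


def count_words(lines):
    """Count whitespace-separated words over an iterable of lines."""
    counts = collections.defaultdict(int)
    for line in lines:
        # split() with no arguments splits on runs of any whitespace,
        # so the final word on each line needs no trailing space
        for word in line.split():
            counts[word] += 1
    return counts


# io.StringIO stands in for an open text file (hypothetical sample data)
fake_file = io.StringIO("one two three\ntwo three\nthree")
counts = count_words(fake_file)
print(dict(counts))
```

Any iterable of strings can be passed in, so an open file object works in place of the `StringIO` stand-in.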

Answered 2025-04-16 by Python大师

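If you are using Python 2.7 or later, consider using collections.Counter.

See the documentation for collections.Counter.

It adds a few methods, such as most_common, that may be useful for this kind of task.

From Doug Hellmann's PyMOTW: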

import collections

c = collections.Counter()
with open('/usr/share/dict/words', 'rt') as f:
    for line in f:
        # updating a Counter with a string counts its individual characters
        c.update(line.rstrip().lower())

print('Most common:')
for letter, count in c.most_common(3):
    print('%s: %7d' % (letter, count))

Note that this example counts letters, not words. On the c.update line you need to replace line.rstrip().lower() with line.split(), and you may also need some code to strip punctuation.
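As a rough sketch of that substitution, the version below passes `line.split()` to `c.update` so that whole words are counted. The helper name `count_words` and the sample text are assumptions for illustration, and the lowercasing is an optional extra so that "The" and "the" are counted together.

```python
import collections
import io


def count_words(lines):
    """Count words (case-insensitively) over an iterable of lines."""
    c = collections.Counter()
    for line in lines:
        # update() with a list of words counts each word once per occurrence
        c.update(line.lower().split())
    return c


# io.StringIO stands in for the open file (hypothetical sample data)
sample = io.StringIO("The cat saw the dog\nthe dog ran\n")
c = count_words(sample)

print('Most common:')
for word, count in c.most_common(2):
    print('%s: %7d' % (word, count))
```

With this sample, `most_common(2)` returns "the" (3 occurrences) followed by "dog" (2 occurrences).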

Edit: to strip punctuation, here is what is probably the fastest solution:

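import collections
import string

# Translation table that deletes every punctuation character (Python 3 form;
# Python 2 used line.translate(string.maketrans("", ""), string.punctuation))
strip_punct = str.maketrans('', '', string.punctuation)

c = collections.Counter()
with open('DatoSO.txt', 'rt') as f:
    for line in f:
        c.update(line.translate(strip_punct).split())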

(This approach is borrowed from the question "Best way to strip punctuation from a string in Python".)
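The question mentions reading all of the .txt files, not just one. A possible extension, sketched here under the assumption that the files sit directly in a single directory and are readable with the default text encoding, is to loop over the matches from `glob.glob`. The function name `count_words_in_dir` is hypothetical.

```python
import collections
import glob
import os
import string


def count_words_in_dir(directory):
    """Count words across every .txt file directly inside `directory`."""
    # Table that deletes all ASCII punctuation characters
    strip_punct = str.maketrans('', '', string.punctuation)
    c = collections.Counter()
    for path in glob.glob(os.path.join(directory, '*.txt')):
        with open(path, 'rt') as f:
            for line in f:
                c.update(line.translate(strip_punct).split())
    return c


# Example call (assumes the current directory holds the .txt files):
# for word, count in count_words_in_dir('.').most_common(3):
#     print('%s: %7d' % (word, count))
```

Keeping a single Counter across files gives combined totals; building one Counter per file and adding them with `+` would work as well if per-file counts are also needed.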

Answered 2025-04-16 by Python大师
