Python: looping through the files in a folder and counting word frequency

Asked 2025-04-17 12:02

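I've just started learning Python and want to write a script that counts the words in all the txt files in a folder. This is the code I have so far. The else branch works when I open a single txt file, but I get an error when I enter a folder. I know I need an append somewhere, but I've tried a few approaches without success.

*Update: I'd like the results to be merged. At the moment I get two separate results. I tried creating a new list and adding the results to it with a counter, but that broke it. Thanks again, this is a great community.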

import re
import os
import sys
import os.path
import fnmatch
import collections

def search( file ):

    if os.path.isdir(path) == True:
        for root, dirs, files in os.walk(path):
            for file in files:
                words = re.findall('\w+', open(file).read().lower())
                ignore = ['the','a','if','in','it','of','or','on','and','to']
                counter=collections.Counter(x for x in words if x not in ignore)
                print(counter.most_common(10))

    else:
        words = re.findall('\w+', open(path).read().lower())
        ignore = ['the','a','if','in','it','of','or','on','and','to']
        counter=collections.Counter(x for x in words if x not in ignore)
        print(counter.most_common(10))

path = input("Enter file and path, place ' before and after the file path: ")
search(path)

raw_input("Press enter to close: ")


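7 Answers

The parameter in the function definition looks wrong: the body of the function uses path, but the parameter is named file. It should be: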

def search(path):

The ignore list is correct, but membership tests would be faster if it were a set instead of a list:

ignore = set(['the','a','if','in','it','of','or','on','and','to'])

Other than that, the code looks fine :-)

Answered 2025-04-17 by Python大师
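
To illustrate the set suggestion, here is a minimal sketch, assuming Python 3. The names IGNORE and count_words are made up for this example; the word pattern and the ignored words are taken from the question.

```python
import collections
import re

# Hypothetical constant name; the words themselves come from the question.
IGNORE = {'the', 'a', 'if', 'in', 'it', 'of', 'or', 'on', 'and', 'to'}

def count_words(text):
    """Count the lowercase words in text, skipping the ones in IGNORE."""
    words = re.findall(r'\w+', text.lower())
    # A set lookup takes roughly constant time on average,
    # while a list is scanned element by element.
    return collections.Counter(w for w in words if w not in IGNORE)

counts = count_words("The cat and the hat: a cat in a hat")
print(counts.most_common(2))
```

With only ten ignored words the speed difference is small; it matters more as the ignore list grows.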


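When you iterate over the results of os.walk, file holds only the file name, not the directory it lives in. You need to join the directory name and the file name: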

for root, dirs, files in os.walk(path):
    for name in files:
        file_path = os.path.join(root, name)
        #do processing on file_path here

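I suggest moving the code that processes a file into a separate function. That way you don't have to write it twice, and problems are easier to track down.

Answered 2025-04-17 by Python大师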
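
A minimal sketch of that refactoring, assuming Python 3, might look like the following. The names count_file and count_path are hypothetical, and restricting the walk to *.txt files is an assumption based on the question's wording. The per-file counts are added into one Counter, which also gives the merged result the question asks for.

```python
import collections
import fnmatch
import os
import re

# Ignored words taken from the question; the constant name is hypothetical.
IGNORE = {'the', 'a', 'if', 'in', 'it', 'of', 'or', 'on', 'and', 'to'}

def count_file(file_path):
    """Return a Counter of the words in one file, skipping IGNORE."""
    with open(file_path) as handle:
        words = re.findall(r'\w+', handle.read().lower())
    return collections.Counter(w for w in words if w not in IGNORE)

def count_path(path):
    """Count words in a single file, or in every .txt file under a folder."""
    if not os.path.isdir(path):
        return count_file(path)
    total = collections.Counter()
    for root, dirs, files in os.walk(path):
        # Assumption: only .txt files should be counted.
        for name in fnmatch.filter(files, '*.txt'):
            # update() adds the new counts to the running totals.
            total.update(count_file(os.path.join(root, name)))
    return total
```

Calling print(count_path(path).most_common(10)) then prints a single combined top ten, whether path is a file or a folder.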


Change line 14 (the words = ... line inside the os.walk loop) to:

words = re.findall('\w+', open(os.path.join(root, file)).read().lower())

Also, if you replace the input line with

path = raw_input("Enter file and path")

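then you won't need to put ' around the path. In Python 2, raw_input returns the typed text as a string, whereas input evaluates it as a Python expression.

Answered 2025-04-17 by Python大师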
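
In Python 3, input already returns the typed text as a string and raw_input no longer exists. A minimal sketch that works on either version could look like this; read_line, clean_path and ask_for_path are hypothetical names, and stripping leftover quotes is an extra convenience that is not part of the answer above.

```python
try:
    read_line = raw_input  # Python 2: returns the typed text unchanged
except NameError:
    read_line = input      # Python 3: input() already returns a string

def clean_path(text):
    """Trim whitespace and one layer of matching surrounding quotes."""
    text = text.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        text = text[1:-1]
    return text

def ask_for_path(prompt="Enter file or folder path: "):
    """Prompt for a path and return it with stray quotes removed."""
    return clean_path(read_line(prompt))
```

The cleaning step means a path typed with the old quoting habit is still accepted.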
