Speeding up os.walk code that searches for a specific file type

Asked on 2025-04-18 18:20

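I have a function that walks a folder and its subfolders looking for files of a specific type. It works fine, but it is a bit slow. Can anyone suggest a more "Pythonic" way to speed it up?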

import os

def findbyfiletype(filetype, directory):
    """
    findbyfiletype allows the user to search by two parameters, filetype and directory.

    Example:
        If the user wishes to locate all pdf files within a directory including subdirectories
        then the function would be called as follows:

        findbyfiletype(".pdf", "D:\\\\")

        this will return a dictionary of strings where the filename is the key and the file path is the value
        e.g.
            {'file.pdf':'c:\\folder\\file.pdf'}

        note that both parameters filetype and directory must be enclosed in single or double quotes
        and the directory parameter must use the escaped backslash \\\\ rather than a single backslash, as Python will raise a string literal error
    """

    indexlist = []                      # holds all files in the given directory including sub folders
    FiletypeFilenameList = []           # holds list of all filenames of defined filetype in indexlist
    FiletypePathList = []               # holds path names to individual files of defined filetype

    for root, dirs, files in os.walk(directory):
        for name in files:
            indexlist.append(os.path.join(root, name))
            if filetype in name[-5:]:
                FiletypeFilenameList.append(name)

    for files in indexlist:
        if filetype in files[-5:]:
            FiletypePathList.append(files)

    FileDictionary = dict(zip(FiletypeFilenameList, FiletypePathList))
    del indexlist, FiletypePathList, FiletypeFilenameList

    return FileDictionary

OK, this is what I ended up with, combining the suggestions from @Ulrich Eckhardt, @Anton and @Cox.

import os
import scandir  # third-party backport of os.scandir (PEP 471)

def findbyfiletype(filetype, directory):
    FileDictionary = {}

    for root, dirs, files in scandir.walk(directory):
        for name in files:
            # endswith() already implies that filetype occurs in name
            if name.endswith(filetype):
                FileDictionary[name] = os.path.join(root, name)

    return FileDictionary

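As you can see, I refactored the code, dropped the unnecessary lists, and now build the dictionary in a single step. @Anton, the scandir module you suggested was a huge help: in one case it made the search about 97% faster, which nearly blew my mind.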
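For readers on Python 3.6 or later, the same idea can be sketched with the standard library's os.scandir, without installing the separate scandir package. This is only a sketch under assumptions: the helper name iter_files_by_suffix is hypothetical and not from the post, matching is case-sensitive as in the code above, and unreadable directories are simply skipped.

```python
import os

def iter_files_by_suffix(directory, suffix):
    """Yield (name, full_path) for each file under directory whose name ends with suffix.

    Hypothetical helper, not part of the original post.
    """
    try:
        with os.scandir(directory) as entries:
            for entry in entries:
                # follow_symlinks=False avoids descending into symlinked directories
                if entry.is_dir(follow_symlinks=False):
                    yield from iter_files_by_suffix(entry.path, suffix)
                elif entry.name.endswith(suffix):
                    yield entry.name, entry.path
    except PermissionError:
        # assumption: directories we cannot read are skipped silently
        return

def findbyfiletype(filetype, directory):
    # build the dictionary directly from the generator, no intermediate lists
    return dict(iter_files_by_suffix(directory, filetype))
```

Note that keying the dictionary by filename, as the post does, means that files with the same name in different folders overwrite each other; only one path per name survives.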

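I marked @Anton's answer as accepted because it sums up what I achieved with the refactoring, but @Ulrich Eckhardt and @Cox both got upvotes too, since you were all very helpful.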

Regards


2 Answers

walk() can be slow because it tries to handle a lot of things.

I use a simple variant:

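def walk(self, path):
    try:
        # build full paths lazily for every entry in this directory
        paths = (os.path.join(path, x) for x in os.listdir(path))
        for x in paths:
            if os.path.isdir(x):
                self.walk(x)
            elif x.endswith(("jpg", "png", "jpeg")):
                self.lf.append(x)
    except PermissionError:
        pass  # skip directories we are not allowed to read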

This runs fast, and since the operating system caches file system metadata, a second call is faster still.

Note: this walk function is a method of a class, which is why you see "self".

Edit: on NTFS you don't need to bother with islink. Updated to use try/except.

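However, this simply ignores folders you don't have permission to access. If you want to list those folders, you need to run the script as administrator.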
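Since the answer only shows the method, here is a minimal sketch of a class it could live in. The class name ImageCollector, the suffixes parameter, and the constructor are assumptions added for illustration; only walk() and the lf list come from the answer.

```python
import os

class ImageCollector:
    """Hypothetical wrapper class; only walk() and lf appear in the answer."""

    def __init__(self, suffixes=(".jpg", ".png", ".jpeg")):
        self.suffixes = suffixes  # str.endswith() accepts a tuple of suffixes
        self.lf = []              # list of matching file paths

    def walk(self, path):
        try:
            names = os.listdir(path)
        except PermissionError:
            return  # unreadable directory: skip it, as the answer does
        for name in names:
            full = os.path.join(path, name)
            if os.path.isdir(full):
                self.walk(full)
            elif full.endswith(self.suffixes):
                self.lf.append(full)
```

To use it, create an ImageCollector, call its walk() with a starting directory, and read the results from lf. Narrowing the try block to the os.listdir() call is a design choice in this sketch, so that only the directory read is guarded.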

Answered on 2025-04-18 by Python大师


Instead of os.walk(), try the faster scandir module (PEP 471; available in the standard library as os.scandir since Python 3.5).

A few other small suggestions:

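Don't use ad hoc slicing like [-5:]. Use the endswith() string method or os.path.splitext() instead.
Don't build two long lists first and then turn them into a dictionary. Just build the dictionary directly.
If backslashes bother you, you can use forward slashes, for example 'c:/folder/file.pdf'. That works too.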
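To illustrate the first suggestion, the sketch below compares the three ways of testing a file's type. The function names are made up for this comparison, and the case-insensitive comparison in the splitext variant is an added assumption, not something the answer asks for.

```python
import os

def matches_by_slice(name, filetype):
    # the question's test: substring search in the last five characters
    return filetype in name[-5:]

def matches_by_endswith(name, filetype):
    return name.endswith(filetype)

def matches_by_splitext(name, filetype):
    # os.path.splitext returns (root, ext); ext includes the leading dot
    return os.path.splitext(name)[1].lower() == filetype.lower()

print(matches_by_slice("notes.pdfx", ".pdf"))      # True  (false positive)
print(matches_by_endswith("notes.pdfx", ".pdf"))   # False
print(matches_by_slice("page.xhtml", ".xhtml"))    # False (suffix longer than 5 characters)
print(matches_by_endswith("page.xhtml", ".xhtml")) # True
print(matches_by_splitext("REPORT.PDF", ".pdf"))   # True
```

The slice test fails in both directions: it accepts names where the type merely appears somewhere near the end, and it misses any suffix longer than five characters.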

Answered on 2025-04-18 by Python大师
