Can Python be configured to cache sys.path directory lookups?

12 votes

3 answers

1291 views

Asked 2025-04-18 16:55

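We have recently been testing the performance of running Python over a remote connection. The program runs offsite but accesses disks onsite. We are on RHEL6. We traced a simple program with strace and found that it spends a lot of time checking whether files exist, issuing stat and open calls. Over a remote connection that is expensive. Is there a way to make Python read a directory's contents once and cache the listing, so that it does not have to check again?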

Sample program test_import.py:

import random
import itertools

I ran the following commands:

$ strace -Tf python test_import.py >& strace.out
$ grep '/usr/lib64/python2.6/' strace.out | wc
331    3160   35350

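So it touched that directory roughly 331 times. Many of the results look like this:

stat("/usr/lib64/python2.6/posixpath", 0x7fff1b447340) = -1 ENOENT (No such file or directory) <0.000009>

If it cached the directory listing, it would not need to check whether each file exists.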
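To separate failed probes from successful ones, rather than only counting matching lines with grep and wc, the trace can be summarized with a few lines of Python. This is a minimal sketch: the `summarize` helper and the regular expression are illustrative assumptions, and only the first sample line comes from the trace above (the other three are made up for the demo).

```python
import re
from collections import Counter

# Matches lines such as:
#   stat("/path", 0x7fff...) = -1 ENOENT (No such file or directory) <0.000009>
# re.search is used so that a leading "[pid N] " prefix is simply skipped.
SYSCALL_RE = re.compile(r'(\w+)\("([^"]*)".*?\)\s+=\s+(-?\d+)')


def summarize(lines, prefix):
    """Count path-based syscalls under `prefix`, split into hits and misses."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.search(line)
        if match is None:
            continue
        syscall, path, result = match.groups()
        if not path.startswith(prefix):
            continue
        outcome = "miss" if int(result) < 0 else "hit"
        counts[(syscall, outcome)] += 1
    return counts


# First line is from the question; the rest are hypothetical sample lines.
sample = [
    'stat("/usr/lib64/python2.6/posixpath", 0x7fff1b447340) = -1 ENOENT (No such file or directory) <0.000009>',
    'open("/usr/lib64/python2.6/posixpath.so", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000010>',
    'open("/usr/lib64/python2.6/posixpath.pyc", O_RDONLY) = 3 <0.000012>',
    'open("/etc/ld.so.cache", O_RDONLY) = 3 <0.000008>',
]
summary = summarize(sample, "/usr/lib64/python2.6/")
for key, count in sorted(summary.items()):
    print(key, count)
```

On a real trace you would pass the lines of strace.out instead of `sample`.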


3 Answers

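Besides using an importer or zipimport, you should also consider freezing your code. Freezing can greatly reduce the number of system calls.

For Python's own freeze tool, see https://wiki.python.org/moin/Freeze; for a third-party tool, see http://cx-freeze.readthedocs.org/en/latest/

Freezing a simple script reduced the number of traced system calls from 232 to 88: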

$ strace -c -e stat64,open python2 hello.py
hello
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000011           0       232       161 open
------ ----------- ----------- --------- --------- ----------------
100.00    0.000011                   232       161 total
$ strace -c -e stat64,open ./hello
hello
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  -nan    0.000000           0        88        73 open
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                    88        73 total

The number of entries in your sys.path will still affect you, though; that is where importlib2 and its caching can help.

Answered 2025-04-18 by Python大师


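I know this may not be the answer you are looking for, but I'll answer anyway :D

There is no caching system for sys.path directories, but zipimport builds an index of the modules in a .zip file, and that index makes module lookups faster.

The drawback is that you cannot use it for binary modules (for example .so files), because dlopen(), which Python uses to load such modules, does not support loading them from an archive.

Another issue is that some modules (such as posixpath in your example) are loaded while the CPython interpreter is starting up.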
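To make the zipimport route concrete, here is a small sketch that builds an archive with one pure-Python module, puts the archive on sys.path, and imports from it. The names `bundle.zip` and `zipped_hello` are made up for the demo, and it is written for Python 3, so treat it as an illustration rather than something tuned for the asker's 2.6 setup.

```python
import os
import sys
import tempfile
import zipfile

# Build a throwaway archive containing one pure-Python module.
# "bundle.zip" and "zipped_hello" are hypothetical names for this demo.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "bundle.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("zipped_hello.py", "GREETING = 'hello from the zip'\n")

# Putting the archive itself on sys.path lets zipimport serve the module
# from the archive's file index instead of probing a directory on disk.
sys.path.insert(0, archive)
import zipped_hello

print(zipped_hello.GREETING)
print(zipped_hello.__file__)  # a path inside bundle.zip
```

As noted above, this only helps for .py/.pyc content; extension modules still have to live in a real directory.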

By the way, I hope you still remember me helping you pack your Disney/Pixar souvenirs at PythonBrasil :D

Answered 2025-04-18 by Python大师


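You can avoid this by upgrading to Python 3.3, or by replacing the standard import system with an alternative. In the talk I gave at PyOhio two weeks ago, I discussed the performance problem of the old import mechanism: it is O(nm), where n is the number of directories and m is the number of possible suffixes; you can start from this slide.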
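A rough way to picture the O(nm) behaviour is to enumerate the paths a naive finder would probe for one module. The suffix list below is an assumption based on what strace typically shows for Python 2 (the bare name for a package directory, two extension-module names, source, bytecode); the exact list and order vary by version and platform, and the first two directories are hypothetical.

```python
import os

# Assumed probe order for one module in one directory; not an exact copy
# of any particular interpreter's list.
SUFFIXES = ["", ".so", "module.so", ".py", ".pyc"]


def candidate_paths(module_name, search_path, suffixes=SUFFIXES):
    """List every path a naive finder would probe, in order."""
    return [
        os.path.join(directory, module_name + suffix)
        for directory in search_path
        for suffix in suffixes
    ]


# Hypothetical search path with n = 3 entries.
search_path = ["/opt/app", "/opt/app/eggs/a.egg", "/usr/lib64/python2.6"]
probes = candidate_paths("posixpath", search_path)

# n directories * m suffixes probes in the worst case.
print(len(probes), "==", len(search_path) * len(SUFFIXES))
for probe in probes[-5:]:
    print(probe)
```

A module that lives in the last sys.path entry, or that does not exist at all, pays for nearly every one of these probes, which is why long sys.path lists built by easy_install hurt so much.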

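I demonstrated how easy_install plus a Zope-based web framework generates 73,477 system calls just to perform enough imports to start the program.

For example, after a quick install of bottle on my laptop, I found that Python needs 1,994 calls, roughly 2,000, to import that module and get running: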

$ strace -c -e stat64,open python -c 'import bottle'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000179           0      1519      1355 open
  0.00    0.000000           0       475       363 stat64
------ ----------- ----------- --------- --------- ----------------
100.00    0.000179                  1994      1718 total

But if I go into os.py and add a caching importer, even a very naive implementation cuts the number of misses by nearly a thousand:

$ strace -c -e stat64,open python -c 'import bottle'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000041           0       699       581 open
  0.00    0.000000           0       301       189 stat64
------ ----------- ----------- --------- --------- ----------------
100.00    0.000041                  1000       770 total

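I chose os.py for the experiment because strace shows it is the first module Python imports, and the earlier we install our importer, the fewer standard library modules Python has to import through the old, slow mechanism!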

# Put this right below "del _names" in os.py (Python 2 only).
# At that point sys, listdir and path are already defined in os.py.

class CachingImporter(object):

    def __init__(self):
        self.directory_listings = {}

    def find_module(self, fullname, other_path=None):
        filename = fullname + '.py'
        for syspath in sys.path:
            listing = self.directory_listings.get(syspath, None)
            if listing is None:
                try:
                    listing = listdir(syspath)
                except OSError:
                    listing = []
                self.directory_listings[syspath] = listing
            if filename in listing:
                modpath = path.join(syspath, filename)
                return CachingLoader(modpath)

class CachingLoader(object):

    def __init__(self, modpath):
        self.modpath = modpath

    def load_module(self, fullname):
        if fullname in sys.modules:
            return sys.modules[fullname]
        import imp
        mod = imp.new_module(fullname)
        mod.__loader__ = self
        sys.modules[fullname] = mod
        mod.__file__ = self.modpath
        with file(self.modpath) as f:
            code = f.read()
        exec code in mod.__dict__
        return mod

sys.meta_path.append(CachingImporter())

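Of course, this approach has plenty of shortcomings: it makes no attempt to detect .pyc files, .so files, or any of the other extensions Python might look for. It also knows nothing about __init__.py files or modules inside packages (which would require running listdir() in subdirectories of the sys.path entries). But it at least illustrates that thousands of extra calls can be eliminated with something like this, and shows a direction you could try. When it cannot find a module, the normal import mechanism simply takes over.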
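For readers on Python 3, the same idea can be sketched against the newer finder protocol (`find_spec` instead of `find_module`). This is not the code from the answer, only a loose port under the same limitations: top-level .py files only, no packages, no extension modules, and a cache that never refreshes. `CachingFinder` and `cached_demo` are names made up for this sketch.

```python
import importlib
import importlib.util
import os
import sys
import tempfile


class CachingFinder:
    """Meta path finder that lists each sys.path directory at most once."""

    def __init__(self):
        self.directory_listings = {}

    def _listing(self, directory):
        if directory not in self.directory_listings:
            try:
                names = frozenset(os.listdir(directory or os.curdir))
            except OSError:
                names = frozenset()  # missing directory, zip archive, etc.
            self.directory_listings[directory] = names
        return self.directory_listings[directory]

    def find_spec(self, fullname, path=None, target=None):
        if path is not None:
            return None  # submodule of a package: leave it to other finders
        filename = fullname + ".py"
        for directory in sys.path:
            if filename in self._listing(directory):
                location = os.path.join(directory, filename)
                return importlib.util.spec_from_file_location(fullname, location)
        return None  # not found here: the next finder on sys.meta_path runs


# Demo with a throwaway module; "cached_demo" is a hypothetical name.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "cached_demo.py"), "w") as f:
    f.write("VALUE = 42\n")
sys.path.append(demo_dir)

finder = CachingFinder()
sys.meta_path.insert(0, finder)  # ahead of the default path-based finder

cached_demo = importlib.import_module("cached_demo")
print(cached_demo.VALUE)
```

Unlike the Python 2 version, the finder has to be inserted at the front of sys.meta_path, because on Python 3 the default finders are already on that list and would otherwise answer first.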

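I wonder whether a good caching importer is already available on PyPI? It seems like the kind of thing that has been written many times over. I think I remember Noah Gift writing one and putting it in a blog post, but I can't find a link that confirms my memory.

Edit: as @ncoglan mentioned in the comments, there is an alpha-release backport of the new Python 3.3+ import system to Python 2.7 on PyPI: http://pypi.python.org/pypi/importlib2. Unfortunately, it looks like the asker is still on 2.6.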
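One practical consequence of the directory caching in the Python 3.3+ import system: a module file created while the program is running may not be seen until the caches are invalidated. The sketch below shows the documented `importlib.invalidate_caches()` call; the module name `late_arrival` is made up for the demo.

```python
import importlib
import os
import sys
import tempfile

moddir = tempfile.mkdtemp()
sys.path.insert(0, moddir)

# First lookup: the finder for moddir scans the (still empty) directory.
try:
    importlib.import_module("late_arrival")  # hypothetical module name
except ImportError:
    print("not found yet")

# Create the module after the directory has already been scanned.
with open(os.path.join(moddir, "late_arrival.py"), "w") as f:
    f.write("ANSWER = 42\n")

# Ask the finders to drop their cached listings before retrying.
importlib.invalidate_caches()
late_arrival = importlib.import_module("late_arrival")
print(late_arrival.ANSWER)
```

For a long-running process on a slow remote filesystem this is the trade-off to keep in mind: fewer stat calls, at the cost of having to invalidate explicitly when files change underneath you.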

Answered 2025-04-18 by Python大师
