dirhash Python package - PyPI

Python module and CLI for hashing of file system directories.

Detailed project description for dirhash

dirhash

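A lightweight Python module and tool for computing the hash of any directory based on its file structure and content.

Supports any hashing algorithm of Python's built-in hashlib module
.gitignore style "wildmatch" patterns for filtering which files to include/exclude
Multiprocessing for up to 6x speed-up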
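To make the include/exclude idea concrete, here is a rough sketch of pattern-based filtering. It is only an approximation under stated assumptions: it uses the standard library's fnmatch, whose `*` also matches `/`, so it does not reproduce full .gitignore "wildmatch" semantics, and it is not how dirhash itself is implemented. The helper names `is_supported` and `filter_paths` are hypothetical.

```python
import fnmatch
import hashlib


def is_supported(algorithm):
    """Return True if hashlib can construct this algorithm on this system."""
    return algorithm in hashlib.algorithms_available


def filter_paths(paths, match=("*",), ignore=()):
    """Keep paths that match any `match` pattern and no `ignore` pattern.

    Assumption: plain fnmatch globbing on the whole relative path,
    which is simpler than (and different from) real wildmatch rules.
    """
    kept = []
    for path in paths:
        included = any(fnmatch.fnmatchcase(path, pat) for pat in match)
        excluded = any(fnmatch.fnmatchcase(path, pat) for pat in ignore)
        if included and not excluded:
            kept.append(path)
    return kept


paths = ["a.py", "b.pyc", ".hidden", "pkg/c.py"]
only_python = filter_paths(paths, match=["*.py"])
without_hidden_or_pyc = filter_paths(paths, ignore=[".*", "*.pyc"])
```

Directory patterns such as `.*/` are not handled by this sketch; a real wildmatch implementation distinguishes files from directories.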

Installation

git clone git@github.com:andhus/dirhash.git
pip install dirhash/

Usage

Python module:

from dirhash import dirhash

dirpath = 'path/to/directory'
dir_md5 = dirhash(dirpath, 'md5')
filtered_sha1 = dirhash(dirpath, 'sha1', ignore=['.*', '.*/', '*.pyc'])
pyfiles_sha3_512 = dirhash(dirpath, 'sha3_512', match=['*.py'])

CLI:

dirhash path/to/directory -a md5
dirhash path/to/directory -a sha1 -i ".*  .*/  *.pyc"
dirhash path/to/directory -a sha3_512 -m "*.py"

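Why?

If you (or your application) need to verify the integrity of a set of files as well as their names and locations, you might find this useful. Use cases range from verifying your image classification dataset (before spending GPU-$$ on training your fancy deep learning model) to validating generated files in regression testing.

There isn't really a standard way of doing this. There are plenty of recipes out there (see for example the Stack Overflow questions for Linux and Python), but I couldn't find one that is properly tested (there are some gotchas to cover!) and documented with a compelling user interface. dirhash was created with this as the goal.

checksumdir is another Python module/tool with similar intent (it inspired this project), but it lacks much of the functionality offered here (most notably, including file names/structure in the hash) and lacks tests.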
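The point about including file names and structure in the hash can be illustrated with a minimal sketch. This is not dirhash's actual algorithm; `naive_dirhash` and `file_digest` are hypothetical helpers that simply combine each file's relative path with its content digest, so that renaming or moving a file changes the result. Empty directories and symlinks are deliberately not handled.

```python
import hashlib
import os


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash one file's content, reading it in chunks."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def naive_dirhash(root, algorithm="sha256"):
    """Hash a directory from (relative path, content digest) pairs.

    Assumption: an illustrative scheme only, not the format dirhash uses.
    """
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append((rel, file_digest(full, algorithm)))
    entries.sort()  # make the result independent of traversal order

    hasher = hashlib.new(algorithm)
    for rel, digest in entries:
        hasher.update(f"{rel}\0{digest}\n".encode("utf-8"))
    return hasher.hexdigest()
```

With this scheme, two directories with identical files at identical relative paths produce the same value, while renaming a single file produces a different one. A hash built only from file contents would not detect the rename.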

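Performance

The Python hashlib implementations of common hashing algorithms are highly optimized. dirhash mainly parses the file tree, pipes data to hashlib and combines the output. Reasonable measures have been taken to minimize the overhead, and for common use cases the majority of the time is spent reading data from disk and executing hashlib code.

The main effort to boost performance is support for multiprocessing, where reading and hashing are parallelized over individual files.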
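The idea of parallelizing over individual files can be sketched as follows. This is a simplified illustration, not dirhash's internal code: `hash_files` and `file_digest` are hypothetical names, and the `executor_cls` parameter is an assumption added so the same function can be run with processes (the default) or with threads.

```python
import hashlib
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Read one file in chunks and return its hex digest."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def hash_files(paths, algorithm="sha256", workers=8,
               executor_cls=ProcessPoolExecutor):
    """Hash many files concurrently; each worker handles whole files."""
    paths = sorted(paths)
    # partial() of a module-level function can be pickled for worker processes.
    job = partial(file_digest, algorithm=algorithm)
    with executor_cls(max_workers=workers) as pool:
        digests = list(pool.map(job, paths))  # map preserves input order
    return dict(zip(paths, digests))
```

When using the process-based default on platforms that start workers by spawning a fresh interpreter, call `hash_files` from under an `if __name__ == "__main__":` guard. The per-file digests can then be combined into a single directory hash in a deterministic order.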

As a reference, let's compare the performance of the dirhash CLI with the shell command:

find path/to/folder -type f -print0 | sort -z | xargs -0 md5 | md5

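This is the top answer to the Stack Overflow question "Linux: compute a single hash for a given folder & contents?". Results for two test cases are shown below. Both contain 1 GiB of random data: in "flat_1k_1MB" it is split into 1k files (1 MiB each) in a flat structure, and in "nested_32k_32kB" it is split into 32k files (32 KiB each) spread over the 256 leaf directories of a binary tree of depth 8.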

Implementation	Test Case	Time (s)	Speed-up
shell reference	flat_1k_1MB	2.29	-> 1.0
dirhash	flat_1k_1MB	1.67	1.36
dirhash (8 workers)	flat_1k_1MB	0.48	4.73
shell reference	nested_32k_32kB	6.82	-> 1.0
dirhash	nested_32k_32kB	3.43	2.00
dirhash (8 workers)	nested_32k_32kB	1.14	6.00
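The speed-up column appears to be the shell reference time divided by each implementation's time for the same test case. The short sketch below recomputes it from the rounded times in the table; because the inputs are rounded, the results can differ slightly in the last digit from the listed values. The dictionary layout and the `speed_ups` helper are assumptions for illustration.

```python
# Times in seconds, copied from the table above.
timings = {
    "flat_1k_1MB": {
        "shell reference": 2.29,
        "dirhash": 1.67,
        "dirhash (8 workers)": 0.48,
    },
    "nested_32k_32kB": {
        "shell reference": 6.82,
        "dirhash": 3.43,
        "dirhash (8 workers)": 1.14,
    },
}


def speed_ups(case_timings, baseline="shell reference"):
    """Return baseline time divided by each implementation's time."""
    base = case_timings[baseline]
    return {impl: base / seconds for impl, seconds in case_timings.items()}


for case, case_timings in timings.items():
    for impl, factor in speed_ups(case_timings).items():
        print(f"{case:16s} {impl:20s} {factor:.2f}")
```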

The benchmark was run on a MacBook Pro (2018); further details and source code are available here.

Documentation

Please refer to dirhash -h and the Python source code.


dirhash 0.1.1
