我如何估计文件的压缩性而不压缩它？

3条回答

网友

1楼 · 编辑于 2024-06-08 18:17:18

我想你要找的是How to calculate the entropy of a file?

这些问题包含了计算文件熵的各种方法（通过这些方法你可以得到文件的“可压缩性”）。这里引用了this文章摘要（熵与测试数据压缩 IEEE成员Kedarnath J.Balakrishnan和IEEE高级成员Nur A.Touba）：

The entropy of a set of data is a measure of the amount of information contained in it. Entropy calculations for fully specified data have been used to get a theoretical bound on how much that data can be compressed. This paper extends the concept of entropy for incompletely specified test data (i.e., that has unspecified or don't care bits) and explores the use of entropy to show how bounds on the maximum amount of compression for a particular symbol partitioning can be calculated. The impact of different ways of partitioning the test data into symbols on entropy is studied. For a class of partitions that use fixed-length symbols, a greedy algorithm for specifying the don't cares to reduce entropy is described. It is shown to be equivalent to the minimum entropy set cover problem and thus is within an additive constant error with respect to the minimum entropy possible among all ways of specifying the don't cares. A polynomial time algorithm that can be used to approximate the calculation of entropy is described. Different test data compression techniques proposed in the literature are analyzed with respect to the entropy bounds. The limitations and advantages of certain types of test data encoding strategies are studied using entropy theory

更具建设性的是，checkoutthissite for python实现数据块的熵计算

网友

2楼 · 编辑于 2024-06-08 18:17:18

只要距离文件的中间10公里就可以了。您不需要开始或结束，因为它们可能包含不代表文件其余部分的头或尾信息。10K足够在任何典型算法中获得一定的压缩量。这将预测整个文件的相对压缩量，中间的10K具有代表性。您得到的绝对比率与整个文件的绝对比率不一样，但它与不压缩的不同量将允许您设置一个阈值。只需对许多文件进行试验，看看在哪里设置阈值。在

如前所述，您可以通过对明显已被压缩的文件（例如.png..）不做任何操作来节省时间。.zip.pdf.jp.g

测量熵不一定是个好指标，因为它只给出压缩性的零阶估计。如果熵表明它足够可压缩，那么它是正确的。如果熵表明它的可压缩性不够，那么它可能是正确的，也可能是不正确的。你的实际压缩机是一个更好的估计压缩性。在1K上运行不会花太长时间。在

网友

3楼 · 编辑于 2024-06-08 18:17:18

压缩文件通常不能很好地压缩。这意味着几乎任何媒体文件都不能很好地压缩，因为大多数媒体格式已经包含压缩。显然，这也有例外，比如BMP和TIFF图像，但是您可以构建一个经过良好压缩的文件类型的白名单（png、MPEGs，以及远离visualmedia-gzip、bzip2等）来跳过，然后假设您遇到的其他文件将被很好地压缩。在

如果您想变得花哨，可以在系统中构建反馈（观察您所做的任何压缩的结果，并将结果比率与文件类型相关联）。如果遇到压缩性能一直很差的文件类型，可以将其添加到白名单中。在

这些想法依赖于能够识别文件的类型，但是有一些标准的实用程序可以很好地完成这项工作（通常比80%好得多）——file（1），/etc/mime.类型等等

相关问题更多 >

编程相关推荐

热门问题

热门文章