<p>我将用工作代码充实马克·阿德勒所说的话。他说的每一句话都是对的:-)要点:</p>
<ol>
<li>不能使用浮点权重或任何其他丢失权重信息的方案。使用整数。简单而正确。例如,如果你有3位数的浮动概率,通过<code>int(round(the_probability * 1000))</code>将每一个概率转换成一个整数,那么也许可以调整它们以确保总和正好是1000。在</li>
<li><code>heapq</code>堆不是“稳定的”:如果多个项具有相同的最小权重,则不会定义弹出哪个项。在</li>
<li>所以你在建树的时候就不能得到你想要的东西。在</li>
<li>“规范哈夫曼密码”的一个小变种似乎是你想要的。为此构建一棵树是一个冗长的过程,但每一步都足够简单。构建的第一棵树被丢弃:从中获取的唯一信息是分配给每个符号的代码的长度。在</li>
</ol>
<p>运行中:</p>
<pre><code>syms = ['a','b','c','d','e']
weights = [30, 25, 20, 15, 10]
t = encode(syms, weights)
print t
</code></pre>
<p>打印此文件(格式化以便于阅读):</p>
^{pr2}$
<p>据我所知,这正是你想要的。如果不是,就抱怨;-)</p>
<p><strong>编辑</strong>:在规范代码的赋值中有一个错误,除非权重非常不同,否则不会显示出来;修复了它。在</p>
<pre><code>class Node(object):
def __init__(self, data=None, weight=None,
left=None, right=None,
code=None):
self.data = data
self.weight = weight
self.left = left
self.right = right
self.code = code
def is_symbol(self):
return self.left is self.right is None
def __repr__(self):
return "[%s,%s,(%s),(%s)]" % (self.data,
self.code,
self.left,
self.right)
def __cmp__(self, a):
return cmp(self.weight, a.weight)
def encode(syms, weights):
from heapq import heapify, heappush, heappop
tree = [Node(data=s, weight=w)
for s, w in zip(syms, weights)]
sym2node = {s.data: s for s in tree}
heapify(tree)
while len(tree) > 1:
a, b = heappop(tree), heappop(tree)
heappush(tree, Node(weight=a.weight + b.weight,
left=a, right=b))
# Compute code lengths for the canonical coding.
sym2len = {}
def assign_codelen(node, codelen):
if node is not None:
if node.is_symbol():
sym2len[node.data] = codelen
else:
assign_codelen(node.left, codelen + 1)
assign_codelen(node.right, codelen + 1)
assign_codelen(tree[0], 0)
# Create canonical codes, but with a twist: instead
# of ordering symbols alphabetically, order them by
# their position in the `syms` list.
# Construct a list of (codelen, index, symbol) triples.
# `index` breaks ties so that symbols with the same
# code length retain their original ordering.
triples = [(sym2len[name], i, name)
for i, name in enumerate(syms)]
code = oldcodelen = 0
for codelen, _, name in sorted(triples):
if codelen > oldcodelen:
code <<= (codelen - oldcodelen)
sym2node[name].code = format(code, "0%db" % codelen)
code += 1
oldcodelen = codelen
# Create a tree corresponding to the new codes.
tree = Node(code="")
dir2attr = {"0": "left", "1": "right"}
for snode in sym2node.values():
scode = snode.code
codesofar = ""
parent = tree
# Walk the tree creating any needed interior nodes.
for d in scode:
assert parent is not None
codesofar += d
attr = dir2attr[d]
child = getattr(parent, attr)
if codesofar == scode:
# We're at the leaf position.
assert child is None
setattr(parent, attr, snode)
elif child is not None:
assert child.code == codesofar
else:
child = Node(code=codesofar)
setattr(parent, attr, child)
parent = child
# Finally, paste the `data` attributes together up
# the tree. Why? Don't know ;-)
def paste(node):
if node is None:
return ""
elif node.is_symbol():
return node.data
else:
result = paste(node.left) + paste(node.right)
node.data = result
return result
paste(tree)
return tree
</code></pre>
<h2>重复符号</h2>
<blockquote>
<p>Could I swap the sym2node dict to an ordereddict to deal with
repeated 'a'/'b's etc?</p>
</blockquote>
<p>不,有两个原因:</p>
<ol>
<li>没有映射类型支持重复键;并且</li>
<li>“重复符号”的概念对哈夫曼编码毫无意义。在</li>
</ol>
<p>所以,如果你下定决心要追求这个目标,首先你必须确保符号是唯一的。只需在函数的开头添加以下行:</p>
<pre><code> syms = list(enumerate(syms))
</code></pre>
<p>例如,如果传入的<code>syms</code>是:</p>
<pre><code>['a', 'b', 'a']
</code></pre>
<p>将更改为:</p>
<pre><code>[(0, 'a'), (1, 'b'), (2, 'a')]
</code></pre>
<p>现在所有的符号都是2元组,而且显然是唯一的,因为每个符号都以唯一的整数开头。<em>算法关心的唯一问题是符号可以用作dict键;它不关心它们是字符串、元组还是任何其他支持等式测试的散列类型。在</p>
<p>所以算法中没有什么需要改变的。但在结束之前,我们要恢复原来的符号。只需在<code>paste()</code>函数之前插入:</p>
<pre><code> def restore_syms(node):
if node is None:
return
elif node.is_symbol():
node.data = node.data[1]
else:
restore_syms(node.left)
restore_syms(node.right)
restore_syms(tree)
</code></pre>
<p>它只需遍历树并从符号的<code>.data</code>成员中去掉前导整数。或者,也许更简单,只需迭代<code>sym2node.values()</code>,并转换每个成员的<code>.data</code>成员。在</p>