将列表映射到Huffman树，同时保留相对ord问题的回答

将列表映射到Huffman树，同时保留相对ord

回答此问题可获得 20 贡献值，回答如果被采纳可获得 50 分。

我对Huffman树上的搜索算法有一个问题：对于给定的概率分布，我需要Huffman树是相同的，而不考虑输入数据的排列。在 下面是一张正在发生的事情与我想要的结果的对比图： <img src="https://i.stack.imgur.com/t4G7u.jpg" alt="Expectation vs reality"/> 基本上，我想知道是否可以保留从列表到树的项目的相对顺序。如果没有，为什么会这样？在 作为参考，我使用Huffman树根据概率的划分生成子组，以便我可以运行下面的search（）过程。请注意，merge（）子例程中的数据与权重一起组合。码字本身没有树那么重要（树应该保持相对顺序）。在 例如，如果我生成以下哈夫曼代码： <pre><code>probabilities = [0.30, 0.25, 0.20, 0.15, 0.10] items = ['a','b','c','d','e'] items = zip(items, probabilities) t = encode(items) d,l = hi.search(t) print(d) </code></pre> 使用以下类： ^{pr2}$ 我得到： <pre><code>'a'->11 'b'->01 'c'->00 'd'->101 'e'->100 </code></pre> 然而，我在搜索算法中做的一个假设是，更多可能的项目会被推到左边：也就是说，我需要'a'来包含'00'码字-不管'abcde'序列的任何排列，都应该是这样。输出示例如下： <pre><code>codewords = {'a':'00', 'b':'01', 'c':'10', 'd':'110', 'e':111'} </code></pre> （注意，即使“c”的代码字是“d”的后缀，这也没问题）。在 为了完整起见，以下是搜索算法： <pre><code>def search(tree): print(tree) pdb.set_trace() current = tree.left other = tree.right loops = 0 while current: loops+=1 print(current) if current.data != 0 and current is not None and other is not None: previous = current current = current.left other = previous.right else: previous = other current = other.left other = other.right return previous, loops </code></pre> 它的工作原理是在一组0和1中搜索“最左边的”1—哈夫曼树必须将更多可能的项目放在左边。例如，如果我使用上述概率和输入： <pre><code>items = [1,0,1,0,0] </code></pre> 那么算法返回的项的索引是2—这不是应该返回的值（0应该，因为它是最左边的）。在

0 条评论
分类：Python问答

默认排序时间排序

1 个回答

匿名 1天前

　擅长：python、mysql、java

我将用工作代码充实马克·阿德勒所说的话。他说的每一句话都是对的：-）要点： <ol> <li>不能使用浮点权重或任何其他丢失权重信息的方案。使用整数。简单而正确。例如，如果你有3位数的浮动概率，通过<code>int(round(the_probability * 1000))</code>将每一个概率转换成一个整数，那么也许可以调整它们以确保总和正好是1000。在</li> <li><code>heapq</code>堆不是“稳定的”：如果多个项具有相同的最小权重，则不会定义弹出哪个项。在</li> <li>所以你在建树的时候就不能得到你想要的东西。在</li> <li>“规范哈夫曼密码”的一个小变种似乎是你想要的。为此构建一棵树是一个冗长的过程，但每一步都足够简单。构建的第一棵树被丢弃：从中获取的唯一信息是分配给每个符号的代码的长度。在</li> </ol> 运行中： <pre><code>syms = ['a','b','c','d','e'] weights = [30, 25, 20, 15, 10] t = encode(syms, weights) print t </code></pre> 打印此文件（格式化以便于阅读）： ^{pr2}$ 据我所知，这正是你想要的。如果不是，就抱怨；-） 编辑：在规范代码的赋值中有一个错误，除非权重非常不同，否则不会显示出来；修复了它。在 <pre><code>class Node(object): def __init__(self, data=None, weight=None, left=None, right=None, code=None): self.data = data self.weight = weight self.left = left self.right = right self.code = code def is_symbol(self): return self.left is self.right is None def __repr__(self): return "[%s,%s,(%s),(%s)]" % (self.data, self.code, self.left, self.right) def __cmp__(self, a): return cmp(self.weight, a.weight) def encode(syms, weights): from heapq import heapify, heappush, heappop tree = [Node(data=s, weight=w) for s, w in zip(syms, weights)] sym2node = {s.data: s for s in tree} heapify(tree) while len(tree) > 1: a, b = heappop(tree), heappop(tree) heappush(tree, Node(weight=a.weight + b.weight, left=a, right=b)) # Compute code lengths for the canonical coding. sym2len = {} def assign_codelen(node, codelen): if node is not None: if node.is_symbol(): sym2len[node.data] = codelen else: assign_codelen(node.left, codelen + 1) assign_codelen(node.right, codelen + 1) assign_codelen(tree[0], 0) # Create canonical codes, but with a twist: instead # of ordering symbols alphabetically, order them by # their position in the `syms` list. # Construct a list of (codelen, index, symbol) triples. # `index` breaks ties so that symbols with the same # code length retain their original ordering. triples = [(sym2len[name], i, name) for i, name in enumerate(syms)] code = oldcodelen = 0 for codelen, _, name in sorted(triples): if codelen > oldcodelen: code <<= (codelen - oldcodelen) sym2node[name].code = format(code, "0%db" % codelen) code += 1 oldcodelen = codelen # Create a tree corresponding to the new codes. tree = Node(code="") dir2attr = {"0": "left", "1": "right"} for snode in sym2node.values(): scode = snode.code codesofar = "" parent = tree # Walk the tree creating any needed interior nodes. for d in scode: assert parent is not None codesofar += d attr = dir2attr[d] child = getattr(parent, attr) if codesofar == scode: # We're at the leaf position. assert child is None setattr(parent, attr, snode) elif child is not None: assert child.code == codesofar else: child = Node(code=codesofar) setattr(parent, attr, child) parent = child # Finally, paste the `data` attributes together up # the tree. Why? Don't know ;-) def paste(node): if node is None: return "" elif node.is_symbol(): return node.data else: result = paste(node.left) + paste(node.right) node.data = result return result paste(tree) return tree </code></pre> <h2>重复符号</h2> <blockquote> Could I swap the sym2node dict to an ordereddict to deal with repeated 'a'/'b's etc? </blockquote> 不，有两个原因： <ol> <li>没有映射类型支持重复键；并且</li> <li>“重复符号”的概念对哈夫曼编码毫无意义。在</li> </ol> 所以，如果你下定决心要追求这个目标，首先你必须确保符号是唯一的。只需在函数的开头添加以下行： <pre><code> syms = list(enumerate(syms)) </code></pre> 例如，如果传入的<code>syms</code>是： <pre><code>['a', 'b', 'a'] </code></pre> 将更改为： <pre><code>[(0, 'a'), (1, 'b'), (2, 'a')] </code></pre> 现在所有的符号都是2元组，而且显然是唯一的，因为每个符号都以唯一的整数开头。算法关心的唯一问题是符号可以用作dict键；它不关心它们是字符串、元组还是任何其他支持等式测试的散列类型。在 所以算法中没有什么需要改变的。但在结束之前，我们要恢复原来的符号。只需在<code>paste()</code>函数之前插入： <pre><code> def restore_syms(node): if node is None: return elif node.is_symbol(): node.data = node.data[1] else: restore_syms(node.left) restore_syms(node.right) restore_syms(tree) </code></pre> 它只需遍历树并从符号的<code>.data</code>成员中去掉前导整数。或者，也许更简单，只需迭代<code>sym2node.values()</code>，并转换每个成员的<code>.data</code>成员。在

将列表映射到Huffman树，同时保留相对ord

1 个回答

相关Python问题