需要帮助解决词汇打包算法问题

17 投票

4 回答

812 浏览

提问于 2025-04-16 02:53

我有一个字母的子列表，每个子列表里的字母数量可以不同。这些列表和子列表是有顺序的。这个结构可以用来生成单词，方法是选择一个数字X，从每个子列表的第X个位置取一个字母，然后按顺序把它们连起来。如果X的值超过了子列表的长度，就会从头开始取。

我有一组单词，需要找到一种方法，把它们放进一个尽可能小的这种结构里（也就是子列表尽量短）。当然，子列表的数量要和最长单词的字母数相等，而短一些的单词则需要用空格来填充。

我不是计算机专业毕业的，所以如果我描述的问题不太清楚，我先说声抱歉。给个简单的例子：假设我有这些单词 [ 'a ', 'an', 'if', 'is', 'in', 'on', 'of', 'i '] 我需要把它们放进去，我可以用下面的结构：

[  
    [ 'i', 'o', 'a' ],  
    [ 's', 'n', 'f', ' ' ]  
]

这样我就能生成以下单词：

0: is  
1: on  
2: af*  
3: i  
4: os*  
5: an  
6: if  
7: o *  
8: as*  
9: in  
10: of  
11: a

比如说，如果你取位置10，单词'of'是通过从第一个子列表取第10 % 3 (= 1)个字母，再从第二个子列表取第10 % 4 (= 2)个字母拼接而成的。

到目前为止，我最好的尝试是使用汉明距离的矩阵，先放置最“连接”的单词，然后是它们最近的邻居，目标是每次插入时尽量减少变化。这完全是凭直觉的尝试，我觉得应该有更好、更聪明的方法来解决这个问题。

澄清

这是我正在尝试解决的一个实际问题，约束条件大致如下：
1. 每个子列表的字符数量应该在100个左右，或者更少。
2. 密钥空间应该尽可能小（也就是说，虚假单词的数量要尽量少）。大致来说，千万级别的密钥空间是个边界。

我不知道是否真的能找到一个好的解决方案。比如说，按照我现在的算法，我可以在150万选项的密钥空间中插入大约200个单词（只是随机的英文单词）。我希望能做得更好。

数据结构算法优化单词生成组合数学词汇打包字母子列表汉明距离密钥空间

4 个回答

我不太确定我完全理解你的问题，不过我之前碰到过一个叫prezip的东西。prezip是一种压缩已经排好序的单词集合的方法，它利用了很多单词有共同前缀的这个特点。

因为你没有提到任何排序的限制，我建议你先创建一个你想要的排好序的单词集合。然后做一些类似prezip的事情。这样你就能得到一个压缩过的、已经排序的单词集合，你可以通过索引来访问它。

回答于 2025-04-16 由 Python大师

分享举报

我们可以在Nikita Rybek的回答基础上进行改进，不必让列表的长度是质数，而是把一个质数和这个列表关联起来。这样，我们就可以避免让子列表变得过长，从而保持质数较小，这意味着可以减少密钥空间，打包效率也更高。使用这种方法和下面的代码，我把一个包含58,110（小写）单词的列表压缩到了464个字符。有趣的是，只有字母'alex'在单词中作为第21个字母出现。密钥空间大约有33位数字，不过其实并不一定非得用质数，相关的数字只需要互质就可以。这部分可能还可以进一步减少。

import itertools
import operator
import math

# lifted from Alex Martelli's post at http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python
def erat2( ):
    D = {  }
    yield 2
    for q in itertools.islice(itertools.count(3), 0, None, 2):
        p = D.pop(q, None)
        if p is None:
            D[q*q] = q
            yield q
        else:
            x = p + q
            while x in D or not (x&1):
                x += p
            D[x] = p


# taken from http://en.literateprograms.org/Extended_Euclidean_algorithm_(Python)
def eea(u, v):
    u1 = 1; u2 = 0; u3 = u
    v1 = 0; v2 = 1; v3 = v
    while v3 != 0:
        q = u3 / v3
        t1 = u1 - q * v1
        t2 = u2 - q * v2
        t3 = u3 - q * v3
        u1 = v1; u2 = v2; u3 = v3
        v1 = t1; v2 = t2; v3 = t3
    return u1, u2, u3

def assign_moduli(lists):
    used_primes = set([])
    unused_primes = set([])
    moduli = [0]*len(lists)
    list_lens = [len(lst) for lst in lists]
    for i, length in enumerate(list_lens):
        for p in erat2():
            if length <= p and p not in used_primes:
                used_primes.add(p)
                moduli[i] = p
                break
            elif p not in used_primes:
                unused_primes.add(p)
    return moduli



class WordEncoder(object):
    def __init__(self):
        self.lists = [[]] # the list of primedlists
        self.words = {} # keys are words, values are number that retrieves word
        self.moduli = [] # coprime moduli that are used to assign unique keys to words

    def add(self, new_words):
        added_letter = False # flag that we need to rebuild the keys
        for word in new_words:
            word = word.rstrip() # a trailing blank space could hide the need for a key rebuild
            word_length, lists_length = len(word), len(self.lists)
            # make sure we have enough lists
            if word_length > lists_length:
                self.lists.extend([' '] for i in xrange(word_length - lists_length))
            # make sure that each letter is in the appropriate list
            for i, c in enumerate(word):
                if c in self.lists[i]: continue
                self.lists[i].append(c)
                added_letter = True
            self.words[word] = None
        # now we recalculate all of the keys if necessary
        if not added_letter:
            return self.words
        else:
            self._calculate_keys()

    def _calculate_keys(self):
        # were going to be solving a lot of systems of congruences
        # these are all of the form x % self.lists[i].modulus == self.lists[i].index(word[i]) with word padded out to 
        # len(self.lists). We will be using the Chinese Remainder Theorem to do this. We can do a lot of the calculations
        # once before we enter the loop because the numbers that we need are related to self.lists[i].modulus and not
        # the indexes of the necessary letters
        self.moduli = assign_moduli(self.lists)
        N  = reduce(operator.mul, self.moduli)
        e_lst = []
        for n in self.moduli:
             r, s, dummy = eea(n, N/n)
             e_lst.append(s * N / n)
        lists_len = len(self.lists)
        #now we begin the actual recalculation 
        for word in self.words:
             word += ' ' * (lists_len - len(word))
             coords = [self.lists[i].index(c) for i,c in enumerate(word)]
             key = sum(a*e for a,e in zip(coords, e_lst)) % N  # this solves the system of congruences
             self.words[word.rstrip()] = key

class WordDecoder(object):
    def __init__(self, lists):
       self.lists = lists
       self.moduli = assign_moduli(lists)

    def decode(self, key):
        coords = [key % modulus for modulus in self.moduli]
        return ''.join(pl[i] for pl, i in zip(self.lists, coords))    


with open('/home/aaron/code/scratch/corncob_lowercase.txt') as f:
    wordlist = f.read().split()

encoder = WordEncoder()
encoder.add(wordlist)

decoder = WordDecoder(encoder.lists)

for word, key in encoder.words.iteritems():
    decoded = decoder.decode(key).rstrip()
    if word != decoded:
        print word, decoded, key
        print "max key size: {0}. moduli: {1}".format(reduce(operator.mul, encoder.moduli), encoder.moduli)
        break
else:
    print "it works"
    print "max key size: {0}".format(reduce(operator.mul, encoder.moduli))
    print "moduli: {0}".format(encoder.moduli)
    for i, l in enumerate(encoder.lists):
        print "list {0} length: {1}, {2} - \"{3}\"".format(i, len(l), encoder.moduli[i] - len(l), ''.join(sorted(l)))
    print "{0} words stored in {1} charachters".format(len(encoder.words), sum(len(l) for l in encoder.lists))

回答于 2025-04-16 由 Python大师

分享举报

好吧，你说你对次优解感兴趣，那我就给你一个。这个解法取决于字母表的大小。比如，对于26个字母的情况，数组的大小会稍微超过100（和要编码的单词数量无关）。

大家都知道，如果你有两个不同的质数 a 和 b，以及非负整数 k 和 l（k < a，l < b），你可以找到一个数字 n，使得 n % a == k 和 n % b == l。
举个例子，假设 a = 7, b = 13, k = 6, l = 3，你可以用 n = 7 * 13 + 7 * 3 + 13 * 6 来计算。这样 n % 7 == 6 和 n % 13 == 3。

这个方法对任何数量的质数都适用。

你可以这样初始化数组。

['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 29
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 31
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 37
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 41
...

现在，假设你的单词是'geek'。为了这个单词，你需要一个数字X，使得 X % 29 == 6，X % 31 == 4，X % 37 == 4，X % 41 == 10。而且你总是能找到这样的X，正如上面所示。

所以，如果你的字母表有26个字母，你可以创建一个宽度为149的矩阵（见质数列表），用它来编码任何单词。

回答于 2025-04-16 由 Python大师

分享举报

需要帮助解决词汇打包算法问题

澄清

4 个回答

撰写回答