Is there a better algorithm for Exercise 9.3 in "Think Python: How to Think Like a Computer Scientist"?
Exercise 9.3 in the book asks the reader to find a combination of 5 forbidden letters that excludes as few words as possible; the word list comes from this file.
Below is my solution to the first part, which I think is fine.
# if the word contains any letter in letters, return True,
# otherwise return False
def contain(word, letters):
    for letter in letters:
        if letter in word:
            return True
    return False
# return the number of words that contain any letter in letters
def ncont(words, letters):
    count = 0
    for word in words:
        if contain(word, letters):
            count += 1
    return count
But for the question above, I can only come up with a brute-force algorithm, i.e. trying every possible combination. There are actually 26! / (21! * 5!) = 65780 such combinations. Here is the implementation:
def get_lset(nlt, alphabet, cur_set):
    global min_n, min_set
    # when we have enough letters
    if nlt <= 0:
        cur_n = ncont(words, ''.join(cur_set))
        if min_n == -1 or cur_n < min_n:
            min_n = cur_n
            min_set = cur_set.copy()
            print(''.join(cur_set), cur_n, ' *->', min_n, ''.join(min_set))
    # otherwise find the result letters in a recursive way
    else:
        cur_set.append(None)
        for i in range(len(alphabet)):
            cur_set[-1] = alphabet[i]
            get_lset(nlt-1, alphabet[i+1:], cur_set)
        cur_set.pop()
Then call the function above like this:
import string

if __name__ == '__main__':
    min_n = -1
    min_set = []
    with open('words.txt', 'r') as fin:
        words = [line.strip() for line in fin]
    get_lset(5, list(string.ascii_lowercase), [])
    print(min_set, min_n)
However, this solution is very slow. Is there a better algorithm for this problem? Any suggestions are welcome!
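As a quick sanity check of the combination count mentioned above, a small sketch (assuming Python 3.8+, where math.comb is available):

import math
# number of ways to choose 5 letters out of 26: 26! / (21! * 5!)
print(math.comb(26, 5))  # prints 65780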
4 Answers
0 votes
I think I have a faster solution. Here is the code, with comments...
import itertools
import string
import timeit
if __name__ == '__main__':
    # Start timestamp
    start_ts = timeit.default_timer()
    #
    # Small function to calculate the factorial of a number
    # Used in debugging
    #
    # Math: the number of unique combinations of x elements from y elements is calculated as
    # y! / (y - x)! / x!
    #
    # Or, in 'school' notation
    #
    #        y!
    #   _____________
    #   (y - x)! . x!
    #
    fac = lambda num: 1 if num <= 1 else num * fac(num - 1)
    #
    # Open the file and read the content in memory as a list of strings
    #
    with open("words.txt", "r") as file:
        words = file.readlines()
    #
    # Create a dictionary containing the 26 letters of the English alphabet
    # For each of the letters, set the number of times the letter appears to 0
    #
    # I prefer to initialize this here instead of dynamically adding them to the dictionary later,
    # as normally this text file will contain all letters and having to check if the element exists will take longer
    #
    appearances = {}
    for letter in string.ascii_lowercase:
        appearances[letter] = 0
    #
    # For each of the words, count each of its unique letters into appearances
    # If a letter appears twice or even more, it does not matter. We count the words that contain the letter
    # at least once. For our letter set, it does not matter whether the letter appears once or more
    #
    for word in words:
        for letter in list(set(word.strip().lower())):
            appearances[letter] += 1
    # Debug: you will see Q has the least appearances, E has the most
    print(appearances)
    #
    # Let's sort this. It's key to this algorithm
    #
    # In short:
    #
    # Suppose we only have 5 letters, A to E
    # Suppose we have counted our appearances and this is how many times they show up
    # A : 10
    # B : 5
    # C : 3
    # D : 7
    # E : 12
    #
    # Sorted:
    # C : 3, B : 5, D : 7, A : 10, E : 12
    #
    # Suppose we need combinations of only 2 letters
    # Take C + B
    # In the worst case, you have in total 8 words that contain either C or B. This is the case where no words have both.
    # In the best case, you have 5. This is the case where 3 words contain both B and C, and 2 words contain only B
    #
    # Given the above, it makes no sense to check any combination with A or E
    # You know they appear in either 10 or 12 words. They can't beat B+C in number of appearances
    # So don't include them in the combinations. This will significantly lower the number of combinations
    #
    # Given the above, you must include D, as you don't know how many words have either B or C (between 5 and 8)
    #
    # On words.txt, this approach resulted in only 252 combinations to check. So with "brute force", you only
    # needed 252 iterations over the possible combinations of 5 characters. You can verify with the debug code
    #
    #
    #
    # appearances_sorted is a list, we can't calculate on it
    #
    appearances_sorted = sorted(appearances, key=lambda x: appearances[x])
    print(appearances_sorted)
    print(appearances_sorted[:5])
    #
    # Calculate the least amount possible. This is the sum of the 5 lowest appearances
    # As we are looping over the first 5, we already put them in our list of combinations to check
    #
    sum_least = 0
    appearances_least = {}
    for k in appearances_sorted[:5]:
        v = appearances[k]
        sum_least += v
        appearances_least[k] = v
    print(sum_least)
    print(appearances_least)
    #
    # For the rest of the sorted appearances, we add them, unless the appearance of the character by itself
    # is already higher than the sum we calculated
    #
    for k in appearances_sorted[5:]:
        if appearances[k] > sum_least:
            break
        appearances_least[k] = appearances[k]
    print(appearances_least)
    #
    # Debug code to check the math against the len of the combinations Python will provide
    #
    # f1 = fac(len(appearances_least))
    # f2 = fac(len(appearances_least) - 5)
    # f3 = fac(5)
    # print(f1 / f2 / f3)
    #
    #
    # Create all the possible combinations using itertools
    # One advantage is also that we can do this on a sorted list; the combinations with the smallest possible
    # appearances appear first. But as said, as we don't know the words that have multiple letters combined, we
    # cannot be sure we only need to check the first
    #
    combinations = list(itertools.combinations(appearances_least, 5))
    # This will print 252 on the words.txt file
    print(len(combinations))
    #
    # How many words in total do we have
    # This total will be used as a starting point to see how well a combination does
    # The worst combination possible will never be in more words than the file contains
    #
    total_words = len(words)
    min_found = total_words
    print(total_words)
    #
    # Just to avoid that PyCharm complains that best_combo might not be set later
    #
    best_combo = combinations[0]
    #
    # Loop over all the combos we have, as we cannot be sure about the words that have multiple letters
    # When we calculated the appearances, we were counting only per letter
    #
    for combo in combinations:
        count_matches = 0
        #
        # Loop over the words, then over the letters in the combo
        # If one of the letters is found, add to the counter and stop the loop, as it does not matter if other characters
        # of the combo also appear. One is enough to count it.
        #
        #
        for word in words:
            for letter in combo:
                if letter in word:
                    count_matches += 1
                    break
            #
            # If we have already found more words than the minimum we have detected so far, we can stop the loop. This
            # combo will not be better, it will only get worse.
            #
            if count_matches > min_found:
                break
        #
        # If we found a better one, store it
        #
        if count_matches < min_found:
            best_combo = combo
            min_found = count_matches
    # End timestamp
    end_ts = timeit.default_timer()
    #
    # Print the results
    #
    print(best_combo)
    print(min_found)
    print(end_ts - start_ts)
    #
    # I have:
    #
    # ('q', 'j', 'x', 'z', 'w')
    # 17382
    # 4.387889001052827
    #
    # Enjoy !
0 votes
Here is my solution:
from string import ascii_letters

def smallest_set(filename):
    avoid_dict = dict.fromkeys(ascii_letters.lower(), 0)
    with open(filename) as file_handler:
        for line in file_handler:
            for key in avoid_dict:
                if key not in line:
                    avoid_dict[key] += 1
    avoid_stats_sorted = sorted(avoid_dict, key=avoid_dict.get,
                                reverse=True)
    return ''.join([item for item in avoid_stats_sorted[:5]])
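A minimal usage sketch, assuming the word list is saved as words.txt in the working directory. Note that this returns the five letters that are individually absent from the most words (equivalently, present in the fewest words), without accounting for overlap between their word sets:

if __name__ == '__main__':
    # prints the five letters that each appear in the fewest words
    print(smallest_set('words.txt'))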
0 votes
Here is my idea:
First, compute something I'll call excluded[l]: a lookup, indexed by letter, from each letter to the set of words that would be excluded if that letter were banned.
Then find the five smallest of these 26 sets and take their union. That gives you a reasonable provisional minimum.
Next, instead of exploring all 5-letter combinations with itertools.combinations, write your own search. In it, build the union of the "excluded" sets incrementally: if after the first i letters (i < 5) the union is already larger than the provisional minimum, there is no need to consider any further letters. Whenever you find a 5-letter combination that beats the current provisional minimum, update it.
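A rough sketch of this idea, with my own names (best_five, excluded, search) and assuming words is the list loaded from words.txt as in the question:

import string

def best_five(words):
    # excluded[l]: the set of words that would be lost if letter l were forbidden
    excluded = {l: {w for w in words if l in w} for l in string.ascii_lowercase}
    # try letters with small "excluded" sets first so pruning kicks in early
    letters = sorted(excluded, key=lambda l: len(excluded[l]))
    # provisional minimum: the union of the five individually smallest sets
    best = set.union(*(excluded[l] for l in letters[:5]))
    best_combo = tuple(letters[:5])

    def search(start, chosen, union):
        nonlocal best, best_combo
        if len(union) >= len(best):
            return  # adding more letters can only grow the union, so prune
        if len(chosen) == 5:
            best, best_combo = union, tuple(chosen)
            return
        for i in range(start, len(letters)):
            search(i + 1, chosen + [letters[i]], union | excluded[letters[i]])

    search(0, [], set())
    return best_combo, len(best)

The running union can only grow as letters are added, so any partial combination whose union is already at least as large as the provisional minimum can be discarded along with everything that extends it.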
3 votes
First, let's write it a bit more concisely.
def contain(word, letters):
    return any(letter in word for letter in letters)
def ncont(words, letters):
    return sum(contain(word, letters) for word in words)
At the moment, the average complexity of your algorithm is:
O(len(letters) * len(a_word) * len(words))
  ------------+-------------  ----+-----
    contain(word, letters)   ncont(words, letters)
We can reduce this complexity by using a set:
def contain(word, letters):
    return not set(letters).isdisjoint(set(word))
This reduces the complexity to:
O(min(len(letters), len(a_word)) * len(words))
  --------------+---------------  ----+-----
      contain(word, letters)     ncont(words, letters)
per https://wiki.python.org/moin/TimeComplexity.
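A rough micro-benchmark sketch of the two versions of contain (the sample word and letters here are arbitrary; absolute numbers will vary by machine):

import timeit
setup = '''
letters = 'qjxzw'
word = 'apparatus'
def contain_any(word, letters):
    return any(letter in word for letter in letters)
def contain_set(word, letters):
    return not set(letters).isdisjoint(set(word))
'''
print(timeit.timeit('contain_any(word, letters)', setup=setup))
print(timeit.timeit('contain_set(word, letters)', setup=setup))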
As for the second part, using itertools makes the algorithm easier to follow:
import itertools
import string

def minimum_letter_set(words, n):
    attempts = itertools.combinations(string.ascii_lowercase, n)
    return min(attempts, key=lambda attempt: ncont(words, attempt))
However, we can do better:
def minimum_letter_set(words, n):
    # build a lookup table for each letter to the set of words it features in
    by_letter = {
        letter: {
            word
            for word in words
            if letter in word
        }
        for letter in string.ascii_lowercase
    }
    # allowing us to define a function that finds words that match multiple letters
    def matching_words(letters):
        return set.union(*(by_letter[l] for l in letters))
    # find all n letter combinations
    attempts = itertools.combinations(string.ascii_lowercase, n)
    # and return the one that matches the fewest words
    return min(attempts, key=lambda a: len(matching_words(a)))
I'm not sure this actually lowers the algorithmic complexity, but it does avoid repeatedly filtering the word list.
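Finally, a usage sketch under the same assumptions as the question (a words.txt file in the working directory, one word per line):

if __name__ == '__main__':
    with open('words.txt') as fin:
        words = [line.strip() for line in fin]
    # the 5-letter combination that excludes the fewest words
    print(minimum_letter_set(words, 5))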