Is there a better algorithm for Exercise 9.3 in "Think Python: How to Think Like a Computer Scientist"?
Exercise 9.3 in the book asks the reader to find a combination of 5 forbidden letters that excludes as few words as possible; the word list comes from this file.
Below is my solution to the first part, which I think is fine.
# if the word contains any letter in letters, return True,
# otherwise return False
def contain(word, letters):
    for letter in letters:
        if letter in word:
            return True
    return False
# return the number of words that contain any letter in letters
def ncont(words, letters):
    count = 0
    for word in words:
        if contain(word, letters):
            count += 1
    return count
But for the question above, I can only come up with a brute-force algorithm, i.e. trying every possible combination. There are actually 26! / (21! * 5!) = 65780 such combinations. Here is the implementation:
def get_lset(nlt, alphabet, cur_set):
    global min_n, min_set
    # when we have enough letters
    if nlt <= 0:
        cur_n = ncont(words, ''.join(cur_set))
        if min_n == -1 or cur_n < min_n:
            min_n = cur_n
            min_set = cur_set.copy()
            print(''.join(cur_set), cur_n, ' *->', min_n, ''.join(min_set))
    # otherwise find the result letters in a recursive way
    else:
        cur_set.append(None)
        for i in range(len(alphabet)):
            cur_set[-1] = alphabet[i]
            get_lset(nlt-1, alphabet[i+1:], cur_set)
        cur_set.pop()
Then call the function above like this:
import string

if __name__ == '__main__':
    min_n = -1
    min_set = []
    with open('words.txt', 'r') as fin:
        words = [line.strip() for line in fin]
    get_lset(5, list(string.ascii_lowercase), [])
    print(min_set, min_n)
However, this solution is very slow. Is there a better algorithm for this problem? Any suggestions are welcome!
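As a quick sanity check of the combination count mentioned above, a small sketch (assuming Python 3.8+, where math.comb is available):

import math
# number of ways to choose 5 letters out of 26: 26! / (21! * 5!)
print(math.comb(26, 5))  # prints 65780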
4 Answers
0 votes
I think I have a faster solution. Here is the code, with comments...
import itertools
import string
import timeit
if __name__ == '__main__':
    # Start timestamp
    start_ts = timeit.default_timer()
    #
    # Small function to calculate the factorial of a number
    # Used in debugging
    #
    # Math: the number of unique combinations of x elements from y elements is calculated as
    # y! / (y - x)! / x!
    #
    # Or, in 'school' notation
    #
    #        y!
    #   _____________
    #   (y - x)! . x!
    #
    fac = lambda num: 1 if num <= 1 else num * fac(num - 1)
    #
    # Open the file and read the content in memory as a list of strings
    #
    with open("words.txt", "r") as file:
        words = file.readlines()
    #
    # Create a dictionary containing the 26 letters of the English alphabet
    # For each of the letters, set the number of times the letter appears to 0
    #
    # I prefer to initialize this here instead of dynamically adding them to the dictionary later,
    # as normally this text file will contain all letters and having to check if the element exists will take longer
    #
    appearances = {}
    for letter in string.ascii_lowercase:
        appearances[letter] = 0
    #
    # For each of the words, count each of its unique letters into appearances
    # If a letter appears twice or even more, it does not matter. We count the words that contain the letter
    # at least once. For our letter set, it does not matter whether the letter appears once or more
    #
    for word in words:
        for letter in list(set(word.strip().lower())):
            appearances[letter] += 1
    # Debug: you will see Q has the least appearances, E has the most
    print(appearances)
    #
    # Let's sort this. It's key to this algorithm
    #
    # In short:
    #
    # Suppose we only have 5 letters, A to E
    # Suppose we have counted our appearances and this is how many times they show up
    # A : 10
    # B : 5
    # C : 3
    # D : 7
    # E : 12
    #
    # Sorted:
    # C : 3, B : 5, D : 7, A : 10, E : 12
    #
    # Suppose we need combinations of only 2 letters
    # Take C + B
    # In the worst case, you have in total 8 words that contain either C or B. This is the case where no words have both.
    # In the best case, you have 5. This is the case where 3 words contain both B and C, and 2 words contain only B
    #
    # Given the above, it makes no sense to check any combination with A or E
    # You know they appear in either 10 or 12 words. They can't beat B+C in number of appearances
    # So don't include them in the combinations. This will significantly lower the number of combinations
    #
    # Given the above, you must include D, as you don't know how many words have either B or C (between 5 and 8)
    #
    # On words.txt, this approach resulted in only 252 combinations to check. So with "brute force", you only
    # needed 252 iterations over the possible combinations of 5 characters. You can verify with the debug code
    #
    #
    #
    # appearances_sorted is a list, we can't calculate on it
    #
    appearances_sorted = sorted(appearances, key=lambda x: appearances[x])
    print(appearances_sorted)
    print(appearances_sorted[:5])
    #
    # Calculate the least amount possible. This is the sum of the 5 lowest appearances
    # As we are looping over the first 5, we already put them in our list of combinations to check
    #
    sum_least = 0
    appearances_least = {}
    for k in appearances_sorted[:5]:
        v = appearances[k]
        sum_least += v
        appearances_least[k] = v
    print(sum_least)
    print(appearances_least)
    #
    # For the rest of the sorted appearances, we add them, unless the appearance of the character by itself
    # is already higher than the sum we calculated
    #
    for k in appearances_sorted[5:]:
        if appearances[k] > sum_least:
            break
        appearances_least[k] = appearances[k]
    print(appearances_least)
    #
    # Debug code to check the math against the len of the combinations Python will provide
    #
    # f1 = fac(len(appearances_least))
    # f2 = fac(len(appearances_least) - 5)
    # f3 = fac(5)
    # print(f1 / f2 / f3)
    #
    #
    # Create all the possible combinations using itertools
    # One advantage is also that we can do this on a sorted list; the combinations with the smallest possible
    # appearances appear first. But as said, as we don't know the words that have multiple letters combined, we
    # cannot be sure we only need to check the first
    #
    combinations = list(itertools.combinations(appearances_least, 5))
    # This will print 252 on the words.txt file
    print(len(combinations))
    #
    # How many words in total do we have
    # This total will be used as a starting point to see how well a combination does
    # The worst combination possible will never be in more words than the file contains
    #
    total_words = len(words)
    min_found = total_words
    print(total_words)
    #
    # Just to avoid that PyCharm complains that best_combo might not be set later
    #
    best_combo = combinations[0]
    #
    # Loop over all the combos we have, as we cannot be sure about the words that have multiple letters
    # When we calculated the appearances, we were counting only per letter
    #
    for combo in combinations:
        count_matches = 0
        #
        # Loop over the words, then over the letters in the combo
        # If one of the letters is found, add to the counter and stop the loop, as it does not matter if other characters
        # of the combo also appear. One is enough to count it.
        #
        #
        for word in words:
            for letter in combo:
                if letter in word:
                    count_matches += 1
                    break
            #
            # If we have already found more words than the minimum we have detected so far, we can stop the loop. This
            # combo will not be better, it will only get worse.
            #
            if count_matches > min_found:
                break
        #
        # If we found a better one, store it
        #
        if count_matches < min_found:
            best_combo = combo
            min_found = count_matches
    # End timestamp
    end_ts = timeit.default_timer()
    #
    # Print the results
    #
    print(best_combo)
    print(min_found)
    print(end_ts - start_ts)
    #
    # I have:
    #
    # ('q', 'j', 'x', 'z', 'w')
    # 17382
    # 4.387889001052827
    #
    # Enjoy !
0 votes
Here is my solution:
from string import ascii_letters

def smallest_set(filename):
    avoid_dict = dict.fromkeys(ascii_letters.lower(), 0)
    with open(filename) as file_handler:
        for line in file_handler:
            for key in avoid_dict:
                if key not in line:
                    avoid_dict[key] += 1
    avoid_stats_sorted = sorted(avoid_dict, key=avoid_dict.get,
                                reverse=True)
    return ''.join([item for item in avoid_stats_sorted[:5]])
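A minimal usage sketch, assuming the word list is saved as words.txt in the working directory. Note that this returns the five letters that are individually absent from the most words (equivalently, present in the fewest words), without accounting for overlap between their word sets:

if __name__ == '__main__':
    # prints the five letters that each appear in the fewest words
    print(smallest_set('words.txt'))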
0 votes
Here is my idea:
First, compute something I'll call excluded[l]: a lookup, indexed by letter, from each letter to the set of words that would be excluded if that letter were banned.
Then find the five smallest of these 26 sets and take their union. That gives you a reasonable provisional minimum.
Next, instead of exploring all 5-letter combinations with itertools.combinations, write your own search. In it, build the union of the "excluded" sets incrementally: if after the first i letters (i < 5) the union is already larger than the provisional minimum, there is no need to consider any further letters. Whenever you find a 5-letter combination that beats the current provisional minimum, update it.
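A rough sketch of this idea, with my own names (best_five, excluded, search) and assuming words is the list loaded from words.txt as in the question:

import string

def best_five(words):
    # excluded[l]: the set of words that would be lost if letter l were forbidden
    excluded = {l: {w for w in words if l in w} for l in string.ascii_lowercase}
    # try letters with small "excluded" sets first so pruning kicks in early
    letters = sorted(excluded, key=lambda l: len(excluded[l]))
    # provisional minimum: the union of the five individually smallest sets
    best = set.union(*(excluded[l] for l in letters[:5]))
    best_combo = tuple(letters[:5])

    def search(start, chosen, union):
        nonlocal best, best_combo
        if len(union) >= len(best):
            return  # adding more letters can only grow the union, so prune
        if len(chosen) == 5:
            best, best_combo = union, tuple(chosen)
            return
        for i in range(start, len(letters)):
            search(i + 1, chosen + [letters[i]], union | excluded[letters[i]])

    search(0, [], set())
    return best_combo, len(best)

The running union can only grow as letters are added, so any partial combination whose union is already at least as large as the provisional minimum can be discarded along with everything that extends it.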
3 votes
First, let's write it a bit more concisely.
def contain(word, letters):
    return any(letter in word for letter in letters)
def ncont(words, letters):
    return sum(contain(word, letters) for word in words)
At the moment, the average complexity of your algorithm is:
O(len(letters) * len(a_word) * len(words))
  ------------+-------------  ----+-----
    contain(word, letters)   ncont(words, letters)
We can reduce this complexity by using a set:
def contain(word, letters):
    return not set(letters).isdisjoint(set(word))
This reduces the complexity to:
O(min(len(letters), len(a_word)) * len(words))
  --------------+---------------  ----+-----
      contain(word, letters)     ncont(words, letters)
per https://wiki.python.org/moin/TimeComplexity.
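A rough micro-benchmark sketch of the two versions of contain (the sample word and letters here are arbitrary; absolute numbers will vary by machine):

import timeit
setup = '''
letters = 'qjxzw'
word = 'apparatus'
def contain_any(word, letters):
    return any(letter in word for letter in letters)
def contain_set(word, letters):
    return not set(letters).isdisjoint(set(word))
'''
print(timeit.timeit('contain_any(word, letters)', setup=setup))
print(timeit.timeit('contain_set(word, letters)', setup=setup))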
As for the second part, using itertools makes the algorithm easier to follow:
import itertools
import string

def minimum_letter_set(words, n):
    attempts = itertools.combinations(string.ascii_lowercase, n)
    return min(attempts, key=lambda attempt: ncont(words, attempt))
However, we can do better:
def minimum_letter_set(words, n):
    # build a lookup table for each letter to the set of words it features in
    by_letter = {
        letter: {
            word
            for word in words
            if letter in word
        }
        for letter in string.ascii_lowercase
    }
    # allowing us to define a function that finds words that match multiple letters
    def matching_words(letters):
        return set.union(*(by_letter[l] for l in letters))
    # find all n letter combinations
    attempts = itertools.combinations(string.ascii_lowercase, n)
    # and return the one that matches the fewest words
    return min(attempts, key=lambda a: len(matching_words(a)))
I'm not sure this actually lowers the algorithmic complexity, but it does avoid repeatedly filtering the word list.
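Finally, a usage sketch under the same assumptions as the question (a words.txt file in the working directory, one word per line):

if __name__ == '__main__':
    with open('words.txt') as fin:
        words = [line.strip() for line in fin]
    # the 5-letter combination that excludes the fewest words
    print(minimum_letter_set(words, 5))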