Python- 哪个词可以去掉最多连续字母仍然是有效词?
我用了一种很糟糕且效率低下的方法来找出一个单词,看看能连续去掉多少个最后的字母而仍然是一个单词。
比如“Rodeo”,这是一个大家都知道的例子:Rodeo, Rode, Rod, Ro。这个程序找到了“composers”:Composers, Composer, Compose, Compos, Comp。
我在想,怎样才能写一个程序,找到一个最长的单词,去掉它的任何字母(不仅仅是最后的字母)后,仍然被认为是一个单词:
例如:beast, best, bet, be -- 这些都是有效的可能性。
这是我用来找出连续去掉字母的单词的程序(我也很想知道如何改进和优化这个程序):
#Recursive function that finds how many letters can be removed from a word and
#it still be valid.
def wordCheck(word, wordList, counter):
if len(word)>=1:
if word in wordList:
return (wordCheck(word[0:counter-1], wordList, counter-1))
else:
return counter
return counter
def main():
a = open('C:\\Python32\\megalist2.txt', 'r+')
wordList = set([line.strip() for line in a])
#megaList contains a sorted list of tuple of
#(the word, how many letters can be removed consecutively)
megaList = sorted([(i, len(i)-1- wordCheck(i, wordList, len(i))) for i in wordList], key= lambda megaList: megaList[1])
for i in megaList:
if i[1] > 3:
print (i)
if __name__ == '__main__':
main()
3 个回答
也许我只是没理解这个练习的重点,但我觉得一个简单的规则应该能减少很多搜索的工作。特别是如果你只是想找一个字,看看能去掉多少字母,你可能只需要关注那些最长的单词,看看它们里有没有包含最短的单词。
举个例子,很多单词都包含字母“a”和“i”,这两个字本身也是有效的英语单词。而且,越长的单词越可能包含这两个字母。所以,你可以直接跳过那些不含“a”或“i”的单词。
其实你可以把这个想法融入到Greg的解决方案中,前提是你先把单词列表排序,也就是说:
# Similar to Greg's. Reads into a dict
words = dict((x.strip(), None) for x in open("/usr/share/dict/words"))
# Generates a reverse sorted list from the dict (largest words first)
sortedwords = sorted(words, key=len, reverse=True)
# Largest possible reduction is making a longest word into 1 letter
longestPossible = len(sortedWords[0])-1
# Function that recursively finds shorter words and keeps count of reductions
def getMaxLettersRemoved(w, words, alreadyRemovedCount=0):
# If you've already calculated the value, return it
if words[w] is not None:
return words[w]
# Recursively calculate how many letters you can remove
shorterPossibilities = [w[:i] + w[i+1:] for i in xrange(len(w))]
# Calculate how max # of letters you can remove from shortened words
totalRemoved = max([getMaxLettersRemoved(w, words, alreadyRemovedCount+1) for shorter in shorterPossibilities if shorter in words])
# Total removed will be the same or will increase due to removals from shorter words
totalRemoved = max(totalRemoved, alreadyRemovedCount)
# Cache the result and return it
words[w] = totalRemoved
return totalRemoved
# Go through words from largest to smallest, so long as they have 'a' or 'i'
bestNumRemoved = 0
for w in sortedwords:
if 'a' in w or 'i' in w:
# Get the number of letters you can remove
numRemoved = getMaxLettersRemoved(w, words)
# Save the best one found
if numRemoved > bestNumRemoved:
bestWord = w
bestNumRemoved = numRemoved
# Stop if you can't do better
if bestNumRemoved >= len(w)-1:
break
# Need to make sure the best one found is better than any left
if bestNumRemoved < longestPossible:
for w in sortedwords:
# Get the number of letters you can remove
numRemoved = getMaxLettersRemoved(w, words)
# Save the best one found
if numRemoved > bestNumRemoved:
bestWord = w
bestNumRemoved = numRemoved
# Stop if you can't do better
if bestNumRemoved >= len(w)-2:
break
这个方法有几个不同之处。首先,它会先排序,这样你就能找到最长的单词。其次,它在第一次遍历时完全忽略不包含“a”或“i”的单词。第三,它不需要计算每一个单词或整个树结构就能得到结果。相反,它只在需要的时候递归地查看单词。
每次它去掉一个字母并找到一个新单词时,它会运行同样的函数来找出这个较小单词能去掉多少字母,加上从原始单词中已经去掉的字母数量。理论上,这个过程应该很快,因为它不需要处理大多数单词,因为它使用了一种典型的优化技巧:检查是否达到了最佳边界。首先,它在包含“i”或“a”的单词中找到最佳可能性。然后,它会检查比找到的最佳单词更长的单词,以确保没有更好的选项,即使它不包含这两个字母,但至少比它长两个字母(这样理论上可能会更好)。
可能还有一些改进的方法,可以更好地利用英语的规律,使用概率算法,但我觉得这个确定性的方法应该也可以。此外,我手头没有字典,所以我实际上不能...呃...运行这个,但这些概念是合理的。
另外,我并不完全相信排序这个键的列表是值得的。虽然Python的排序算法很快,但处理一个大列表仍然可能会有相当大的成本。一个理想的算法可能需要考虑这个成本,并决定是否值得(可能不值得)。如果不排序列表,你可能希望第一次遍历只考虑某个最小长度的单词,甚至可能作为一个更大循环的一部分。其实没有必要计算那些可能与解决方案无关的单词。
for each english word W:
for each letter you can remove:
remove the letter
if the result W' is also word:
draw a line W->W'
for each english word W:
connect ROOT-> each english word W
use a graph search algorithm to find the longest path starting at ROOT
(specifically, the words are now in a directed acyclic graph; traverse
the graph left-right-top-down, i.e. in a "topological sort", keeping
track of the longest candidate path to reach each node; this achieves
the result in linear time)
这个算法的运行时间是线性的,也就是说,它的速度和输入的单词数量以及单词的平均长度成正比!简单来说,就是读取输入所需的时间。
这个算法可以很容易地改进,用来找到被删除的连续字母:不是每个节点只保留一个候选字母,比如(Node('rod').candidate = rodeo->rode->rod
),而是每个节点保留一组候选字母,并且记录你删除的字母的索引,这样就能得到每个候选字母(Node('rod').candidates={rodeo->rod|e->rod|
, road->ro|d
})。这样做的运行时间是一样的。
这是我刚写的一个实现。用我大约23.5万的单词列表运行大约需要五秒钟。输出结果没有显示完整的链条,但你可以很容易地从输出中重新拼凑出来。
# Load the words into a dictionary
words = dict((x.strip(), set()) for x in open("/usr/share/dict/words"))
# For each word, remove each letter and see if the remaining word is still
# in the dictionary. If so, add it to the set of shorter words associated with
# that word in the dictionary.
# For example, bear -> {ear, bar, ber}
for w in words:
for i in range(len(w)):
shorter = w[:i] + w[i+1:]
if shorter in words:
words[w].add(shorter)
# Sort the words by length so we process the shortest ones first
sortedwords = sorted(words, key=len)
# For each word, the maximum chain length is:
# - the maximum of the chain lengths of each shorter word, if any
# - or 0 if there are no shorter words for this word
# Note that because sortedwords is sorted by length, we will always
# have maxlength[x] already available for each shorter word x
maxlength = {}
for w in sortedwords:
if words[w]:
maxlength[w] = 1 + max(maxlength[x] for x in words[w])
else:
maxlength[w] = 0
# Print the words in all chains for each of the top 10 words
toshow = sorted(words, key=lambda x: maxlength[x], reverse=True)[:10]
while toshow:
w = toshow[0]
print(w, [(x, maxlength[x]) for x in words[w]])
toshow = toshow[1:] + list(x for x in words[w] if x not in toshow)
我字典中最长的单词链是:
- abranchiate
- branchiate
- branchiae
- branchia
- branchi
- branch
- ranch
- rach
- ach
- ah
- a