Python- 哪个词可以去掉最多连续字母仍然是有效词?

12 投票
3 回答
2600 浏览
提问于 2025-04-16 18:06

我用了一种很糟糕且效率低下的方法来找出一个单词,看看能连续去掉多少个最后的字母而仍然是一个单词。

比如“Rodeo”,这是一个大家都知道的例子:Rodeo, Rode, Rod, Ro。这个程序找到了“composers”:Composers, Composer, Compose, Compos, Comp。

我在想,怎样才能写一个程序,找到一个最长的单词,去掉它的任何字母(不仅仅是最后的字母)后,仍然被认为是一个单词:

例如:beast, best, bet, be -- 这些都是有效的可能性。

这是我用来找出连续去掉字母的单词的程序(我也很想知道如何改进和优化这个程序):

#Recursive function that finds how many letters can be removed from a word and
#it still be valid.  
def wordCheck(word, wordList, counter):

    if len(word)>=1:
        if word in wordList:
            return (wordCheck(word[0:counter-1], wordList, counter-1))
        else:
            return counter
    return counter


def main():
    a = open('C:\\Python32\\megalist2.txt', 'r+')
    wordList = set([line.strip() for line in a])
    #megaList contains a sorted list of tuple of 
    #(the word, how many letters can be removed  consecutively)
    megaList = sorted([(i, len(i)-1- wordCheck(i, wordList, len(i))) for i in wordList], key= lambda megaList: megaList[1])


    for i in megaList:
        if i[1] > 3:
            print (i)

if __name__ == '__main__':
    main()

3 个回答

1

也许我只是没理解这个练习的重点,但我觉得一个简单的规则应该能减少很多搜索的工作。特别是如果你只是想找一个字,看看能去掉多少字母,你可能只需要关注那些最长的单词,看看它们里有没有包含最短的单词。

举个例子,很多单词都包含字母“a”和“i”,这两个字本身也是有效的英语单词。而且,越长的单词越可能包含这两个字母。所以,你可以直接跳过那些不含“a”或“i”的单词。

其实你可以把这个想法融入到Greg的解决方案中,前提是你先把单词列表排序,也就是说:

# Similar to Greg's.  Reads into a dict
words = dict((x.strip(), None) for x in open("/usr/share/dict/words"))
# Generates a reverse sorted list from the dict (largest words first)
sortedwords = sorted(words, key=len, reverse=True)

# Largest possible reduction is making a longest word into 1 letter
longestPossible = len(sortedWords[0])-1

# Function that recursively finds shorter words and keeps count of reductions
def getMaxLettersRemoved(w, words, alreadyRemovedCount=0):
    # If you've already calculated the value, return it
    if words[w] is not None:
        return words[w]
    # Recursively calculate how many letters you can remove
    shorterPossibilities = [w[:i] + w[i+1:] for i in xrange(len(w))]
    # Calculate how max # of letters you can remove from shortened words
    totalRemoved = max([getMaxLettersRemoved(w, words, alreadyRemovedCount+1) for shorter in shorterPossibilities if shorter in words])
    # Total removed will be the same or will increase due to removals from shorter words
    totalRemoved = max(totalRemoved, alreadyRemovedCount)
    # Cache the result and return it
    words[w] = totalRemoved
    return totalRemoved 

# Go through words from largest to smallest, so long as they have 'a' or 'i'
bestNumRemoved = 0
for w in sortedwords:
    if 'a' in w or 'i' in w:
        # Get the number of letters you can remove
        numRemoved = getMaxLettersRemoved(w, words)
        # Save the best one found
        if numRemoved > bestNumRemoved:
            bestWord = w
            bestNumRemoved = numRemoved 
        # Stop if you can't do better
        if bestNumRemoved >= len(w)-1:
            break

# Need to make sure the best one found is better than any left
if bestNumRemoved < longestPossible:
    for w in sortedwords:
        # Get the number of letters you can remove
        numRemoved = getMaxLettersRemoved(w, words)
        # Save the best one found
        if numRemoved > bestNumRemoved:
            bestWord = w
            bestNumRemoved = numRemoved 
        # Stop if you can't do better
        if bestNumRemoved >= len(w)-2:
            break

这个方法有几个不同之处。首先,它会先排序,这样你就能找到最长的单词。其次,它在第一次遍历时完全忽略不包含“a”或“i”的单词。第三,它不需要计算每一个单词或整个树结构就能得到结果。相反,它只在需要的时候递归地查看单词。

每次它去掉一个字母并找到一个新单词时,它会运行同样的函数来找出这个较小单词能去掉多少字母,加上从原始单词中已经去掉的字母数量。理论上,这个过程应该很快,因为它不需要处理大多数单词,因为它使用了一种典型的优化技巧:检查是否达到了最佳边界。首先,它在包含“i”或“a”的单词中找到最佳可能性。然后,它会检查比找到的最佳单词更长的单词,以确保没有更好的选项,即使它不包含这两个字母,但至少比它长两个字母(这样理论上可能会更好)。

可能还有一些改进的方法,可以更好地利用英语的规律,使用概率算法,但我觉得这个确定性的方法应该也可以。此外,我手头没有字典,所以我实际上不能...呃...运行这个,但这些概念是合理的。

另外,我并不完全相信排序这个键的列表是值得的。虽然Python的排序算法很快,但处理一个大列表仍然可能会有相当大的成本。一个理想的算法可能需要考虑这个成本,并决定是否值得(可能不值得)。如果不排序列表,你可能希望第一次遍历只考虑某个最小长度的单词,甚至可能作为一个更大循环的一部分。其实没有必要计算那些可能与解决方案无关的单词。

10
for each english word W:
    for each letter you can remove:
        remove the letter
        if the result W' is also word:
            draw a line W->W'
for each english word W:
    connect ROOT-> each english word W
use a graph search algorithm to find the longest path starting at ROOT
    (specifically, the words are now in a directed acyclic graph; traverse
    the graph left-right-top-down, i.e. in a "topological sort", keeping
    track of the longest candidate path to reach each node; this achieves 
    the result in linear time)

这个算法的运行时间是线性的,也就是说,它的速度和输入的单词数量以及单词的平均长度成正比!简单来说,就是读取输入所需的时间。

这个算法可以很容易地改进,用来找到被删除的连续字母:不是每个节点只保留一个候选字母,比如(Node('rod').candidate = rodeo->rode->rod),而是每个节点保留一组候选字母,并且记录你删除的字母的索引,这样就能得到每个候选字母(Node('rod').candidates={rodeo->rod|e->rod|, road->ro|d})。这样做的运行时间是一样的。

8

这是我刚写的一个实现。用我大约23.5万的单词列表运行大约需要五秒钟。输出结果没有显示完整的链条,但你可以很容易地从输出中重新拼凑出来。

# Load the words into a dictionary
words = dict((x.strip(), set()) for x in open("/usr/share/dict/words"))

# For each word, remove each letter and see if the remaining word is still
# in the dictionary. If so, add it to the set of shorter words associated with
# that word in the dictionary.
# For example, bear -> {ear, bar, ber}
for w in words:
    for i in range(len(w)):
        shorter = w[:i] + w[i+1:]
        if shorter in words:
            words[w].add(shorter)

# Sort the words by length so we process the shortest ones first
sortedwords = sorted(words, key=len)

# For each word, the maximum chain length is:
#  - the maximum of the chain lengths of each shorter word, if any
#  - or 0 if there are no shorter words for this word
# Note that because sortedwords is sorted by length, we will always
# have maxlength[x] already available for each shorter word x
maxlength = {}
for w in sortedwords:
    if words[w]:
        maxlength[w] = 1 + max(maxlength[x] for x in words[w])
    else:
        maxlength[w] = 0

# Print the words in all chains for each of the top 10 words
toshow = sorted(words, key=lambda x: maxlength[x], reverse=True)[:10]
while toshow:
    w = toshow[0]
    print(w, [(x, maxlength[x]) for x in words[w]])
    toshow = toshow[1:] + list(x for x in words[w] if x not in toshow)

我字典中最长的单词链是:

  • abranchiate
  • branchiate
  • branchiae
  • branchia
  • branchi
  • branch
  • ranch
  • rach
  • ach
  • ah
  • a

撰写回答