Python- 哪个词可以去掉最多连续字母仍然是有效词？

12 投票

3 回答

2600 浏览

数据工程师

提问于 2025-04-16 18:06

我用了一种很糟糕且效率低下的方法来找出一个单词，看看能连续去掉多少个最后的字母而仍然是一个单词。

比如“Rodeo”，这是一个大家都知道的例子：Rodeo, Rode, Rod, Ro。这个程序找到了“composers”：Composers, Composer, Compose, Compos, Comp。

我在想，怎样才能写一个程序，找到一个最长的单词，去掉它的任何字母（不仅仅是最后的字母）后，仍然被认为是一个单词：

例如：beast, best, bet, be -- 这些都是有效的可能性。

这是我用来找出连续去掉字母的单词的程序（我也很想知道如何改进和优化这个程序）：

#Recursive function that finds how many letters can be removed from a word and
#it still be valid.  
def wordCheck(word, wordList, counter):

    if len(word)>=1:
        if word in wordList:
            return (wordCheck(word[0:counter-1], wordList, counter-1))
        else:
            return counter
    return counter


def main():
    a = open('C:\\Python32\\megalist2.txt', 'r+')
    wordList = set([line.strip() for line in a])
    #megaList contains a sorted list of tuple of 
    #(the word, how many letters can be removed  consecutively)
    megaList = sorted([(i, len(i)-1- wordCheck(i, wordList, len(i))) for i in wordList], key= lambda megaList: megaList[1])


    for i in megaList:
        if i[1] > 3:
            print (i)

if __name__ == '__main__':
    main()

数据结构算法设计语言处理编程优化词汇分析有效单词单词处理字母去除

3 个回答

也许我只是没理解这个练习的重点，但我觉得一个简单的规则应该能减少很多搜索的工作。特别是如果你只是想找一个字，看看能去掉多少字母，你可能只需要关注那些最长的单词，看看它们里有没有包含最短的单词。

举个例子，很多单词都包含字母“a”和“i”，这两个字本身也是有效的英语单词。而且，越长的单词越可能包含这两个字母。所以，你可以直接跳过那些不含“a”或“i”的单词。

其实你可以把这个想法融入到Greg的解决方案中，前提是你先把单词列表排序，也就是说：

# Similar to Greg's.  Reads into a dict
words = dict((x.strip(), None) for x in open("/usr/share/dict/words"))
# Generates a reverse sorted list from the dict (largest words first)
sortedwords = sorted(words, key=len, reverse=True)

# Largest possible reduction is making a longest word into 1 letter
longestPossible = len(sortedWords[0])-1

# Function that recursively finds shorter words and keeps count of reductions
def getMaxLettersRemoved(w, words, alreadyRemovedCount=0):
    # If you've already calculated the value, return it
    if words[w] is not None:
        return words[w]
    # Recursively calculate how many letters you can remove
    shorterPossibilities = [w[:i] + w[i+1:] for i in xrange(len(w))]
    # Calculate how max # of letters you can remove from shortened words
    totalRemoved = max([getMaxLettersRemoved(w, words, alreadyRemovedCount+1) for shorter in shorterPossibilities if shorter in words])
    # Total removed will be the same or will increase due to removals from shorter words
    totalRemoved = max(totalRemoved, alreadyRemovedCount)
    # Cache the result and return it
    words[w] = totalRemoved
    return totalRemoved 

# Go through words from largest to smallest, so long as they have 'a' or 'i'
bestNumRemoved = 0
for w in sortedwords:
    if 'a' in w or 'i' in w:
        # Get the number of letters you can remove
        numRemoved = getMaxLettersRemoved(w, words)
        # Save the best one found
        if numRemoved > bestNumRemoved:
            bestWord = w
            bestNumRemoved = numRemoved 
        # Stop if you can't do better
        if bestNumRemoved >= len(w)-1:
            break

# Need to make sure the best one found is better than any left
if bestNumRemoved < longestPossible:
    for w in sortedwords:
        # Get the number of letters you can remove
        numRemoved = getMaxLettersRemoved(w, words)
        # Save the best one found
        if numRemoved > bestNumRemoved:
            bestWord = w
            bestNumRemoved = numRemoved 
        # Stop if you can't do better
        if bestNumRemoved >= len(w)-2:
            break

这个方法有几个不同之处。首先，它会先排序，这样你就能找到最长的单词。其次，它在第一次遍历时完全忽略不包含“a”或“i”的单词。第三，它不需要计算每一个单词或整个树结构就能得到结果。相反，它只在需要的时候递归地查看单词。

每次它去掉一个字母并找到一个新单词时，它会运行同样的函数来找出这个较小单词能去掉多少字母，加上从原始单词中已经去掉的字母数量。理论上，这个过程应该很快，因为它不需要处理大多数单词，因为它使用了一种典型的优化技巧：检查是否达到了最佳边界。首先，它在包含“i”或“a”的单词中找到最佳可能性。然后，它会检查比找到的最佳单词更长的单词，以确保没有更好的选项，即使它不包含这两个字母，但至少比它长两个字母（这样理论上可能会更好）。

可能还有一些改进的方法，可以更好地利用英语的规律，使用概率算法，但我觉得这个确定性的方法应该也可以。此外，我手头没有字典，所以我实际上不能...呃...运行这个，但这些概念是合理的。

另外，我并不完全相信排序这个键的列表是值得的。虽然Python的排序算法很快，但处理一个大列表仍然可能会有相当大的成本。一个理想的算法可能需要考虑这个成本，并决定是否值得（可能不值得）。如果不排序列表，你可能希望第一次遍历只考虑某个最小长度的单词，甚至可能作为一个更大循环的一部分。其实没有必要计算那些可能与解决方案无关的单词。

回答于 2025-04-16 由 Python大师

分享举报

for each english word W:
    for each letter you can remove:
        remove the letter
        if the result W' is also word:
            draw a line W->W'
for each english word W:
    connect ROOT-> each english word W
use a graph search algorithm to find the longest path starting at ROOT
    (specifically, the words are now in a directed acyclic graph; traverse
    the graph left-right-top-down, i.e. in a "topological sort", keeping
    track of the longest candidate path to reach each node; this achieves 
    the result in linear time)

这个算法的运行时间是线性的，也就是说，它的速度和输入的单词数量以及单词的平均长度成正比！简单来说，就是读取输入所需的时间。

这个算法可以很容易地改进，用来找到被删除的连续字母：不是每个节点只保留一个候选字母，比如(Node('rod').candidate = rodeo->rode->rod)，而是每个节点保留一组候选字母，并且记录你删除的字母的索引，这样就能得到每个候选字母（Node('rod').candidates={rodeo->rod|e->rod|, road->ro|d}）。这样做的运行时间是一样的。

回答于 2025-04-16 由 Python大师

分享举报

这是我刚写的一个实现。用我大约23.5万的单词列表运行大约需要五秒钟。输出结果没有显示完整的链条，但你可以很容易地从输出中重新拼凑出来。

# Load the words into a dictionary
words = dict((x.strip(), set()) for x in open("/usr/share/dict/words"))

# For each word, remove each letter and see if the remaining word is still
# in the dictionary. If so, add it to the set of shorter words associated with
# that word in the dictionary.
# For example, bear -> {ear, bar, ber}
for w in words:
    for i in range(len(w)):
        shorter = w[:i] + w[i+1:]
        if shorter in words:
            words[w].add(shorter)

# Sort the words by length so we process the shortest ones first
sortedwords = sorted(words, key=len)

# For each word, the maximum chain length is:
#  - the maximum of the chain lengths of each shorter word, if any
#  - or 0 if there are no shorter words for this word
# Note that because sortedwords is sorted by length, we will always
# have maxlength[x] already available for each shorter word x
maxlength = {}
for w in sortedwords:
    if words[w]:
        maxlength[w] = 1 + max(maxlength[x] for x in words[w])
    else:
        maxlength[w] = 0

# Print the words in all chains for each of the top 10 words
toshow = sorted(words, key=lambda x: maxlength[x], reverse=True)[:10]
while toshow:
    w = toshow[0]
    print(w, [(x, maxlength[x]) for x in words[w]])
    toshow = toshow[1:] + list(x for x in words[w] if x not in toshow)

我字典中最长的单词链是：

abranchiate
branchiate
branchiae
branchia
branchi
branch
ranch
rach
ach
ah
a

回答于 2025-04-16 由 Python大师

分享举报

Python- 哪个词可以去掉最多连续字母仍然是有效词？

3 个回答

撰写回答