寻找茎和尾的组合

all_endings = { 'birth': set(['place', 'day', 'mark']), 'snow': set(['plow', 'storm', 'flake', 'man']), 'shoe': set(['lace', 'string', 'maker']), 'lock': set(['down', 'up', 'smith']), 'crack': set(['down', 'up',]), 'arm': set(['chair']), 'high': set(['chair']), 'over': set(['charge']), 'under': set(['charge']), }

findings = defaultdict(set) for stem, endings in all_endings.items(): # What stems have matching endings: for ending in endings: otherstems = all_stems[ending] if not otherstems: continue for otherstem in otherstems: # Find endings that also exist for other stems otherendings = all_endings[otherstem].intersection(endings) if otherendings: # Some kind of match findings[stem].add(otherstem) # Go through this in order of what is the most stems that match: MINMATCH = 2 for match in sorted(findings.values(), key=len, reverse=True): for this_stem in match: other_stems = set() # Stems that have endings in common with this_stem other_endings = set() # Endings this stem have in common with other stems this_endings = all_endings[this_stem] for this_ending in this_endings: for other_stem in all_stems[this_ending] - set([this_stem]): matching_endings = this_endings.intersection(all_endings[other_stem]) if matching_endings: other_endings.add(this_ending) other_stems.add(other_stem) stem_matches = all_stems[other_endings.pop()] for other in other_endings: stem_matches = stem_matches.intersection(all_stems[other]) if len(stem_matches) >= MINMATCH: for m in stem_matches: for e in all_endings[m]: print(m+e)

3条回答

网友

1楼 · 编辑于 2024-06-11 07:46:28

我认为我避免这些误报的方法是删除词干交叉处没有单词的候选词-如果这有意义的话：（

请看一下，如果我遗漏了什么，请告诉我。在

#using all_stems and all_endings from the question

#this function is declared at the end of this answer
two_or_more_stem_combinations = get_stem_combinations(all_stems)
print "two_or_more_stem_combinations", two_or_more_stem_combinations
#this print shows ... [set(['lock', 'crack'])] 

for request in two_or_more_stem_combinations:
    #we filter the initial index to only look for sets or words in the request
    candidates = filter(lambda x: x[0] in request, all_endings.items())

    #intersection of the words for the request
    words = candidates[0][1]
    for c in  candidates[1:]:
        words=words.intersection(c[1])

    #it's handy to have it in a dict
    candidates = dict(candidates)

    #we need to remove those that do not contain 
    #any words after the intersection of stems of all the candidates
    candidates_to_remove = set()
    for c in candidates.items():
        if len(c[1].intersection(words)) == 0:
        candidates_to_remove.add(c[0])

    for key in candidates_to_remove:
        del candidates[key]

    #now we know what to combine
    for c in candidates.keys():
       print "combine", c , "with", words

输出：

combine lock with set(['down', 'up'])
combine crack with set(['down', 'up'])

正如您所看到的，这个解决方案不包含那些误报。在

编辑：复杂性

在最坏的情况下，这个解决方案的复杂性不会比O（3n）更糟糕——如果不考虑访问字典的话。以及对于大多数执行，第一个过滤器缩小了相当多的解决方案空间。在

编辑：获取茎

此函数基本上递归地研究字典all_stems，并找到两个或多个词干重合的两个或多个词尾的组合。在

^{pr2}$

网友

2楼 · 编辑于 2024-06-11 07:46:28

它不是特别漂亮，但是如果您将字典分解为两个列表，并使用显式索引，这将非常简单：

all_stems = {
 'chair' : set(['high', 'arm']),
 'charge': set(['over', 'under']),
 'fall'  : set(['down', 'water', 'night']),
 'up'    : set(['lock', 'crack', 'vote']),
 'down'  : set(['lock', 'crack', 'fall']),
}

endings     = all_stems.keys()
stem_sets   = all_stems.values()

i = 0
for target_stem_set in stem_sets:
    i += 1
    j  = 0

    remaining_stems = stem_sets[i:]
    for remaining_stem_set in remaining_stems:
        j += 1
        union = target_stem_set & remaining_stem_set
        if len(union) > 1:
            print "%d matches found" % len(union)
            for stem in union:
                print "%s%s" % (stem, endings[i-1])
                print "%s%s" % (stem, endings[j+i-1])

输出：

^{pr2}$

基本上，我们所做的就是依次迭代每个集合，并将其与剩余的每个集合进行比较，看看是否有两个以上的匹配。我们永远不必尝试比当前集合早的集合，因为它们已经在先前的迭代中进行过比较。其余的（索引等）只是记账。在

网友

3楼 · 编辑于 2024-06-11 07:46:28

==编辑答案：==

好吧，这是另一个迭代，让你考虑我第一次提到的错误。实际上，结果是代码变得更加简短和简单。combinations的文档说“如果输入元素是唯一的，那么每个组合中就没有重复值”，因此它应该只形成和测试最小的交集数。似乎也没有必要确定endings_by_stems。在

from itertools import combinations

MINMATCH = 2
print 'all words with at least', MINMATCH, 'endings in common:'
for (word0,word1) in combinations(stems_by_endings, 2):
    ending_words0 = stems_by_endings[word0]
    ending_words1 = stems_by_endings[word1]
    common_endings = ending_words0 & ending_words1
    if len(common_endings) >= MINMATCH:
        for stem in common_endings:
            print ' ', stem+word0
            print ' ', stem+word1

# all words with at least 2 endings in common:
#   lockdown
#   lockup
#   falldown
#   fallup
#   crackdown
#   crackup

==上一个答案==

在这里，我找到了一个最常用的词干，但是我首先尝试了用最短的词尾来计算词干的数量。在

在最后阶段，它会打印出所有可能的词干+词尾的组合，这些词干+词尾符合标准。我尽量使所有变量名尽可能具有描述性，以便于理解。；—）我也省略了all_endings' and 'all+stems的定义。在

^{pr2}$

相关问题更多 >

编程相关推荐

热门问题

热门文章