寻找茎和尾的组合

2024-06-11 07:46:28 发布

您现在位置:Python中文网/ 问答频道 /正文

我有“词干”和“结尾”的映射(可能不是正确的单词):

all_endings = {
 'birth': set(['place', 'day', 'mark']), 
 'snow': set(['plow', 'storm', 'flake', 'man']),
 'shoe': set(['lace', 'string', 'maker']),
 'lock': set(['down', 'up', 'smith']),
 'crack': set(['down', 'up',]),
 'arm': set(['chair']),
 'high': set(['chair']),
 'over': set(['charge']),
 'under': set(['charge']),
}

当然,要长得多。我还反过来做了相应的字典:

^{pr2}$

我现在尝试着想出一个算法来找到两个或多个匹配两个或多个“结尾”的“茎”的匹配。例如,在上面,它将与锁和裂缝匹配,从而导致

lockdown
lockup
crackdown
crackup

但不包括'upvote', 'downfall' or 'locksmith'(这是我最大的问题)。我得到的误报如下:

pancake
cupcake
cupboard

但我只是在绕圈子。(双关语的意思)而且我似乎一事无成。我会感激你朝正确的方向踢。在

到目前为止,代码混乱且无用,您可能应该忽略这些代码:

findings = defaultdict(set)
for stem, endings in all_endings.items():
    # What stems have matching endings:
    for ending in endings:
        otherstems = all_stems[ending]
        if not otherstems:
            continue
        for otherstem in otherstems:
            # Find endings that also exist for other stems
            otherendings = all_endings[otherstem].intersection(endings)
            if otherendings:
                # Some kind of match
                findings[stem].add(otherstem)

# Go through this in order of what is the most stems that match:

MINMATCH = 2
for match in sorted(findings.values(), key=len, reverse=True):
    for this_stem in match:
        other_stems = set() # Stems that have endings in common with this_stem
        other_endings = set() # Endings this stem have in common with other stems
        this_endings = all_endings[this_stem]
        for this_ending in this_endings:
            for other_stem in all_stems[this_ending] - set([this_stem]):
                matching_endings = this_endings.intersection(all_endings[other_stem])
                if matching_endings:
                    other_endings.add(this_ending)
                    other_stems.add(other_stem)

        stem_matches = all_stems[other_endings.pop()]
        for other in other_endings:
            stem_matches = stem_matches.intersection(all_stems[other])

        if len(stem_matches) >= MINMATCH:
            for m in stem_matches:
                for e in all_endings[m]:
                    print(m+e)

Tags: inforifhavematchendingallthis
3条回答

我认为我避免这些误报的方法是删除词干交叉处没有单词的候选词-如果这有意义的话:(

请看一下,如果我遗漏了什么,请告诉我。在

#using all_stems and all_endings from the question

#this function is declared at the end of this answer
two_or_more_stem_combinations = get_stem_combinations(all_stems)
print "two_or_more_stem_combinations", two_or_more_stem_combinations
#this print shows ... [set(['lock', 'crack'])] 

for request in two_or_more_stem_combinations:
    #we filter the initial index to only look for sets or words in the request
    candidates = filter(lambda x: x[0] in request, all_endings.items())

    #intersection of the words for the request
    words = candidates[0][1]
    for c in  candidates[1:]:
        words=words.intersection(c[1])

    #it's handy to have it in a dict
    candidates = dict(candidates)

    #we need to remove those that do not contain 
    #any words after the intersection of stems of all the candidates
    candidates_to_remove = set()
    for c in candidates.items():
        if len(c[1].intersection(words)) == 0:
        candidates_to_remove.add(c[0])

    for key in candidates_to_remove:
        del candidates[key]

    #now we know what to combine
    for c in candidates.keys():
       print "combine", c , "with", words 

输出:

combine lock with set(['down', 'up'])

combine crack with set(['down', 'up'])

正如您所看到的,这个解决方案不包含那些误报。在

编辑:复杂性

在最坏的情况下,这个解决方案的复杂性不会比O(3n)更糟糕——如果不考虑访问字典的话。以及 对于大多数执行,第一个过滤器缩小了相当多的解决方案空间。在

编辑:获取茎

此函数基本上递归地研究字典all_stems,并找到两个或多个词干重合的两个或多个词尾的组合。在

^{pr2}$

它不是特别漂亮,但是如果您将字典分解为两个列表,并使用显式索引,这将非常简单:

all_stems = {
 'chair' : set(['high', 'arm']),
 'charge': set(['over', 'under']),
 'fall'  : set(['down', 'water', 'night']),
 'up'    : set(['lock', 'crack', 'vote']),
 'down'  : set(['lock', 'crack', 'fall']),
}

endings     = all_stems.keys()
stem_sets   = all_stems.values()

i = 0
for target_stem_set in stem_sets:
    i += 1
    j  = 0

    remaining_stems = stem_sets[i:]
    for remaining_stem_set in remaining_stems:
        j += 1
        union = target_stem_set & remaining_stem_set
        if len(union) > 1:
            print "%d matches found" % len(union)
            for stem in union:
                print "%s%s" % (stem, endings[i-1])
                print "%s%s" % (stem, endings[j+i-1])

输出:

^{pr2}$

基本上,我们所做的就是依次迭代每个集合,并将其与剩余的每个集合进行比较,看看是否有两个以上的匹配。我们永远不必尝试比当前集合早的集合,因为它们已经在先前的迭代中进行过比较。其余的(索引等)只是记账。在

==编辑答案:==

好吧,这是另一个迭代,让你考虑我第一次提到的错误。实际上,结果是代码变得更加简短和简单。combinations的文档说“如果输入元素是唯一的,那么每个组合中就没有重复值”,因此它应该只形成和测试最小的交集数。似乎也没有必要确定endings_by_stems。在

from itertools import combinations

MINMATCH = 2
print 'all words with at least', MINMATCH, 'endings in common:'
for (word0,word1) in combinations(stems_by_endings, 2):
    ending_words0 = stems_by_endings[word0]
    ending_words1 = stems_by_endings[word1]
    common_endings = ending_words0 & ending_words1
    if len(common_endings) >= MINMATCH:
        for stem in common_endings:
            print ' ', stem+word0
            print ' ', stem+word1

# all words with at least 2 endings in common:
#   lockdown
#   lockup
#   falldown
#   fallup
#   crackdown
#   crackup

==上一个答案==

在这里,我找到了一个最常用的词干,但是我首先尝试了用最短的词尾来计算词干的数量。在

在最后阶段,它会打印出所有可能的词干+词尾的组合,这些词干+词尾符合标准。我尽量使所有变量名尽可能具有描述性,以便于理解。;—)我也省略了all_endings' and 'all+stems的定义。在

^{pr2}$

相关问题 更多 >