如何递归生成多词术语？

6 投票

4 回答

724 浏览

数据工程师

提问于 2025-04-15 11:04

假设我有一串单词：'a b c d e f'。我想从这串单词中生成一些多词组合。

单词的顺序很重要。比如说，'f e d'这样的组合是不能从上面的例子中生成的。

补充说明：而且，单词不能被跳过。像'a c'或者'b d f'这样的组合也不应该被生成。

我现在的代码：

doc = 'a b c d e f'
terms= []
one_before = None
two_before = None
for word in doc.split(None):
    terms.append(word)
    if one_before:
        terms.append(' '.join([one_before, word]))
    if two_before:
        terms.append(' '.join([two_before, one_before, word]))
    two_before = one_before
    one_before = word

for term in terms:
    print term

输出结果：

a
b
a b
c
b c
a b c
d
c d
b c d
e
d e
c d e
f
e f
d e f

我该如何把这个变成一个递归函数，这样我就可以设置每个组合的最大单词数量了？

应用场景：

我会用这个来从HTML文档中的可读文本生成多词组合。我的总体目标是对一个大数据集（大约两百万个文档）进行潜在语义分析。这就是为什么单词顺序很重要（涉及自然语言处理等内容）。

数据处理递归 html解析自然语言处理组合算法多词组合语义分析文本生成

4 个回答

你为什么要这样做呢？其实你可以直接用一个for循环和itertools.combinations()来实现。

回答于 2025-04-15 由 Python大师

分享举报

我建议你把你的函数改成一个生成器，这样就可以生成你需要的项数了。你需要把 print 改成 yield（当然整个代码块也要变成一个函数）。

你也可以看看 itertools 这个模块，它对你正在做的工作非常有帮助。

回答于 2025-04-15 由 Python大师

分享举报

这不是递归的，但我觉得它能满足你的需求。

doc = 'a b c d e f'
words = doc.split(None)
max = 3          


for index in xrange(len(words)):    
    for n in xrange(max):
        if index + n < len(words):           
            print ' '.join(words[index:index+n+1])

下面是一个递归的解决方案：

def find_terms(words, max_words_per_term):       
    if len(words) == 0: return []
    return [" ".join(words[:i+1]) for i in xrange(min(len(words), max_words_per_term))] + find_terms(words[1:], max_words_per_term)


doc = 'a b c d e f'
words = doc.split(None) 
for term in find_terms(words, 3):
    print term

这里再次展示了递归函数，并加了一些解释性的变量和注释。

def find_terms(words, max_words_per_term):   

    # If there are no words, you've reached the end. Stop.    
    if len(words) == 0:
        return []      

    # What's the max term length you could generate from the remaining 
    # words? It's the lesser of max_words_per_term and how many words 
    # you have left.                                                         
    max_term_len = min(len(words), max_words_per_term)       

    # Find all the terms that start with the first word.
    initial_terms = [" ".join(words[:i+1]) for i in xrange(max_term_len)]

    # Here's the recursion. Find all of the terms in the list 
    # of all but the first word.
    other_terms = find_terms(words[1:], max_words_per_term)

    # Now put the two lists of terms together to get the answer.
    return initial_terms + other_terms

回答于 2025-04-15 由 Python大师

分享举报

如何递归生成多词术语？

4 个回答

撰写回答