How can I generate sentences from an induced grammar with NLTK?

20 votes

5 answers

19657 views

Asked 2025-04-17 16:40

I have a (large) list of parsed sentences, produced with the Stanford parser. For example, the sentence "Now, you can be entertained." has the following tree:

(ROOT
  (S
    (ADVP (RB Now))
    (, ,)
    (NP (PRP you))
    (VP (MD can)
      (VP (VB be)
        (VP (VBN entertained))))
    (. .)))

I am using this set of sentence trees to induce a grammar with nltk:

import nltk

# Collect the productions of every parsed sentence tree.
# `trees` is assumed here to be the list of parsed nltk.Tree objects.
allProductions = []
for t in trees:
    allProductions += t.productions()

# Induce the grammar
S = nltk.Nonterminal('S')
grammar = nltk.induce_pcfg(S, allProductions)

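Now I would like to use this grammar to generate new, random sentences. My hope is that, because the grammar was learned from a specific set of examples, the generated sentences will be semantically similar. Can I do this in nltk?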

If I can't use nltk for this, are there other tools that can take the (possibly reformatted) grammar and generate sentences?

Tags: tree, natural language processing, nltk, sentence generation, Stanford parser, induced grammar, random sentences

5 Answers

This is the approach I use to generate random sentences from an existing nltk.CFG grammar:

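import random

def generate_sample(grammar, prod, frags):
    # Note: _lhs_index and _rhs_index are private attributes of nltk.CFG.
    if prod in grammar._lhs_index:
        # Nonterminal: pick one of its productions uniformly at random
        derivations = grammar._lhs_index[prod]
        derivation = random.choice(derivations)
        # Expand every symbol on the right-hand side, left to right
        for d in derivation.rhs():
            generate_sample(grammar, d, frags)
    elif prod in grammar._rhs_index:
        # Terminal: emit the word
        frags.append(str(prod))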

The function can then be used like this:

frags = []
generate_sample(grammar, grammar.start(), frags)
print(' '.join(frags))

Answered 2025-04-17 by Python大师

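First of all, if you generate random sentences, they may be grammatically correct, but they will most likely lose their meaning.

This reminds me of SCIgen, the program written by MIT students that automatically generates scientific papers. It is a very interesting project, by the way.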

I have never done this myself, but it sounds feasible with nltk.bigrams. You can take a look here, at the section on generating random text with bigrams.
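To make the bigram idea concrete, here is a minimal sketch in plain Python. It is not the code from the section referred to above: the function names `build_bigram_model` and `generate_text` are made up for illustration, and the next word is sampled at random from the words seen after the current one. The pairs produced by `zip(words, words[1:])` are the same ones `nltk.bigrams(words)` would yield.

```python
import random
from collections import defaultdict

def build_bigram_model(words):
    """Map each word to the list of words that followed it in the corpus."""
    followers = defaultdict(list)
    # Consecutive word pairs, i.e. the bigrams of the corpus
    for first, second in zip(words, words[1:]):
        followers[first].append(second)
    return followers

def generate_text(followers, start, max_words=10):
    """Walk the bigram model from `start`, sampling one follower at a time."""
    word = start
    out = [start]
    # Stop at the length limit or when the current word has no known follower
    while len(out) < max_words and followers.get(word):
        word = random.choice(followers[word])
        out.append(word)
    return out

corpus = "now you can be entertained and now you can be happy".split()
model = build_bigram_model(corpus)
print(" ".join(generate_text(model, "now")))
```

Because frequent followers appear more often in each list, `random.choice` already samples them in proportion to their counts. The output tends to be locally plausible but carries no guarantee of grammaticality.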

You can also generate all the subtrees of a given tree, though I am not sure that is what you want.

Answered 2025-04-17 by Python大师

In NLTK 2.0 you can use nltk.parse.generate to generate all possible sentences for a given grammar.
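As a rough illustration of that exhaustive generation, here is a minimal sketch using the `generate()` function found in `nltk.parse.generate` in recent NLTK releases. The toy grammar is made up for the example and is not the induced grammar from the question.

```python
from nltk import CFG
from nltk.parse.generate import generate

# A small hand-written grammar, used only for illustration
toy_grammar = CFG.fromstring("""
S -> NP VP
NP -> 'you' | 'we'
VP -> 'can' V
V -> 'sing' | 'dance'
""")

# n caps the number of sentences returned; generate() also accepts a
# depth argument, which helps when the grammar is recursive.
sentences = [" ".join(words) for words in generate(toy_grammar, n=10)]
for sentence in sentences:
    print(sentence)
```

For a grammar induced from a large treebank the full set of sentences is usually far too large (or infinite) to enumerate, so `n` and `depth` are worth setting.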

The following code defines a function that generates a single sentence from the production rules of a (P)CFG.

# This example uses choice to choose from possible expansions
from random import choice
# This function is based on _generate_all() in nltk.parse.generate
from nltk.grammar import Nonterminal

def generate_sample(grammar, items=None):
    # Start from the grammar's start symbol unless told otherwise
    if items is None:
        items = [grammar.start()]
    frags = []
    for item in items:
        if isinstance(item, Nonterminal):
            # Pick one expansion of this nonterminal uniformly at random
            chosen_expansion = choice(grammar.productions(lhs=item))
            frags.extend(generate_sample(grammar, chosen_expansion.rhs()))
        else:
            # Terminal: emit the word
            frags.append(item)
    return frags

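If you want to use the weights in your PCFG, you will obviously need a better sampling method than choice(), which implicitly assumes that all expansions of the current node are equally likely.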
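One possible way to respect the weights is sketched below, using `random.choices` from the standard library. The name `generate_weighted` is made up for illustration. The sketch assumes that each production exposes `prob()` and `rhs()`, as the productions of an NLTK PCFG do, and that terminals are plain strings, as they are for the leaves of parsed trees.

```python
import random

def generate_weighted(grammar, symbol=None):
    """Sample one sentence, choosing each expansion in proportion to its probability."""
    if symbol is None:
        symbol = grammar.start()
    # Assumption: terminals are plain strings; anything else is a nonterminal.
    if isinstance(symbol, str):
        return [symbol]
    productions = grammar.productions(lhs=symbol)
    weights = [p.prob() for p in productions]
    # random.choices draws one production according to the given weights
    chosen = random.choices(productions, weights=weights, k=1)[0]
    words = []
    for child in chosen.rhs():
        words.extend(generate_weighted(grammar, child))
    return words
```

With the induced grammar from the question this would be called as `' '.join(generate_weighted(grammar))`. A heavily recursive grammar can still produce very long derivations, so adding a depth limit may be worthwhile.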

Answered 2025-04-17 by Python大师
