Using the LSTM tutorial code to predict the next word in a sentence?

3 Answers

User

#1 · edited 2024-04-26 09:29:23

There are a lot of questions here; I will try to clarify some of them.

how do I use the produced model to actually generate a next word suggestion, given the first few words of a sentence?

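The key insight here is that next-word generation is really word classification over the vocabulary. So you need a classifier, which is why there is a softmax at the output.

The principle is that at each time step the model outputs the next word based on the embedding of the last word and the internal memory built up from the previous words. tf.contrib.rnn.static_rnn automatically merges the input into the memory, but we still need to supply the embedding of the last word and classify the next word.

We can use a pre-trained word2vec model: just initialize the embedding matrix with the pre-trained one. I think the tutorial uses a random matrix for the sake of simplicity. The memory size is not tied to the embedding size; you can use a larger memory size to retain more information.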
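As a rough illustration of the points above, here is a minimal NumPy sketch. It is not the tutorial's code: it swaps the LSTM for a plain tanh recurrent cell to stay short, and the vocabulary, sizes, and weight names are all made up. It shows the embedding lookup, the memory update, and the final classification over the vocabulary, with a memory size that differs from the embedding size.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["<eos>", "the", "cat", "sat", "down"]   # toy vocabulary (assumption)
word_to_id = {w: i for i, w in enumerate(vocab)}
vocab_size, embed_size, hidden_size = len(vocab), 4, 6  # memory larger than embedding

# Randomly initialised (untrained) parameters, purely for illustration.
embedding = rng.normal(size=(vocab_size, embed_size))
W_xh = rng.normal(size=(embed_size, hidden_size))
W_hh = rng.normal(size=(hidden_size, hidden_size))
W_hy = rng.normal(size=(hidden_size, vocab_size))   # FC layer onto the vocabulary


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def next_word_distribution(prefix_words):
    """Run the prefix through the recurrent cell and classify the next word."""
    h = np.zeros(hidden_size)                 # internal memory
    for w in prefix_words:
        x = embedding[word_to_id[w]]          # embedding of the current word
        h = np.tanh(x @ W_xh + h @ W_hh)      # merge the input into the memory
    return softmax(h @ W_hy)                  # distribution over the vocabulary


probs = next_word_distribution(["the", "cat"])
suggestion = vocab[int(np.argmax(probs))]
```

Since the weights are untrained, the suggestion itself is meaningless; only the data flow is of interest.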

These tutorials are high-level. If you want to understand the details in depth, I suggest looking at an implementation in plain python/numpy.

User

#2 · edited 2024-04-26 09:29:23

My biggest question is how do I use the produced model to actually generate a next word suggestion, given the first few words of a sentence?
I.e. I'm trying to write a function with the signature: getNextWord(model, sentencePrefix)

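Before I explain my answer, a remark about your suggestion to # Call static_rnn(cell) once for each word in prefix to initialize state: keep in mind that static_rnn does not return a value like a numpy array, but a tensor. A tensor can be evaluated to a value when it is run (1) in a session (a session keeps the state of your computational graph, including the values of your model parameters) and (2) with the input that is needed to compute the tensor's value. Input can be supplied using input readers (the approach in the tutorial) or using placeholders (which I will use below).

Now for the actual answer: the model in the tutorial was designed to read its input data from a file. The answer by @user3080953 already showed how to work with your own text file, but as I understand it you need more control over how the data is fed to the model. To get that, you need to define your own placeholders and feed the data to those placeholders when calling session.run().

In the code below I subclassed PTBModel and made it responsible for explicitly feeding data to the model. I introduced a special PTBInteractiveInput that has an interface similar to PTBInput, so you can reuse the functionality in PTBModel. To train your model you still need PTBModel.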

import numpy as np
import tensorflow as tf

# Assumes ptb_word_lm.py from the tutorial is importable.
from ptb_word_lm import PTBModel


class PTBInteractiveInput(object):
  def __init__(self, config):
    self.batch_size = 1
    self.num_steps = config.num_steps
    self.input_data = tf.placeholder(dtype=tf.int32, shape=[self.batch_size, self.num_steps])
    self.sequence_len = tf.placeholder(dtype=tf.int32, shape=[])
    self.targets = tf.placeholder(dtype=tf.int32, shape=[self.batch_size, self.num_steps])


class InteractivePTBModel(PTBModel):

  def __init__(self, config):
    input_ = PTBInteractiveInput(config)
    PTBModel.__init__(self, is_training=False, config=config, input_=input_)
    # Logits at the last real (non-padded) position: shape [batch_size, vocab_size].
    output = self.logits[:, self._input.sequence_len - 1, :]
    # The time axis was dropped by the indexing above, so the vocabulary axis is 1.
    self.top_word_id = tf.argmax(output, axis=1)

  def get_next(self, session, prefix):
    prefix_array, sequence_len = self._preprocess(prefix)
    feeds = {
      self._input.sequence_len: sequence_len,
      self._input.input_data: prefix_array,
    }
    fetches = [self.top_word_id]
    result = session.run(fetches, feeds)
    return self._postprocess(result)

  def _preprocess(self, prefix):
    num_steps = self._input.num_steps
    prefix_ids = list(self._prefix_to_ids(prefix))
    seq_len = len(prefix_ids)
    if seq_len > num_steps:
      raise ValueError("Prefix too large for model.")
    # Pad with id 0 up to num_steps; only position seq_len - 1 is read back.
    prefix_ids.extend([0] * (num_steps - seq_len))
    # The placeholder is int32, so the fed array must hold integer ids.
    prefix_array = np.array([prefix_ids], dtype=np.int32)
    return prefix_array, seq_len

  def _prefix_to_ids(self, prefix):
    # should convert your prefix to a list of ids
    raise NotImplementedError

  def _postprocess(self, result):
    # convert ids back to strings
    raise NotImplementedError

In the __init__ function of PTBModel, you need to add this line:

self.logits = logits
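The two stubs in InteractivePTBModel (_prefix_to_ids and _postprocess) are left open in the answer. One possible way to fill them in is sketched here as standalone functions. The function names, the toy vocabulary, and the choice to split the prefix on whitespace are assumptions for illustration; in practice word_to_id would be the dict that reader.py builds from the training file.

```python
UNK = "<unk>"  # token the PTB data uses for out-of-vocabulary words


def prefix_to_ids(prefix, word_to_id):
    """Map a whitespace-separated prefix (or a list of words) to word ids."""
    words = prefix.split() if isinstance(prefix, str) else list(prefix)
    unk_id = word_to_id[UNK]
    return [word_to_id.get(w, unk_id) for w in words]


def ids_to_words(ids, word_to_id):
    """Map ids back to words by inverting word_to_id."""
    id_to_word = {i: w for w, i in word_to_id.items()}
    return [id_to_word[int(i)] for i in ids]


# Toy vocabulary for illustration only.
word_to_id = {"<unk>": 0, "the": 1, "stock": 2, "market": 3}
ids = prefix_to_ids("the stock exchange", word_to_id)
words = ids_to_words(ids, word_to_id)
```

Unknown words fall back to the <unk> id here rather than raising, which seems the more forgiving choice for interactive use.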

Why use a random (uninitialized, untrained) word-embedding?

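First of all, note that although the embeddings are random at the start, they are trained together with the rest of the network. The embeddings you obtain after training have properties similar to those obtained with word2vec models, e.g. the ability to answer analogy questions with vector operations (king - man + woman = queen, etc.). In tasks where you have a considerable amount of training data, such as language modeling (which does not need annotated training data) or neural machine translation, it is more common to train the embeddings from scratch.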

Why use softmax?

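Softmax is a function that normalizes a vector of similarity scores (the logits) into a probability distribution. You need a probability distribution to train your model with a cross-entropy loss and to be able to sample from the model. Note that if you are only interested in the most likely words of a trained model, you do not need the softmax and can use the logits directly.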
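A small NumPy sketch of this distinction, using made-up logits for a four-word vocabulary: the argmax is the same with or without softmax, while sampling and cross-entropy need the normalized probabilities.

```python
import numpy as np


def softmax(logits):
    """Normalise a logits vector into a probability distribution."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


logits = np.array([2.0, -1.0, 0.5, 3.0])   # made-up scores for a 4-word vocabulary
probs = softmax(logits)

# Softmax is monotonic, so the most likely word is the same either way.
best_from_logits = int(np.argmax(logits))
best_from_probs = int(np.argmax(probs))

# Sampling, on the other hand, needs actual probabilities.
rng = np.random.default_rng(0)
sampled_id = int(rng.choice(len(probs), p=probs))

# The cross-entropy loss for a target word id also needs the normalised values.
target_id = 3
loss = float(-np.log(probs[target_id]))
```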

Does the hidden layer have to match the dimension of the input (i.e. the dimension of the word2vec embeddings)

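No, in principle it can be any value. However, using a hidden state with a lower dimension than your embedding dimension does not make much sense.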

How/Can I bring in a pre-trained word2vec model, instead of that uninitialized one?

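Below is a self-contained example of how to initialize an embedding with a given numpy array. If you want the embedding to stay fixed/constant during training, set trainable to False.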

import tensorflow as tf
import numpy as np
vocab_size = 10000
size = 200
trainable=True
embedding_matrix = np.zeros([vocab_size, size]) # replace this with code to load your pretrained embedding
embedding = tf.get_variable("embedding",
                            initializer=tf.constant_initializer(embedding_matrix),
                            shape=[vocab_size, size],
                            dtype=tf.float32,
                            trainable=trainable)
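The placeholder np.zeros matrix has to be replaced by vectors whose rows line up with the model's word ids. A minimal sketch of that alignment step follows. It assumes the pre-trained vectors are already available as a plain dict from word to vector (how they get loaded from a word2vec file is left out), and build_embedding_matrix is a hypothetical helper, not part of the tutorial.

```python
import numpy as np


def build_embedding_matrix(word_to_id, pretrained, size, seed=0):
    """Return a [vocab_size, size] float32 matrix aligned with word_to_id.

    Words missing from `pretrained` keep a small random vector.
    """
    rng = np.random.default_rng(seed)
    matrix = rng.uniform(-0.1, 0.1, size=(len(word_to_id), size)).astype(np.float32)
    for word, idx in word_to_id.items():
        vector = pretrained.get(word)
        if vector is not None:
            matrix[idx] = vector   # row idx now holds the pre-trained vector
    return matrix


# Toy data standing in for real word2vec vectors.
word_to_id = {"<unk>": 0, "king": 1, "queen": 2}
pretrained = {"king": np.ones(3), "queen": np.full(3, 2.0)}
embedding_matrix = build_embedding_matrix(word_to_id, pretrained, size=3)
```

The resulting array can then be passed to the initializer in place of the zeros matrix.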

User

#3 · edited 2024-04-26 09:29:23

Main question

Loading words

Load custom data instead of using the test set:

reader.py@ptb_raw_data

test_path = os.path.join(data_path, "ptb.test.txt")
test_data = _file_to_word_ids(test_path, word_to_id)  # change this line

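test_data should contain word ids (print out word_to_id to see the mapping). As an example, it should look like: [1, 52, 562, 246] ...

Displaying predictions

We need to return the output of the FC layer (logits) in the call to sess.run.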

ptb_word_lm.py@PTBModel.__init__

    logits = tf.reshape(logits, [self.batch_size, self.num_steps, vocab_size])
    self.top_word_id = tf.argmax(logits, axis=2)  # add this line

ptb_word_lm.py@run_epoch

  fetches = {
      "cost": model.cost,
      "final_state": model.final_state,
      "top_word_id": model.top_word_id # add this line
  }

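Later in the function, vals['top_word_id'] will hold an array of integers with the id of the top word at each position. Look these ids up in word_to_id (inverted, since it maps words to ids) to determine the predicted words. I tried this with the small model, and the top-1 accuracy was pretty low (20-30% iirc), even though the perplexity matched what the header of the tutorial file predicted.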
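Low top-1 accuracy alongside the expected perplexity is not a contradiction, because the two metrics measure different things. A small NumPy sketch with made-up probabilities (not output from the PTB model) shows how each is computed:

```python
import numpy as np

# Made-up predicted distributions for 4 positions over a 3-word vocabulary.
probs = np.array([
    [0.5, 0.3, 0.2],
    [0.4, 0.35, 0.25],
    [0.2, 0.45, 0.35],
    [0.3, 0.3, 0.4],
])
targets = np.array([0, 1, 2, 2])   # the words that actually came next

top_word_id = np.argmax(probs, axis=1)               # analogous to vals["top_word_id"]
accuracy = float(np.mean(top_word_id == targets))    # fraction of exact hits

# Perplexity is exp of the mean negative log-probability given to the true word.
target_probs = probs[np.arange(len(targets)), targets]
perplexity = float(np.exp(-np.mean(np.log(target_probs))))
```

Here only half of the top predictions are right, yet the true word always receives a fair share of probability, so the perplexity stays moderate. Perplexity gives credit for near misses; top-1 accuracy does not.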

Sub-questions

Why use a random (uninitialized, untrained) word-embedding?

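You would have to ask the authors, but in my opinion training the embeddings makes this more of a standalone tutorial: instead of treating the embedding as a black box, it shows how it works.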

Why use softmax?

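The final prediction is not determined by cosine similarity to the output of the hidden layer. There is an FC layer after the LSTM that converts the embedded state into a vector of vocabulary size, i.e. scores that play the role of a one-hot encoding of the final word.

Here is a sketch of the operations and dimensions in the neural net: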

word -> one hot code (1 x vocab_size) -> embedding (1 x hidden_size) -> LSTM -> FC layer (1 x vocab_size) -> softmax (1 x vocab_size)
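The shapes in this pipeline can be checked with a few lines of NumPy. The sizes and weights below are toy values, and a tanh stands in for the LSTM since only the shapes matter here. The sketch also shows that multiplying a one-hot row by the embedding matrix is the same as looking up one row.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 4          # toy sizes (assumption)

embedding = rng.normal(size=(vocab_size, hidden_size))
W_fc = rng.normal(size=(hidden_size, vocab_size))
b_fc = np.zeros(vocab_size)

word_id = 7
one_hot = np.zeros((1, vocab_size))
one_hot[0, word_id] = 1.0                # 1 x vocab_size

embedded = one_hot @ embedding           # 1 x hidden_size
# A one-hot row times a matrix just selects one row: an embedding lookup.
same_as_lookup = bool(np.allclose(embedded[0], embedding[word_id]))

lstm_output = np.tanh(embedded)          # stand-in for the LSTM, 1 x hidden_size
logits = lstm_output @ W_fc + b_fc       # FC layer: 1 x vocab_size
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()                  # softmax: 1 x vocab_size
```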

Does the hidden layer have to match the dimension of the input (i.e. the dimension of the word2vec embeddings)

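Technically, no. If you look at the LSTM equations, you will notice that x (the input) can be any size, as long as the weight matrix is adjusted accordingly.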
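To make that concrete, here is a minimal single-step LSTM in NumPy with an input size that differs from the hidden size. It is a sketch of the standard LSTM equations, not the tutorial's cell; the gate ordering inside the weight matrix and the omission of a forget-gate bias offset are arbitrary simplifications.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h, c, W, b):
    """One LSTM step. W has shape [input_size + hidden_size, 4 * hidden_size]."""
    gates = np.concatenate([x, h]) @ W + b
    i, j, f, o = np.split(gates, 4)   # input gate, candidate, forget gate, output gate
    new_c = sigmoid(f) * c + sigmoid(i) * np.tanh(j)
    new_h = sigmoid(o) * np.tanh(new_c)
    return new_h, new_c


rng = np.random.default_rng(0)
input_size, hidden_size = 3, 8        # deliberately different
W = rng.normal(size=(input_size + hidden_size, 4 * hidden_size))
b = np.zeros(4 * hidden_size)

x = rng.normal(size=input_size)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_step(x, h, c, W, b)
```

Only the first dimension of W depends on input_size, which is why the input and hidden sizes can be chosen independently.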

How/Can I bring in a pre-trained word2vec model, instead of that uninitialized one?

I don't know, sorry.
