TensorFlow: word embeddings with the CBOW model

3 answers

User

Reply #1 · edited 2024-04-28 15:05:39

Basically, yes:

For the text "the quick brown fox jumped over the lazy dog", the CBOW instances with a window size of 1 would be

([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ...
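As a rough illustration of the pairs listed above, here is a minimal sketch that builds (context, center) instances from a token list. The function name cbow_pairs and its window parameter are made up for this example and are not part of word2vec_basic.py.

```python
def cbow_pairs(tokens, window=1):
    """Return (context_words, center_word) pairs for a symmetric window."""
    pairs = []
    # Only positions with a full window on both sides produce a pair.
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs


tokens = "the quick brown fox jumped over the lazy dog".split()
pairs = cbow_pairs(tokens, window=1)
for context, center in pairs[:3]:
    print(context, "->", center)
```

Each instance keeps all context words together, which is the point the next reply makes about CBOW.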

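User

Reply #2 · edited 2024-04-28 15:05:39

I don't think the CBOW model can be implemented simply by flipping train_inputs and train_labels in Skip-gram, because the CBOW architecture uses the sum of the surrounding words' vectors as a single instance for the classifier to predict from. E.g., [the, brown] should be used together to predict quick, rather than using the alone to predict quick.

To implement CBOW, you have to write a new generate_batch generator function and sum the vectors of the surrounding words before applying logistic regression. I wrote an example you can refer to: https://github.com/wangz10/tensorflow-playground/blob/master/word2vec.py#L105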

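User

Reply #3 · edited 2024-04-28 15:05:39

For CBOW, you only need to change a few parts of word2vec_basic.py. Overall, the training structure and method stay the same.

Which parts of word2vec_basic.py should I change?

1) The way the training data pairs are generated, because in CBOW you predict the center word rather than the context words.

The new version of generate_batch would be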

import collections
import numpy as np

# `data` (list of word ids) and `data_index` are the globals from word2vec_basic.py
def generate_batch(batch_size, bag_window):
  global data_index
  span = 2 * bag_window + 1  # [ bag_window target bag_window ]
  batch = np.ndarray(shape=(batch_size, span - 1), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  buffer = collections.deque(maxlen=span)
  # fill the sliding window with the first `span` word ids
  for _ in range(span):
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size):
    buffer_list = list(buffer)
    # the middle word is the label; the remaining words form the context
    labels[i, 0] = buffer_list.pop(bag_window)
    batch[i] = buffer_list
    # slide the window forward by one word
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  return batch, labels

The new training data for CBOW would then be

data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the']

#with bag_window = 1:
    batch: [['anarchism', 'as'], ['originated', 'a'], ['as', 'term'], ['a', 'of']]
    labels: ['originated', 'as', 'a', 'term']

Compare this with the Skip-gram data

#with num_skips = 2 and skip_window = 1:
    batch: ['originated', 'originated', 'as', 'as', 'a', 'a', 'term', 'term', 'of', 'of', 'abuse', 'abuse', 'first', 'first', 'used', 'used']
    labels: ['as', 'anarchism', 'originated', 'a', 'term', 'as', 'a', 'of', 'term', 'abuse', 'of', 'first', 'used', 'abuse', 'against', 'first']
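To make the contrast concrete, here is a small sketch of how Skip-gram flattens each window into separate (center, context) pairs. The helper skipgram_pairs is hypothetical and deterministic; word2vec_basic.py samples the context words at random, so the order of labels within each window may differ from the listing above.

```python
def skipgram_pairs(tokens, skip_window=1):
    """Return flat (center, context) pairs, one per context word."""
    pairs = []
    for i in range(skip_window, len(tokens) - skip_window):
        for j in range(i - skip_window, i + skip_window + 1):
            if j != i:
                pairs.append((tokens[i], tokens[j]))
    return pairs


tokens = ['anarchism', 'originated', 'as', 'a', 'term']
pairs = skipgram_pairs(tokens)
batch = [center for center, _ in pairs]
labels = [context for _, context in pairs]
print(batch)
print(labels)
```

Skip-gram produces one training example per context word, while CBOW produces one example per center word with all context words grouped in a single row.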

2) Accordingly, you also need to change the shape of the input placeholder from

train_dataset = tf.placeholder(tf.int32, shape=[batch_size])

to

train_dataset = tf.placeholder(tf.int32, shape=[batch_size, bag_window * 2])

3) The loss function

loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=softmax_weights, biases=softmax_biases,
    inputs=tf.reduce_sum(embed, 1), labels=train_labels,
    num_sampled=num_sampled, num_classes=vocabulary_size))

Note the inputs=tf.reduce_sum(embed, 1), as Zichen Wang mentioned.

That's it!
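The shape change behind that reduce_sum can be checked with a small NumPy sketch. NumPy indexing stands in for tf.nn.embedding_lookup and .sum(axis=1) for tf.reduce_sum(embed, 1); the sizes and word ids below are arbitrary values chosen for illustration.

```python
import numpy as np

vocabulary_size, embedding_size = 10, 4  # illustrative sizes only
batch_size, bag_window = 3, 1

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(vocabulary_size, embedding_size))

# One row of context word ids per example, shaped like generate_batch output.
train_dataset = np.array([[0, 2], [1, 3], [2, 4]], dtype=np.int32)

embed = embeddings[train_dataset]  # stands in for tf.nn.embedding_lookup
summed = embed.sum(axis=1)         # stands in for tf.reduce_sum(embed, 1)

print(embed.shape)   # (batch_size, 2 * bag_window, embedding_size)
print(summed.shape)  # (batch_size, embedding_size)
```

After the sum, each example is a single vector again, so the rest of the sampled softmax setup can stay as it was for Skip-gram.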
