Why does tf.random.log_uniform_candidate_sampler return the true class?

Posted 2024-06-11 07:56:51


I am reading TensorFlow's word2vec tutorial: https://www.tensorflow.org/tutorials/text/word2vec#define_loss_function_and_compile_model

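The tutorial uses tf.random.log_uniform_candidate_sampler for negative sampling. Given a context class (the true class), the goal is to draw negative classes from the whole vocabulary. As I understand it, the negative classes must differ from the given context class. However, I found that the context class can appear among the negative classes returned by tf.random.log_uniform_candidate_sampler. The code is as follows: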

import tensorflow as tf
SEED = 42 

# encode the words
sentence = "The wide road shimmered in the hot sun"
tokens = list(sentence.lower().split())
vocab, index = {}, 1 # start indexing from 1
vocab['<pad>'] = 0 # add a padding token 
for token in tokens:
  if token not in vocab: 
    vocab[token] = index
    index += 1
vocab_size = len(vocab)
print(vocab)
inverse_vocab = {index: token for token, index in vocab.items()}
print(inverse_vocab)


# make (hot, the) as a context pair
target_word, context_word = 6, 1
print("target: {}, context: {}".format(inverse_vocab[target_word], inverse_vocab[context_word]))


# negative sampling
# Set the number of negative samples per positive context. 
num_ns = 4

context_class = tf.reshape(tf.constant(context_word, dtype="int64"), (1, 1))
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class, # class that should be sampled as 'positive'
    num_true=1, # each positive skip-gram has 1 positive context class
    num_sampled=num_ns, # number of negative context words to sample
    unique=True, # all the negative samples should be unique
    range_max=vocab_size, # pick index of the samples from [0, vocab_size)
    seed=SEED, # seed for reproducibility
    name="negative_sampling" # name of this operation
)
print("negative samples' indices:", negative_sampling_candidates)
print("negative samples: ", [inverse_vocab[index.numpy()] for index in negative_sampling_candidates])
# "the" can show up among the negative samples; if it does not, run this several times.
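
To see why index 1 ("the") turns up so often, it may help to look at the base distribution that the TensorFlow documentation gives for this sampler: P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1). The sketch below evaluates that formula with the standard library only, for the 8-entry vocabulary built above. The helper name log_uniform_prob is made up for illustration. Note that the formula depends only on the class id, and that with unique=True the sampler draws without repeats, so the chance of a class appearing somewhere in the 4 samples is higher than its per-draw probability.

```python
import math

def log_uniform_prob(class_id: int, range_max: int) -> float:
    """Per-draw probability of class_id under the documented log-uniform formula."""
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)

# Same vocabulary as the example above: 0 is '<pad>', 1 is 'the', 6 is 'hot'.
inverse_vocab = {0: "<pad>", 1: "the", 2: "wide", 3: "road",
                 4: "shimmered", 5: "in", 6: "hot", 7: "sun"}
range_max = len(inverse_vocab)  # 8, the same value as vocab_size

# The formula takes no true_classes argument: only the class id matters.
probs = {i: log_uniform_prob(i, range_max) for i in inverse_vocab}
for i, p in probs.items():
    print(f"{i} {inverse_vocab[i]:<10} {p:.3f}")
```

Low indices get the most probability mass, so "the" (index 1) is drawn roughly 18% of the time on any single draw, regardless of which class was passed as true_classes.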

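The word "the" is the context class of the word "hot", so why can it appear among the sampled negative classes? In addition, the target word "hot" can also be sampled as a negative class. Am I misunderstanding something?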
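
If the negatives really must exclude the context and target words, one possible workaround is to filter the candidates after sampling. This is only a minimal sketch in plain Python, not something the tutorial does: drop_accidental_hits is a hypothetical helper, and the sampled list is a made-up sampler output used for illustration, not a recorded result.

```python
def drop_accidental_hits(candidates, excluded):
    """Return candidates with any excluded class ids removed, order preserved."""
    excluded = set(excluded)
    return [c for c in candidates if c not in excluded]

# Hypothetical sampler output for illustration (not a recorded result).
sampled = [1, 0, 6, 3]
context_word, target_word = 1, 6  # 'the' and 'hot' in the example above

negatives = drop_accidental_hits(sampled, [context_word, target_word])
print(negatives)  # [0, 3]
```

With the real tensor, the list could come from negative_sampling_candidates.numpy().tolist(). Filtering can leave fewer than num_ns negatives, so one would need to sample extra candidates or resample to keep the count fixed.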

