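I am reading the TensorFlow word2vec tutorial: https://www.tensorflow.org/tutorials/text/word2vec#define_loss_function_and_compile_model
In this tutorial, tf.random.log_uniform_candidate_sampler is used for negative sampling. Given a context class (the true class), the goal is to draw negative classes from the whole vocabulary. As I understand it, the negative classes must differ from the given context class. However, I found that the context class can appear among the negative classes sampled by tf.random.log_uniform_candidate_sampler. The code is as follows: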
import tensorflow as tf
SEED = 42
# encode the words
sentence = "The wide road shimmered in the hot sun"
tokens = list(sentence.lower().split())
vocab, index = {}, 1 # start indexing from 1
vocab['<pad>'] = 0 # add a padding token
for token in tokens:
    if token not in vocab:
        vocab[token] = index
        index += 1
vocab_size = len(vocab)
print(vocab)
inverse_vocab = {index: token for token, index in vocab.items()}
print(inverse_vocab)
# use (hot, the) as the (target, context) pair
target_word, context_word = 6, 1
print("target: {}, context: {}".format(inverse_vocab[target_word], inverse_vocab[context_word]))
# negative sampling
# Set the number of negative samples per positive context.
num_ns = 4
context_class = tf.reshape(tf.constant(context_word, dtype="int64"), (1, 1))
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class,  # class that should be sampled as 'positive'
    num_true=1,  # each positive skip-gram has 1 positive context class
    num_sampled=num_ns,  # number of negative context words to sample
    unique=True,  # all the negative samples should be unique
    range_max=vocab_size,  # pick index of the samples from [0, vocab_size)
    seed=SEED,  # seed for reproducibility
    name="negative_sampling"  # name of this operation
)
print("negative samples' index:", negative_sampling_candidates)
print("negative samples:", [inverse_vocab[index.numpy()] for index in negative_sampling_candidates])
# "the" can show up among the negative samples; if it does not, run this several times.
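The word "the" is the context class of the word "hot", so why can it show up among the sampled negative classes? In addition, the target word "hot" can also be sampled as a negative class. Am I misunderstanding something?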
You are right. TensorFlow made a mistake here. See the bug report at https://github.com/tensorflow/tensorflow/issues/49490
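Until that is addressed, one possible workaround is to reject unwanted ids yourself. The sketch below is plain Python, not TensorFlow's implementation: the function names log_uniform_probs and sample_negatives are made up for illustration. It assumes the log-uniform distribution given in the TensorFlow documentation, P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1), and simply redraws whenever a candidate is the context class, another excluded id (here the padding index and the target word), or a duplicate.

```python
import math
import random


def log_uniform_probs(range_max):
    """Return P(k) = (log(k + 2) - log(k + 1)) / log(range_max + 1) for k in [0, range_max)."""
    norm = math.log(range_max + 1)
    return [(math.log(k + 2) - math.log(k + 1)) / norm for k in range(range_max)]


def sample_negatives(true_class, num_sampled, range_max, rng, excluded=()):
    """Draw unique negatives by rejection, skipping true_class and any ids in excluded."""
    banned = {true_class, *excluded}
    available = range_max - len(banned & set(range(range_max)))
    if available < num_sampled:
        raise ValueError("not enough candidate classes left to sample from")
    classes = list(range(range_max))
    probs = log_uniform_probs(range_max)
    negatives = []
    while len(negatives) < num_sampled:
        candidate = rng.choices(classes, weights=probs, k=1)[0]
        if candidate in banned or candidate in negatives:
            continue  # rejected: true class, excluded id, or duplicate
        negatives.append(candidate)
    return negatives


# Same setup as the question: vocab_size is 8, context "the" is 1, target "hot" is 6.
rng = random.Random(42)
negatives = sample_negatives(
    true_class=1, num_sampled=4, range_max=8, rng=rng, excluded=(0, 6)
)
print(negatives)
```

Note that rejection renormalizes the distribution over the remaining ids, so the sampling probabilities no longer match the expected counts that the TensorFlow sampler returns. Whether that matters depends on how the loss uses those counts; the tutorial discards them.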