What is the significance of the length of a Word2vec vector?

Posted 2024-06-06 05:02:57


I'm using Word2vec through gensim with Google's pre-trained Google News vectors. I've noticed that the word vectors I can access by doing direct index lookups on the Word2Vec object are not unit vectors:

>>> import numpy
>>> from gensim.models import Word2Vec
>>> w2v = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
>>> king_vector = w2v['king']
>>> numpy.linalg.norm(king_vector)
2.9022589

However, those non-unit vectors are not what the most_similar method uses; instead, it uses normalised versions from the undocumented .syn0norm attribute, which contains only unit vectors:

>>> w2v.init_sims()
>>> unit_king_vector = w2v.syn0norm[w2v.vocab['king'].index]
>>> numpy.linalg.norm(unit_king_vector)
0.99999994

The larger vector is just a scaled-up version of the unit vector:

>>> king_vector - numpy.linalg.norm(king_vector) * unit_king_vector
array([  0.00000000e+00,  -1.86264515e-09,   0.00000000e+00,
         0.00000000e+00,  -1.86264515e-09,   0.00000000e+00,
        -7.45058060e-09,   0.00000000e+00,   3.72529030e-09,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        ... (some lines omitted) ...
        -1.86264515e-09,  -3.72529030e-09,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00], dtype=float32)

Given that word similarity comparisons in Word2Vec are done by cosine similarity, it's not obvious to me what the lengths of the non-normalised vectors mean, although I assume they mean something, since gensim exposes them to me rather than exposing only the unit vectors in .syn0norm.
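Just to confirm that the length really does divide out of the cosine similarity, here is a quick check (the cosine helper and the arbitrary choice of 'queen' as a second word are only for illustration):

>>> def cosine(a, b):
...     return numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))
...
>>> queen_vector = w2v['queen']
>>> unit_queen_vector = w2v.syn0norm[w2v.vocab['queen'].index]
>>> assert numpy.isclose(cosine(king_vector, queen_vector), cosine(unit_king_vector, unit_queen_vector))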

How do the lengths of these non-normalised Word2vec vectors come about, and what do they mean? For which calculations does it make sense to use the normalised vectors, and when should the non-normalised ones be used?


Tags: import, numpy, norm, unit, word2vec, vector
2 Answers

I'll apologise in advance for being wordy.

The objective function of a word-embedding model is to maximise the log-likelihood of the data under the model. In word2vec this is achieved through a softmax over dot products: the dot product between the predicted vector (computed from the context) and the actual word's current vector is normalised with a softmax over the vocabulary, and training maximises the resulting probability.

Note that the training task is either to predict the word given its context or to predict the context given the word (skip-gram vs. CBOW). The length of a word vector has no meaning by itself, but the vectors themselves have interesting properties/applications.
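As a rough illustration of that objective (a minimal numpy sketch with made-up toy vectors, not gensim's actual training code), the skip-gram probability of an output word given a centre word is a softmax over the dot products between the centre word's vector and every output vector, and training pushes this probability up for observed (word, context) pairs:

import numpy as np

def skipgram_prob(center_vec, output_index, output_vectors):
    # P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)
    scores = output_vectors @ center_vec      # dot product with every output vector
    scores -= scores.max()                    # subtract the max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[output_index]

# toy example: vocabulary of 5 words, 3-dimensional vectors
rng = np.random.default_rng(0)
center_vec = rng.normal(size=3)               # "input" vector of the centre word
output_vectors = rng.normal(size=(5, 3))      # "output" vectors of all 5 words
print(skipgram_prob(center_vec, output_index=2, output_vectors=output_vectors))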

To find similar words, you look for the words with the largest cosine similarity to a given word (after unit-normalising the vectors, this is equivalent to finding the words with the smallest Euclidean distance), which is exactly what the most_similar function does.
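The equivalence holds because, for unit vectors u and v, ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u·v = 2 - 2 u·v, so ranking by smallest Euclidean distance is the same as ranking by largest cosine similarity. A quick numpy check with two random unit vectors (purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=300)
v = rng.normal(size=300)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)           # unit-normalise both vectors

# squared Euclidean distance equals 2 - 2 * cosine similarity for unit vectors
print(np.allclose(np.sum((u - v) ** 2), 2 - 2 * np.dot(u, v)))  # True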

To find analogies, you can simply use the difference (or direction) vectors between the raw vector representations of the words. For example:

  • v('Paris') - v('France') ≈ v('Rome') - v('Italy')
  • v('good') - v('bad') ≈ v('happy') - v('sad')

In gensim:

import gensim

# load Google's pre-trained Google News vectors (same file as in the question)
model = gensim.models.Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# vector arithmetic: good - bad + sad, expecting a word close to 'happy'
model.most_similar(positive=['good', 'sad'], negative=['bad'])
[(u'wonderful', 0.6414928436279297),
 (u'happy', 0.6154338121414185),
 (u'great', 0.5803680419921875),
 (u'nice', 0.5683973431587219),
 (u'saddening', 0.5588893294334412),
 (u'bittersweet', 0.5544661283493042),
 (u'glad', 0.5512036681175232),
 (u'fantastic', 0.5471092462539673),
 (u'proud', 0.530515193939209),
 (u'saddened', 0.5293528437614441)]

References:

  1. GloVe: Global Vectors for Word Representation
  2. word2vec Parameter Learning Explained
  3. Linguistic Regularities in Continuous Space Word Representations
  4. Word2Vec

(Answer copied over to the related, previously unanswered question.)

I think the answer you're looking for is the one given by Adriaan Schakel and Benjamin Wilson in their 2015 paper Measuring Word Significance using Distributed Representations of Words. The key points:

When a word appears in different contexts, its vector gets moved in different directions during updates. The final vector then represents some sort of weighted average over the various contexts. Averaging over vectors that point in different directions typically results in a vector that gets shorter with increasing number of different contexts in which the word appears. For words to be used in many different contexts, they must carry little meaning. Prime examples of such insignificant words are high-frequency stop words, which are indeed represented by short vectors despite their high term frequencies ...


For given term frequency, the vector length is seen to take values only in a narrow interval. That interval initially shifts upwards with increasing frequency. Around a frequency of about 30, that trend reverses and the interval shifts downwards.

...

Both forces determining the length of a word vector are seen at work here. Small-frequency words tend to be used consistently, so that the more frequently such words appear, the longer their vectors. This tendency is reflected by the upwards trend in Fig. 3 at low frequencies. High-frequency words, on the other hand, tend to be used in many different contexts, the more so, the more frequently they occur. The averaging over an increasing number of different contexts shortens the vectors representing such words. This tendency is clearly reflected by the downwards trend in Fig. 3 at high frequencies, culminating in punctuation marks and stop words with short vectors at the very end.

...

Graph showing the trend described in the previous excerpt

Figure 3: Word vector length v versus term frequency tf of all words in the hep-th vocabulary. Note the logarithmic scale used on the frequency axis. The dark symbols denote bin means, with the kth bin containing the frequencies in the interval [2^(k-1), 2^k - 1] for k = 1, 2, 3, .... These means are included as a guide to the eye. The horizontal line indicates the length v = 1.37 of the mean vector.


4 Discussion

Most applications of distributed representations of words obtained through word2vec so far centered around semantics. A host of experiments have demonstrated the extent to which the direction of word vectors captures semantics. In this brief report, it was pointed out that not only the direction, but also the length of word vectors carries important information. Specifically, it was shown that word vector length furnishes, in combination with term frequency, a useful measure of word significance.
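If you want to see this effect in the w2v object loaded in the question, a minimal sketch along these lines is below; the particular words are only illustrative guesses, and note that the Google News binary stores no real corpus frequencies, so true term frequencies cannot be read off the loaded model:

import numpy as np

# Compare raw (non-normalised) vector lengths of a few typical stop words with
# some more content-bearing words; per the paper, the stop words should tend to
# have shorter vectors. Words missing from the Google News vocabulary are skipped.
for word in ['the', 'of', 'and', 'but', 'king', 'engine', 'saddening', 'algebra']:
    if word in w2v.vocab:
        print(word, round(float(np.linalg.norm(w2v[word])), 3))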
