Code:
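# Assumes `nlp` is an already loaded spaCy English pipeline with word vectors
# (the original does not say which model was used).
doc = nlp("Apples and oranges are similar. Boots and hippos aren't.")
print(doc[0], doc[2], doc[6], doc[8])
apples = doc[0]
oranges = doc[2]
boots = doc[6]
hippos = doc[8]
# Token.similarity() compares the two tokens' word vectors
print(apples.similarity(oranges))
print(boots.similarity(hippos))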
Result: the similarity of apples and oranges comes out as 0, even though the spaCy documentation says that a higher return value means greater similarity. Why?
The code below shows why the similarity comes out wrong: the word vector of one of the tokens is not computed correctly.
doc = nlp('apples is apple. orange is not. oranges is nothing')

def dot_prd(a, b):
    # Cosine similarity computed by hand: dot(a, b) / (|a| * |b|)
    ans = 0
    sa, sb = 0, 0
    for i in range(len(a)):
        ans += a[i] * b[i]
        sa += a[i] * a[i]
        sb += b[i] * b[i]
    sa = sa ** 0.5
    sb = sb ** 0.5
    # No guard for a zero norm: an all-zero vector gives 0/0, i.e. nan
    return ans / (sa * sb)

print(doc[0], doc[2], doc[4], doc[8])
print(dot_prd(doc[0].vector, doc[2].vector), dot_prd(doc[0].vector, doc[4].vector),
      dot_prd(doc[0].vector, doc[8].vector), dot_prd(doc[4].vector, doc[8].vector))
print(doc[0].similarity(doc[2]), doc[0].similarity(doc[4]),
      doc[0].similarity(doc[8]), doc[4].similarity(doc[8]))
Output:
apples apple orange oranges
0.750411317806 0.51238496547 nan nan  # results of the manual cosine similarity
0.750411349583 0.512384940626 0.0 0.0  # token.similarity()
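The `nan` values in the first row and the `0.0` values in the second row have the same cause: one of the two vectors has zero length, so the cosine formula divides by zero. The following is a minimal sketch of a cosine similarity that guards against that case. The function name `cosine_similarity` is made up for illustration, and returning 0.0 for a zero vector is an assumption chosen to match what `token.similarity()` appears to do in the output above.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: report "no similarity" for an all-zero vector
        # instead of dividing by zero.
        return 0.0
    return dot / (norm_a * norm_b)

# Parallel vectors: the result is 1 up to floating-point rounding
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
# One all-zero vector: the guard returns 0.0
print(cosine_similarity([1.0, 2.0], [0.0, 0.0]))
```

A result of exactly 0.0 from such a function is therefore ambiguous: it can mean orthogonal vectors or a missing vector, which is why inspecting the vector itself is the more reliable check.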
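doc[8].vector is all zeros. So why is the vector for the token 'oranges' computed as all zeros?
The vectors for 'orange' and 'apple' are computed correctly. More importantly, the vector for 'apples' is also computed correctly. So why is 'oranges' a problem?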
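The token 'hippos' has a zero vector (presumably because the model has no vector for the word 'hippos').
You can check this by printing the vectors of these tokens: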
print(oranges.vector)
print(hippos.vector)
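Printing a full vector works but produces a lot of output. As a rough sketch, the same check can be done for every token at once using spaCy's `Token.has_vector` and `Token.vector_norm` attributes. The helper name `tokens_without_vectors` is hypothetical, and it only assumes that each item in `doc` exposes `text`, `has_vector`, and `vector_norm` the way spaCy tokens do.

```python
def tokens_without_vectors(doc):
    """Return the text of every token whose vector is missing or all zeros."""
    missing = []
    for token in doc:
        # has_vector is False when the model has no vector for the token;
        # vector_norm == 0 also catches a vector that exists but is all zeros.
        if not token.has_vector or token.vector_norm == 0:
            missing.append(token.text)
    return missing
```

With a loaded pipeline this would be called as `tokens_without_vectors(nlp("Apples and oranges are similar. Boots and hippos aren't."))`. Any token it returns will yield a similarity of 0.0 against everything else, regardless of meaning.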