Normalizing the similarity scores returned by gensim's most_similar_cosmul method

Posted 2024-06-08 22:45:47


I am trying to decide between the gensim methods most_similar() and most_similar_cosmul() for a project in which I want to find a set of words similar to an input list of words.

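Both methods take a parameter that limits the number of words in the result set (default: 10), but I would rather select words by a similarity threshold (say 0.5). This works well with both methods as long as the number of positive words contributing to the similarity is small.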

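However, while the regular most_similar() method (additive) returns fairly stable similarity scores for each output pair regardless of the number of input words, the most_similar_cosmul() method (multiplicative) returns smaller and smaller similarities as more input words are added.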

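Is there any way to normalize the similarities returned by most_similar_cosmul() so that they are independent of the number of input words? The most_similar_cosmul() method is based on the paper "Linguistic Regularities in Sparse and Explicit Word Representations" by Omer Levy and Yoav Goldberg.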

>>> print(model.wv.most_similar(['digital_transformation','digital']))
[('social_mobile', 0.6574317216873169), ('social_networking', 0.653410792350769), ('mobile', 0.6508731245994568), ('mobile_social', 0.6483871340751648), ('digitization', 0.6388466358184814), ('digital_platform', 0.6366733908653259), ('digitalization', 0.6243988871574402), ('omni-channel', 0.6230252385139465), ('multi-channel', 0.6205648183822632), ('digital_marketing', 0.6161972284317017)]

>>> print(model.wv.most_similar_cosmul(['digital_transformation','digital']))
[('social_mobile', 0.6048797965049744), ('social_networking', 0.6038507223129272), ('mobile_social', 0.5969080328941345), ('digitization', 0.594704270362854), ('mobile', 0.5940128564834595), ('digital_platform', 0.5880148410797119), ('digitalization', 0.585797905921936), ('omni-channel', 0.5844268798828125), ('multi-channel', 0.5779333710670471), ('digital_marketing', 0.575140118598938)]



>>> print(model.wv.most_similar(['digital_transformation','digital','virtual']))
[('mobile', 0.6831567883491516), ('social_networking', 0.6692748665809631), ('social_mobile', 0.6688156127929688), ('cloud-based', 0.6612646579742432), ('cloud', 0.6573182344436646), ('mobile_social', 0.656197190284729), ('connected', 0.6240546107292175), ('physical_digital', 0.6219688653945923), ('digital_platform', 0.6217926740646362), ('digital_content', 0.6217813491821289)]

>>> print(model.wv.most_similar_cosmul(['digital_transformation','digital','virtual']))
[('mobile', 0.4431973695755005), ('social_networking', 0.4394254982471466), ('social_mobile', 0.43778157234191895), ('cloud', 0.4324479103088379), ('cloud-based', 0.43241411447525024), ('mobile_social', 0.4277876913547516), ('connected', 0.4095992147922516), ('physical_digital', 0.40511685609817505), ('digital_content', 0.40396806597709656), ('digital_platform', 0.40359607338905334)]



>>> print(model.wv.most_similar(['digital_transformation','digital','virtual', 'online']))
[('mobile', 0.7285350561141968), ('online_mobile', 0.6857519745826721), ('social_networking', 0.6757094264030457), ('mobile_social', 0.6697002053260803), ('digital_platform', 0.6669844388961792), ('social_mobile', 0.6654024720191956), ('digital_content', 0.6573283076286316), ('physical_digital', 0.6509564518928528), ('multi-channel', 0.6482810974121094), ('cloud-based', 0.64772629737854)]

>>> print(model.wv.most_similar_cosmul(['digital_transformation','digital','virtual','online']))
[('mobile', 0.35193151235580444), ('social_networking', 0.3212329149246216), ('online_mobile', 0.31998541951179504), ('mobile_social', 0.31541353464126587), ('social_mobile', 0.3134937584400177), ('digital_platform', 0.31218039989471436), ('digital_content', 0.3066185712814331), ('physical_digital', 0.3035270869731903), ('cloud-based', 0.301998496055603), ('multi-channel', 0.30137890577316284)]
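
One possible way to make the scores above comparable, offered as a sketch rather than an established gensim feature: as far as I understand the implementation, most_similar_cosmul() multiplies one factor of the form (1 + cosine) / 2 per positive word, so with n positive words and no negative words the score is a product of n values between 0 and 1 and naturally shrinks as n grows. Taking the n-th root (the geometric mean of the factors) should remove that dependence. The helper name `normalize_cosmul` and its parameters are hypothetical and not part of gensim.

```python
def normalize_cosmul(results, n_positive):
    """Rescale cosmul scores by taking their n-th root.

    Assumption: `results` is a list of (word, score) pairs as returned by
    most_similar_cosmul() when called with positive words only, and
    `n_positive` is the number of positive words passed to that call.
    """
    if n_positive < 1:
        raise ValueError("n_positive must be at least 1")
    exponent = 1.0 / n_positive
    # The n-th root of a product of n factors is their geometric mean.
    return [(word, score ** exponent) for word, score in results]


# Top-ranked score from each of the three cosmul queries shown above,
# keyed by the number of positive input words.
top_scores = {
    2: 0.6048797965049744,
    3: 0.4431973695755005,
    4: 0.35193151235580444,
}
for n, score in top_scores.items():
    [(_, normalized)] = normalize_cosmul([("top_hit", score)], n)
    print(n, round(normalized, 2))
```

For the three queries above, the normalized top scores come out at roughly 0.78, 0.76 and 0.77, which is far more stable than the raw 0.60, 0.44 and 0.35. Two caveats: the sketch does not apply when negative words are passed, since the score is then divided by the product of their factors, and because each factor is (1 + cosine) / 2 rather than a raw cosine, a threshold tuned for most_similar() will probably need to be chosen again for the normalized values.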
