Does pyspark.ml Word2Vec produce token representations or document representations?

Posted 2024-06-11 21:53:07


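I want to run the Word2Vec model from pyspark.ml on some messages. However, I can't tell whether the output of Word2Vec is a representation of each token in the embedding space or a representation of the whole message.

I'll try to explain in more detail below. To train the model I use: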

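from pyspark.ml.feature import Word2Vec

# First I do tokenization and some pre-processing of the raw messages
word2vec = Word2Vec(vectorSize=200, minCount=1, inputCol='stop_token_1', outputCol='message_vector')
train_data = sub1.select("stop_token_1")
model = word2vec.fit(train_data)
# transform() adds the 'message_vector' column to each row
result = model.transform(train_data)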

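where stop_token_1 is the array of tokens for each message.

The result is a single 200-dimensional vector per message, so I can't tell whether the embedding is computed per message, or per token and then somehow "combined" to obtain a representation of the whole message.

For example: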

>>> raw_message
'TRANSFER  globus_ftp_client: the server responded with an error 451 Space reservation 229317465 may not be used for this write request $NET_PARAMS'

>>> stop_token_1
['transfer', 'globus_ftp_client', 'server', 'responded', 'error', '451', 'space', 'reservation',
 '229317465', 'may', 'used', 'write', 'request', '$NET_PARAMS']

>>> message_vector
DenseVector([0.1383, 0.0731, 0.0735, -0.1011, -0.2651, -0.1491, -0.0743, 0.1554, 0.0807, 0.0083, 0.0125, 0.0632, 0.0143, -0.2925, -0.1025, 0.0662, -0.0552, -0.0416, 0.4768, 0.028, 0.3636, 0.0161, -0.1932, 0.0136, 0.1203, 0.04, -0.0742, 0.0995, 0.1008, -0.1848, -0.1794, -0.1532, 0.2411, 0.061, -0.0591, 0.1438, 0.0573, 0.2359, 0.0235, -0.0542, 0.0342, -0.1296, -0.0514, -0.1395, -0.2488, 0.0153, -0.2528, -0.1863, -0.3326, 0.1942, -0.1115, 0.2059, 0.0032, -0.0932, -0.1096, -0.0338, 0.1247, 0.0659, -0.1029, -0.4558, -0.0696, -0.0091, 0.0063, -0.2967, 0.0407, -0.3435, -0.1092, -0.0012, -0.0123, 0.0415, 0.12, 0.1081, -0.1571, 0.077, 0.0684, 0.0333, 0.0216, -0.1167, 0.0642, -0.0535, 0.047, -0.0239, 0.1295, 0.1738, 0.1266, -0.2792, -0.1288, -0.0204, -0.1236, -0.1516, -0.0372, -0.0418, -0.1654, -0.0501, -0.1204, 0.5067, 0.2329, -0.1069, 0.2374, -0.2035, -0.1956, 0.0969, 0.2348, -0.2235, -0.0555, -0.1337, -0.0708, 0.0662, 0.1976, 0.0443, 0.0872, -0.2308, 0.2779, -0.0324, -0.1952, 0.1897, 0.1196, -0.1664, 0.0967, 0.0549, 0.1488, 0.2577, 0.115, -0.1392, 0.1867, -0.17, 0.0554, 0.0927, -0.0067, -0.0677, -0.1122, 0.2298, -0.1198, 0.0499, 0.218, -0.4892, 0.0931, 0.3249, 0.2583, 0.1882, -0.1192, -0.0642, 0.0434, -0.1576, -0.1845, -0.1952, -0.0742, 0.0647, -0.0457, -0.118, -0.3698, -0.0634, -0.0867, -0.0989, -0.1391, 0.1643, 0.228, 0.2647, -0.0762, -0.1662, 0.1132, 0.1889, -0.0765, 0.0423, -0.1134, 0.1383, -0.0436, -0.0012, -0.0045, -0.1597, -0.1178, -0.2414, -0.0145, 0.1089, -0.0622, 0.0764, -0.1599, 0.2012, -0.2633, 0.0207, -0.14, 0.1064, 0.1348, 0.2031, -0.3826, -0.0889, 0.1819, -0.0669, 0.008, -0.2096, 0.2398, -0.0499, 0.236, 0.1427, -0.026, -0.053, 0.0113, 0.0116, -0.0028, 0.1189])

>>> len(stop_token_1), len(message_vector)
(14, 200)
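
One check I can think of: if the message vector were simply the mean of the per-token vectors, I could recompute it by hand and compare it with message_vector. Below is a minimal sketch of that computation in plain Python, using made-up 2-dimensional toy vectors. The helper name mean_of_token_vectors is hypothetical, and dividing by the total number of tokens (including tokens that have no vector) is my own assumption, not something I have confirmed about Spark.

```python
from typing import Dict, List, Sequence


def mean_of_token_vectors(tokens: Sequence[str],
                          token_vectors: Dict[str, List[float]],
                          size: int) -> List[float]:
    """Average the vectors of the tokens in one message (hypothetical helper)."""
    total = [0.0] * size
    if not tokens:
        # An empty message gets an all-zero vector in this sketch.
        return total
    for tok in tokens:
        vec = token_vectors.get(tok)
        if vec is None:
            # A token without a vector contributes nothing to the sum.
            continue
        for i, x in enumerate(vec):
            total[i] += x
    # Assumption: divide by the number of tokens in the message,
    # not by the number of tokens that actually had a vector.
    return [x / len(tokens) for x in total]


# Toy data, not real model output.
toy_vectors = {"error": [1.0, 3.0], "server": [3.0, 5.0]}
averaged = mean_of_token_vectors(["error", "server"], toy_vectors, size=2)
print(averaged)
```

With the real model, the per-token vectors should be available from model.getVectors() (a DataFrame with word and vector columns), so the dictionary could be filled from that and the averaged result compared against message_vector for the same message. If the two match, the output would be a per-message average of token embeddings rather than a separately learned document embedding.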

I've been digging through the documentation but haven't found a definite answer. Is there an explanation?

Thanks :)

