pyspark Word2Vec raises an exception for Cyrillic words

Posted 2024-06-16 09:00:53


I am following the pyspark Word2Vec tutorial with some Twitter data to build vectors that I later want to use in KMeans.

When I run synonyms = model.findSynonyms('привет', 5), it raises py4j.protocol.Py4JJavaError (full traceback at the end).

I have tried:

synonyms = model.findSynonyms(u'привет'.encode('utf-8'), 10)
synonyms = model.findSynonyms(u'привет'.decode('utf-8'), 10)
synonyms = model.findSynonyms(u'\xd0\xbf\xd0\xb8\xd0\xb7\xd0\xb4\xd0\xb5\xd1\x86'.encode('utf-8'), 10)
synonyms = model.findSynonyms(u'\xd0\xbf\xd0\xb8\xd0\xb7\xd0\xb4\xd0\xb5\xd1\x86', 10)

The code I am running, taken from the tutorial:

from pyspark.mllib.feature import Word2Vec

# sc is the SparkContext provided by the pyspark shell
inp = sc.textFile("data/mllib/sample_lda_data.txt").map(lambda row: row.split(" "))

word2vec = Word2Vec()
model = word2vec.fit(inp)

synonyms = model.findSynonyms('1', 5)

for word, cosine_distance in synonyms:
    print("{}: {}".format(word, cosine_distance))

Expected output:

>>> for word, cosine_distance in synonyms:
...     print("{}: {}".format(word.encode('utf-8'), cosine_distance))
... 
look: 0.91164034605
phone: 0.910009503365
Been: 0.90544962883
number.: 0.904221653938
Look: 0.903845191002

But I cannot get that far, because findSynonyms() does not work with Cyrillic words:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hadoop/spark-2.4.3-bin-hadoop2.7/python/pyspark/mllib/feature.py", line 611, in findSynonyms
    words, similarity = self.call("findSynonyms", word, num)
  File "/home/hadoop/spark-2.4.3-bin-hadoop2.7/python/pyspark/mllib/common.py", line 146, in call
    return callJavaFunc(self._sc, getattr(self._java_model, name), *a)
  File "/home/hadoop/spark-2.4.3-bin-hadoop2.7/python/pyspark/mllib/common.py", line 123, in callJavaFunc
    return _java2py(sc, func(*args))
  File "/home/hadoop/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/home/hadoop/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/hadoop/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: <exception str() failed>
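
The last line of the traceback says <exception str() failed>, so the actual JVM error message is hidden. A minimal sketch of one way to get at it, assuming the error object is a py4j Py4JJavaError, which carries the wrapped Java exception in its java_exception attribute. The helper name describe_java_error is hypothetical and not part of pyspark or py4j:

```python
def describe_java_error(err):
    """Return the JVM-side description of a py4j error.

    Falls back to repr(err) when the object has no java_exception
    attribute (i.e. it is not a Py4JJavaError).
    """
    java_exc = getattr(err, "java_exception", None)
    if java_exc is None:
        return repr(err)
    # toString() is forwarded to the Java Throwable and returns text such as
    # "<exception class>: <message>".
    return java_exc.toString()


# Hypothetical usage in the pyspark shell, where `model` is the fitted
# Word2Vec model from the question:
#
# try:
#     synonyms = model.findSynonyms(u'привет', 5)
# except Exception as err:
#     print(describe_java_error(err))
```

Seeing the Java-side message should help tell an encoding problem apart from other causes, for example the word simply not being in the model's vocabulary. On Python 2 the returned text is unicode, so printing it may itself need an explicit .encode('utf-8').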
