Spark java.lang.NoSuchMethodE

Published 2024-04-20 16:35:10

User

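I ran the following UDF, which uses SciPy's cosine distance to compute cosine similarity, on Spark on YARN. I first tested it on a sample of 30 records. It ran fine and produced the cosine similarity matrix within 5 seconds.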

The code is as follows:

def cosineSimilarity(df):
    """Cosine similarity of each document with every other document."""

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType
    from scipy.spatial import distance

    # UDF: cosine similarity = 1 - cosine distance; None if either vector is missing
    cosine = udf(
        lambda v1, v2: (
            float(1 - distance.cosine(v1, v2))
            if v1 is not None and v2 is not None else None),
        DoubleType())

    # Cross product of the table with itself (a join with no condition),
    # so that every pair of documents ends up in one row
    left = df.withColumnRenamed('id', 'id_1').withColumnRenamed('w2v_vector', 'w2v_vector_1')
    right = df.withColumnRenamed('id', 'id_2').withColumnRenamed('w2v_vector', 'w2v_vector_2')
    crosstabDF = left.join(right)

    similardocs_df = crosstabDF.withColumn('cosinesim', cosine("w2v_vector_1", "w2v_vector_2"))

    return similardocs_df

# Full data set (commented out):
#similardocs_df = cosineSimilarity(w2vdf.select('id', 'w2v_vector'))

# 30-record sample:
similardocs_df = cosineSimilarity(w2vdf_sample.select('id', 'w2v_vector'))
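For reference, the value the UDF returns for one pair of vectors can be sketched in plain Python without Spark or SciPy. This is only an illustration of the formula, not the code from the question: the function name `cosine_similarity` is hypothetical, and it assumes plain lists of numbers with no all-zero vector.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine similarity of two equal-length numeric sequences.

    Returns None if either input is None, mirroring the guard in the UDF.
    Assumes neither vector is all zeros (that case would divide by zero here).
    """
    if v1 is None or v2 is None:
        return None
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    # Cosine distance is 1 - dot / (norm1 * norm2), so
    # 1 - distance reduces to the ratio below.
    return dot / (norm1 * norm2)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
missing = cosine_similarity(None, [1.0, 0.0])                # one side missing
```

Parallel vectors give a similarity of about 1, perpendicular vectors give 0, and a missing vector gives None, which is what the `cosinesim` column would be expected to hold for the corresponding rows.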

I then tried to pass in the whole data set (58K records). It ran for a while and then failed with the following error:

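I should mention that it did once run on the full data within 5 minutes. Now, however, it gives me this error on the full data, while it still runs without problems on the sample.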

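[The error output is missing from the source; only the truncated exception name in the title, java.lang.NoSuchMethodE, survives.]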
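To put the two runs in proportion, the number of rows the unconditioned self-join produces can be worked out directly. The sketch below assumes "58K" means exactly 58,000 records, and the helper name `cross_join_rows` is hypothetical. It only illustrates the difference in scale between the sample and the full run; it is not a diagnosis of the error.

```python
def cross_join_rows(n_records):
    """Rows produced by joining an n-record table with itself with no join condition."""
    return n_records * n_records


sample_rows = cross_join_rows(30)       # the 30-record sample
full_rows = cross_join_rows(58000)      # assuming "58K" means exactly 58,000 records

print(sample_rows)   # 900
print(full_rows)     # 3364000000
```

The full run therefore evaluates the Python UDF a few billion times, compared with 900 times for the sample.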


1 answer

User

#1 · Posted 2024-04-20 16:35:10

I ran into this error in PySpark as well, and I solved it by adding some jars to the spark-submit command.

--jars /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark/lib/spark-examples-1.6.0-cdh5.9.0-hadoop2.6.0-cdh5.9.0.jar
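As a rough sketch of what the answer describes, the snippet below assembles a spark-submit command line that passes the jar through the `--jars` option, which takes a comma-separated list. It assumes the line above is the value of that option. The script name `cosine_job.py` and the helper `build_spark_submit_cmd` are hypothetical, and whether this particular jar fixes the error depends on the cluster.

```python
def build_spark_submit_cmd(script, jars, master="yarn"):
    """Build the argument list for spark-submit with extra jars on the classpath."""
    # --jars expects one comma-separated string, not one flag per jar
    return ["spark-submit", "--master", master, "--jars", ",".join(jars), script]


# Jar path taken from the answer; adjust it to the local CDH parcel layout.
EXAMPLES_JAR = (
    "/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark/lib/"
    "spark-examples-1.6.0-cdh5.9.0-hadoop2.6.0-cdh5.9.0.jar"
)

# "cosine_job.py" is a hypothetical name for the script containing the UDF.
cmd = build_spark_submit_cmd("cosine_job.py", [EXAMPLES_JAR])
print(" ".join(cmd))
```

On a machine where spark-submit is installed, the list could be handed to `subprocess.run(cmd, check=True)`. Building it as a list rather than a single string avoids shell quoting problems.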
