PySpark cosine similarity on a DataFrame

Customer1  Customer2  v_cust1  v_cust2  cosine_sim
1          2          0.9      0.1      0.1
1          3          0.3      0.4      0.9
1          4          0.2      0.9      0.15
2          1          0.8      0.8      1

1 answer

User

#1 · Posted 2024-04-24 01:31:29

If you are willing to use pandas_udf, this will be more efficient.

It performs better than a regular Spark UDF on vectorized operations; see "Introducing Pandas UDF for PySpark".

import numpy as np
from pyspark.sql.functions import PandasUDFType, pandas_udf
import pyspark.sql.functions as F

# Names of the input columns and of the result column
a, b = "v_cust1", "v_cust2"
cosine_sim_col = "cosine_sim"

# Add a placeholder column for the result: a grouped-map pandas_udf must
# return a DataFrame that matches the declared schema, which here is df.schema.
df = df.withColumn(cosine_sim_col, F.lit(1.0).cast("double"))

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def cos_sim(pdf):
    # pdf is a pandas DataFrame holding all rows of one group
    pdf[cosine_sim_col] = float(np.dot(pdf[a], pdf[b]) / (np.linalg.norm(pdf[a]) * np.linalg.norm(pdf[b])))
    return pdf


# Assuming you want to group by Customer1 and Customer2, so that each
# group's rows form the two vectors
df2 = df.groupby(["Customer1", "Customer2"]).apply(cos_sim)

# If you want to pass entire columns instead, add a column with the same
# value in every row and group by it. For example:
df3 = df.withColumn("group", F.lit("group_a")).groupby("group").apply(cos_sim)
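
A grouped-map pandas UDF receives each group as an ordinary pandas DataFrame, so the arithmetic inside cos_sim can be checked locally without a Spark session. The following is only a minimal sketch under that assumption: the rows are the sample values from the question treated as a single group, and the names `cos_sim_local` and `sample` are hypothetical, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Column names, matching the answer above.
a, b = "v_cust1", "v_cust2"
cosine_sim_col = "cosine_sim"


def cos_sim_local(pdf: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical local stand-in for cos_sim: same arithmetic, no Spark."""
    pdf = pdf.copy()  # avoid mutating the caller's frame
    x = pdf[a].to_numpy()
    y = pdf[b].to_numpy()
    # Cosine similarity: dot product divided by the product of the norms.
    pdf[cosine_sim_col] = float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
    return pdf


# Sample rows from the question, treated as one group (the "group_a" case).
sample = pd.DataFrame({
    "Customer1": [1, 1, 1, 2],
    "Customer2": [2, 3, 4, 1],
    a: [0.9, 0.3, 0.2, 0.8],
    b: [0.1, 0.4, 0.9, 0.8],
})

result = cos_sim_local(sample)
print(result[cosine_sim_col].iloc[0])  # approximately 0.644
```

Every row in the group receives the same value, because the function reduces the two columns to a single number. A group containing only one row reduces to two one-element vectors, whose cosine similarity is 1 for positive values, so grouping by Customer1 and Customer2 is likely to be informative only when each pair has several rows.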
