PySpark: calling a subset function inside a UDF

def check_neighbours(distance):
    df = rangequery(a,distances, 9)
    if df.count()>=1:
        return "Has Neighbours"
    else:
        return "No Neighbours"

udf_neigh=udf(check_neighbours, StringType())
a.withColumn("label", udf_neigh( a["distances"])).show()

PicklingError: Could not serialize object: Py4JError: An error occurred while calling o380.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:272)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)

1 answer

User

Answer 1 · posted 2024-06-02 08:38:32

Borrowing heavily from this answer, here is one way to do it. Consider the following example:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import FloatType
# create dummy dataset
DB = sqlCtx.createDataFrame(
    [("A", [0,1]), ("B", [5,9]), ("D", [13,5])],
    ["Letter", "distances"]
)

# Define your distance metric as a udf 
from scipy.spatial import distance
distance_udf = udf(lambda x, y: float(distance.euclidean(x, y)), FloatType())

# Use crossJoin() to compute distances.
eps = 9  # maximum distance at which two points still count as neighbours
DB.alias("l")\
    .crossJoin(DB.alias("r"))\
    .where(distance_udf(col("l.distances"), col("r.distances")) < eps)\
    .groupBy("l.letter", "l.distances")\
    .count()\
    .withColumn("count", col("count") - 1)\
    .withColumn("label", udf(lambda x: "Has Neighbours" if x >= 1 else "No Neighbours")(col("count")))\
    .sort('letter')\
    .show()

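Output: the printed table is not preserved here. For this data the result works out to count 0 and "No Neighbours" for A, and count 1 and "Has Neighbours" for both B and D, since B and D are the only pair closer than eps (their distance is √80 ≈ 8.94).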
The .withColumn("count", col("count") - 1) step is there because every row trivially counts itself as a neighbour. (You can drop this line if that suits your needs.)
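To make the counting logic easier to follow, here is a minimal plain-Python sketch of what the cross join computes, without Spark. The helper name `neighbour_counts` is made up for illustration, and `math.dist` (Python 3.8+) stands in for `scipy.spatial.distance.euclidean`.

```python
from itertools import product
from math import dist

# Same dummy data as the DataFrame above.
rows = [("A", [0, 1]), ("B", [5, 9]), ("D", [13, 5])]
eps = 9

def neighbour_counts(rows, eps):
    """Count, for each row, the other rows that lie closer than eps."""
    counts = {letter: 0 for letter, _ in rows}
    # product(rows, repeat=2) plays the role of the cross join:
    # every row is paired with every row, including itself.
    for (l_letter, l_point), (_, r_point) in product(rows, repeat=2):
        if dist(l_point, r_point) < eps:
            counts[l_letter] += 1
    # Each row matched itself once (distance 0), so subtract that pair.
    return {letter: n - 1 for letter, n in counts.items()}

counts = neighbour_counts(rows, eps)
labels = {
    letter: "Has Neighbours" if n >= 1 else "No Neighbours"
    for letter, n in counts.items()
}
print(counts)
print(labels)
```

The pairwise loop is quadratic in the number of rows, which is also true of the cross join, so this approach is best suited to moderately sized data.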
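As @user8371915 points out in the linked post, the code you wrote cannot work:
you cannot reference distributed DataFrame in udf
Spark has to pickle the UDF, together with everything it references, in order to ship it to the executors. The DataFrame used inside check_neighbours wraps a Py4J handle to a JVM object, and that handle cannot be pickled, which is what the PicklingError above reports.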
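If the data is small, one possible alternative to the join is to let the function close over a plain Python list instead of a DataFrame, since ordinary lists can be pickled. The sketch below shows only that closure; `make_neighbour_checker` is a hypothetical helper, and the hard-coded `points` list is an assumption standing in for values collected to the driver beforehand.

```python
from math import dist

def make_neighbour_checker(points, eps):
    """Build a checker that references only a plain list, never a DataFrame."""
    def check_neighbours(distances):
        # Count the points closer than eps to the given point.
        near = sum(1 for p in points if dist(p, distances) < eps)
        # The point itself is assumed to appear exactly once in `points`,
        # so one match is always the point at distance 0.
        return "Has Neighbours" if near - 1 >= 1 else "No Neighbours"
    return check_neighbours

# Assumption: a small dataset, already gathered into a local list.
points = [[0, 1], [5, 9], [13, 5]]
check = make_neighbour_checker(points, eps=9)

for p in points:
    print(p, check(p))
```

In Spark, `check` could then be wrapped with `udf(check, StringType())` and applied to the `distances` column. This only makes sense when the full list of points fits comfortably in memory on the driver and on every executor; otherwise the join-based approach is the safer choice.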
