Apache Spark——将UDF的结果分配给多个dataframe列

+------+----------+--------------------+ |amount|trans_date| test| +------+----------+--------------------+ | 28.0|2016-02-07| [14.0, 0.0]| | 31.01|2016-02-07|[15.5050001144409...| | 13.41|2016-02-04|[6.70499992370605...| | 307.7|2015-02-17|[153.850006103515...| | 22.09|2016-02-05|[11.0450000762939...| +------+----------+--------------------+ only showing top 5 rows

df.select('amount','trans_date').withColumn("test", test_udf("amount")).printSchema() root |-- amount: float (nullable = true) |-- trans_date: string (nullable = true) |-- test: string (nullable = true)

2条回答

网友

1楼 · 编辑于 2024-04-24 09:00:44

不能从一个UDF调用创建多个顶级列，但可以创建一个新的struct。它需要具有指定returnType的自定义项：

from pyspark.sql.functions import udf
from pyspark.sql.types import *

schema = StructType([
    StructField("foo", FloatType(), False),
    StructField("bar", FloatType(), False)
])

def udf_test(n):
    return (n / 2, n % 2) if n and n != 0.0 else (float('nan'), float('nan'))

test_udf = udf(udf_test, schema)
df = sc.parallelize([(1, 2.0), (2, 3.0)]).toDF(["x", "y"])

foobars = df.select(test_udf("y").alias("foobar"))
foobars.printSchema()
## root
##  |-- foobar: struct (nullable = true)
##  |    |-- foo: float (nullable = false)
##  |    |-- bar: float (nullable = false)

使用简单的select进一步展平架构：

foobars.select("foobar.foo", "foobar.bar").show()
## +---+---+
## |foo|bar|
## +---+---+
## |1.0|0.0|
## |1.5|1.0|
## +---+---+

另见Derive multiple columns from a single column in a Spark DataFrame

网友

2楼 · 编辑于 2024-04-24 09:00:44

您可以使用flatMap一次性将列获取所需的数据帧

df=df.withColumn('udf_results',udf)  
df4=df.select('udf_results').rdd.flatMap(lambda x:x).toDF(schema=your_new_schema)

相关问题更多 >

编程相关推荐

热门问题

热门文章