PySpark: split one column into multiple columns without pandas

1 answer

User

#1 · Posted on 2024-05-23 16:18:26

Spark >= 2.2

You can skip unix_timestamp and the cast, and use to_date or to_timestamp directly:

from pyspark.sql.functions import to_date, to_timestamp

df_test.withColumn("date", to_date("date", "dd-MMM-yy")).show()
## +---+----------+
## | id|      date|
## +---+----------+
## |  1|2015-07-14|
## |  2|2015-06-14|
## |  3|2015-10-11|
## +---+----------+


df_test.withColumn("date", to_timestamp("date", "dd-MMM-yy")).show()
## +---+-------------------+
## | id|               date|
## +---+-------------------+
## |  1|2015-07-14 00:00:00|
## |  2|2015-06-14 00:00:00|
## |  3|2015-10-11 00:00:00|
## +---+-------------------+

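Then apply the other date and time functions shown below.

Spark < 2.2

It is not possible to derive multiple top-level columns in a single access. You can use struct or collection types with a UDF, like this: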

from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql import Row
from pyspark.sql.functions import udf, col

schema = StructType([
  StructField("day", StringType(), True),
  StructField("month", StringType(), True),
  StructField("year", StringType(), True)
])

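def split_date_(s):
    # Split a "dd-MMM-yy" string into (day, month, year)
    try:
        d, m, y = s.split("-")
        return d, m, y
    except (AttributeError, ValueError):
        # None input, or a value that does not have exactly three parts
        return None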

split_date = udf(split_date_, schema)

transformed = df_test.withColumn("date", split_date(col("date")))
transformed.printSchema()

## root
##  |-- id: long (nullable = true)
##  |-- date: struct (nullable = true)
##  |    |-- day: string (nullable = true)
##  |    |-- month: string (nullable = true)
##  |    |-- year: string (nullable = true)
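
The result above still holds the three parts inside a single struct column. To end up with real top-level columns, the struct has to be expanded in a second step. A minimal sketch, assuming `transformed` is the DataFrame built above and has the `id` and `date` columns shown in the schema; `flatten_date` is a hypothetical helper name used only for illustration:

```python
def flatten_date(df):
    """Expand the struct column `date` into top-level columns.

    Assumes `df` has an `id` column and a struct column `date`
    with the fields day, month and year, as in the schema above.
    """
    # "date.*" selects every field of the struct as its own column,
    # so the result has the columns id, day, month, year.
    return df.select("id", "date.*")


# Usage (requires the Spark DataFrame from above):
# flatten_date(transformed).printSchema()
```

Selecting `"date.day"`, `"date.month"` and `"date.year"` one by one should work as well if only some of the fields are needed.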

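This UDF-based approach, however, is not only quite verbose in PySpark but also expensive.

For date-based transformations, you can simply use built-in functions: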

from pyspark.sql.functions import unix_timestamp, dayofmonth, year, date_format

transformed = (df_test
    .withColumn("ts",
        unix_timestamp(col("date"), "dd-MMM-yy").cast("timestamp"))
    .withColumn("day", dayofmonth(col("ts")).cast("string"))
    .withColumn("month", date_format(col("ts"), "MMM"))
    .withColumn("year", year(col("ts")).cast("string"))
    .drop("ts"))

Similarly, you could use regexp_extract to split the date string.

See also: Derive multiple columns from a single column in a Spark DataFrame
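
A rough sketch of the regexp_extract variant, assuming the `date` column holds strings such as "14-Jul-15" (the contents of df_test are not shown here, so this format is inferred from the "dd-MMM-yy" pattern used above). `DATE_PATTERN` and `split_with_regexp` are hypothetical names:

```python
import re

# One capture group per part of a "dd-MMM-yy" string such as "14-Jul-15".
DATE_PATTERN = r"^(\d{1,2})-([A-Za-z]{3})-(\d{2})$"


def split_with_regexp(df, date_col="date"):
    """Add day, month and year string columns extracted from `date_col`."""
    # Imported here so the pattern can be tried out without a Spark install.
    from pyspark.sql.functions import col, regexp_extract

    return (df
        .withColumn("day", regexp_extract(col(date_col), DATE_PATTERN, 1))
        .withColumn("month", regexp_extract(col(date_col), DATE_PATTERN, 2))
        .withColumn("year", regexp_extract(col(date_col), DATE_PATTERN, 3)))


# Quick check of the pattern itself with Python's re module.
print(re.match(DATE_PATTERN, "14-Jul-15").groups())  # ('14', 'Jul', '15')

# Usage (requires a Spark DataFrame):
# split_with_regexp(df_test).show()
```

Unlike the UDF, which returns null for malformed input, regexp_extract returns an empty string when the pattern does not match, so unparseable values may need separate handling.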

Note:

If you use a version that is not patched against SPARK-11724, a correction is required after unix_timestamp(...) and before cast("timestamp").
