Merging multiple columns of a pyspark dataframe into a single column in Python

Posted 2024-04-26 10:52:50


I need to merge multiple columns of a dataframe into a single column using pyspark in Python, with a list (or tuple) as the value of that column.

Input dataframe:

+-------+-------+-------+-------+-------+
| name  |mark1  |mark2  |mark3  | Grade |
+-------+-------+-------+-------+-------+
| Jim   | 20    | 30    | 40    |  "C"  |
+-------+-------+-------+-------+-------+
| Bill  | 30    | 35    | 45    |  "A"  |
+-------+-------+-------+-------+-------+
| Kim   | 25    | 36    | 42    |  "B"  |
+-------+-------+-------+-------+-------+

Output dataframe should be

+-------+-----------------+
| name  |marks            |
+-------+-----------------+
| Jim   | [20,30,40,"C"]  |
+-------+-----------------+
| Bill  | [30,35,45,"A"]  |
+-------+-----------------+
| Kim   | [25,36,42,"B"]  |
+-------+-----------------+

Tags: name, dataframe, input, output, list, pyspark, grade
3 Answers

Take a look at this documentation: https://spark.apache.org/docs/2.1.0/ml-features.html#vectorassembler

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

# Assemble the numeric mark columns into a single vector column.
# Note that VectorAssembler accepts only numeric input columns.
assembler = VectorAssembler(
    inputCols=["mark1", "mark2", "mark3"],
    outputCol="marks")

output = assembler.transform(dataset)
output.select("name", "marks").show(truncate=False)

If this is still relevant: you can use StringIndexer to encode the string values, replacing them with float indices, so the Grade column can be included as well.

The columns can also be merged with Spark's array function:

import pyspark.sql.functions as f

columns = [f.col("mark1"), f.col("mark2"), f.col("mark3"), f.col("Grade")]

output = input.withColumn("marks", f.array(columns)).select("name", "marks")

For the merge to succeed, you may need to cast the entries to a common type, since f.array requires all of its elements to share one type.
