I need to merge multiple columns of a dataframe into a single column using PySpark in Python, with a list (or tuple) as the value of that column.
Input dataframe:
+-------+-------+-------+-------+-------+
| name  | mark1 | mark2 | mark3 | Grade |
+-------+-------+-------+-------+-------+
| Jim   | 20    | 30    | 40    | "C"   |
+-------+-------+-------+-------+-------+
| Bill  | 30    | 35    | 45    | "A"   |
+-------+-------+-------+-------+-------+
| Kim   | 25    | 36    | 42    | "B"   |
+-------+-------+-------+-------+-------+
Output dataframe should be:
+-------+-----------------+
| name  | marks           |
+-------+-----------------+
| Jim   | [20,30,40,"C"]  |
+-------+-----------------+
| Bill  | [30,35,45,"A"]  |
+-------+-----------------+
| Kim   | [25,36,42,"B"]  |
+-------+-----------------+
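Take a look at this documentation: https://spark.apache.org/docs/2.1.0/ml-features.html#vectorassembler
If this is still relevant, you can use StringIndexer to encode the string values, replacing them with floating-point indices.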
Columns can be merged with Spark's array function.
For the merge to succeed, you may need to cast the entries to a common type.