PySpark: create an array column of a specific length from an existing array column

+-----+----+------------+------------+-------------+------------+
| Name| Age| P_Attribute|S_Attributes|P_Values     |S_values    |
+-----+----+------------+------------+-------------+------------+
| Bob1| 16 | [x1,x2]    | [x1,x3]    |["ab",1]     | [1,2]      |
| Bob2| 16 |[x1,x2,x3]  | []         |["a","b","c"]| []         |
+-----+----+------------+------------+-------------+------------+

+-----+----+------------+------------+
| Name| Age| Attribute  | Values     |
+-----+----+------------+------------+
| Bob1| 16 | x1         | ab         |
| Bob1| 16 | x2         | 1          |
| Bob1| 16 | x1         | 1          |
| Bob1| 16 | x3         | 2          |
| Bob2| 16 | x1         | a          |
| Bob2| 16 | x2         | b          |
| Bob2| 16 | x3         | c          |
+-----+----+------------+------------+

+-----+----+------------+------------+------------+
| Name| Age| Attribute  | type       |Value       |
+-----+----+------------+------------+------------+
| Bob1| 16 | x1         | 1          | ab         |
| Bob1| 16 | x2         | 1          | 1          |
| Bob1| 16 | x1         | 2          | 1          |
| Bob1| 16 | x3         | 2          | 2          |
| Bob2| 16 | x1         | 1          | a          |
| Bob2| 16 | x2         | 1          | b          |
| Bob2| 16 | x3         | 1          | c          |
+-----+----+------------+------------+------------+

+-----+----+------------+------------+------------+------------+
| Name| Age| P_Attribute|S_Attributes|P_type      |S_type      |
+-----+----+------------+------------+------------+------------+
| Bob1| 16 | [x1,x2]    | [x1,x3]    | [1,1]      | [2,2]      |
| Bob2| 16 |[x1,x2,x3]  | []         | [1,1,1]    | []         |
+-----+----+------------+------------+------------+------------+

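2 Answers

User

Answer 1 · edited 2024-04-27 05:05:03

I suggest using the higher-order function transform together with struct and array_union, then calling explode once and selecting both fields of the resulting struct.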

df.show()
#+----+---+------------+------------+
#|Name|Age| P_Attribute|S_Attributes|
#+----+---+------------+------------+
#|Bob1| 16|    [x1, x2]|    [x1, x3]|
#|Bob2| 16|[x1, x2, x3]|          []|
#+----+---+------------+------------+

from pyspark.sql import functions as F

# Tag each attribute with its source (1 = P, 2 = S), merge the two arrays, then explode once.
df.withColumn(
    "Attributes",
    F.explode(F.array_union(
        F.expr("transform(P_Attribute, x -> struct(x as Attribute, 1 as Type))"),
        F.expr("transform(S_Attributes, x -> struct(x as Attribute, 2 as Type))"),
    )),
).select("Name", "Age", "Attributes.*").show()

#+----+---+---------+----+
#|Name|Age|Attribute|Type|
#+----+---+---------+----+
#|Bob1| 16|       x1|   1|
#|Bob1| 16|       x2|   1|
#|Bob1| 16|       x1|   2|
#|Bob1| 16|       x3|   2|
#|Bob2| 16|       x1|   1|
#|Bob2| 16|       x2|   1|
#|Bob2| 16|       x3|   1|
#+----+---+---------+----+

UPDATE:

df.show()

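#+----+---+------------+------------+---------+--------+
#|Name|Age| P_Attribute|S_Attributes| P_Values|S_values|
#+----+---+------------+------------+---------+--------+
#|Bob1| 16|    [x1, x2]|    [x1, x3]|  [ab, 1]|  [1, 2]|
#|Bob2| 16|[x1, x2, x3]|          []|[a, b, c]|      []|
#+----+---+------------+------------+---------+--------+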

from pyspark.sql import functions as F

# Zip each attribute array with its value array so every element carries both fields,
# cast the values to string so both structs share one type, then merge and explode once.
df.withColumn(
    "Attributes",
    F.explode(F.array_union(
        F.expr("""transform(arrays_zip(P_Attribute, P_Values), x ->
                  struct(x.P_Attribute as Attribute, 1 as Type, string(x.P_Values) as Value))"""),
        F.expr("""transform(arrays_zip(S_Attributes, S_values), x ->
                  struct(x.S_Attributes as Attribute, 2 as Type, string(x.S_values) as Value))"""),
    )),
).select("Name", "Age", "Attributes.*").show()

#+----+---+---------+----+-----+
#|Name|Age|Attribute|Type|Value|
#+----+---+---------+----+-----+
#|Bob1| 16|       x1|   1|   ab|
#|Bob1| 16|       x2|   1|    1|
#|Bob1| 16|       x1|   2|    1|
#|Bob1| 16|       x3|   2|    2|
#|Bob2| 16|       x1|   1|    a|
#|Bob2| 16|       x2|   1|    b|
#|Bob2| 16|       x3|   1|    c|
#+----+---+---------+----+-----+

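User

Answer 2 · edited 2024-04-27 05:05:03

You can do the following. First collect P_Attribute and S_Attributes into a single Attributes column, then run posexplode on it. The position it returns serves as the type column identifying where each attribute came from (0 for P, 1 for S). Finally, explode the Attribute column once to flatten all the attributes.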

import pyspark.sql.functions as f

df = spark.createDataFrame([
    ['Bob1', 16, ['x1', 'x2'], ['x1', 'x3']],
    ['Bob2', 16, ['x1', 'x2', 'x3'], []]],
    ['Name', 'Age', 'P_Attribute', 'S_Attributes'])

df.withColumn('Attributes', f.array('P_Attribute', 'S_Attributes'))\
  .select('Name', 'Age', f.posexplode('Attributes').alias('type', 'Attribute'))\
  .withColumn('Attribute', f.explode('Attribute'))\
  .show()

+----+---+----+---------+
|Name|Age|type|Attribute|
+----+---+----+---------+
|Bob1| 16|   0|       x1|
|Bob1| 16|   0|       x2|
|Bob1| 16|   1|       x1|
|Bob1| 16|   1|       x3|
|Bob2| 16|   0|       x1|
|Bob2| 16|   0|       x2|
|Bob2| 16|   0|       x3|
+----+---+----+---------+
