PySpark: get the first value of a column for each group

| Id1 | id2  | row | grp |
| 12  | 1234 | 1   | 1   |
| 23  | 1123 | 2   | 1   |
| 45  | 2343 | 3   | 2   |
| 65  | 2345 | 1   | 2   |
| 67  | 3456 | 2   | 2   |

I need to retrieve the value of id2 corresponding to row = 1 and update all id2 values within a grp to that value. This should be the final result:

| Id1 | id2  | row | grp |
| 12  | 1234 | 1   | 1   |
| 23  | 1234 | 2   | 1   |
| 45  | 2345 | 3   | 2   |
| 65  | 2345 | 1   | 2   |
| 67  | 2345 | 2   | 2   |

2 answers

User

#1 · edited 2024-06-17 11:49:37

Try this:

from pyspark.sql import functions as F, Window as W


# Overwrite id2 with the id2 of the first row (lowest "row") in each grp.
df.withColumn(
    "id2",
    F.first("id2").over(
        W.partitionBy("grp")
        .orderBy("row")
        .rowsBetween(W.unboundedPreceding, W.currentRow)
    ),
).show()

+---+----+---+---+
|id1| id2|row|grp|
+---+----+---+---+
| 12|1234|  1|  1|
| 23|1234|  2|  1|
| 65|2345|  1|  2|
| 67|2345|  2|  2|
| 45|2345|  3|  2|
+---+----+---+---+

User

#2 · edited 2024-06-17 11:49:37

Very similar to @Steven's answer, but without using .rowsBetween.

Basically, you create a Window for each grp, sort the rows by row, and then pick the first value for each grp.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

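# With orderBy and no explicit frame, the default frame runs from the start
# of the partition to the current row, so first() still sees the first row.
w = Window.partitionBy('grp').orderBy('row')

# Overwrite id2 with the first id2 of each grp.
df = df.withColumn('id2', F.first('id2').over(w))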

df.show()

+---+----+---+---+
|Id1| id2|row|grp|
+---+----+---+---+
| 12|1234|  1|  1|
| 23|1234|  2|  1|
| 65|2345|  1|  2|
| 67|2345|  2|  2|
| 45|2345|  3|  2|
+---+----+---+---+
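
The three steps described in this answer (partition by grp, sort by row, take the first value) can also be sketched in plain Python without Spark, which may help when reasoning about what the window does. This is only an illustration of the semantics on the question's sample data; the function name fill_first_per_group and its parameters are hypothetical and not part of any library.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows copied from the question.
rows = [
    {"Id1": 12, "id2": 1234, "row": 1, "grp": 1},
    {"Id1": 23, "id2": 1123, "row": 2, "grp": 1},
    {"Id1": 45, "id2": 2343, "row": 3, "grp": 2},
    {"Id1": 65, "id2": 2345, "row": 1, "grp": 2},
    {"Id1": 67, "id2": 3456, "row": 2, "grp": 2},
]


def fill_first_per_group(rows, group_key="grp", order_key="row", value_key="id2"):
    """Return copies of rows where value_key holds the value of the row
    that sorts first (by order_key) within each group."""
    # Sorting by (group, order) plays the role of partitionBy + orderBy.
    ordered = sorted(rows, key=itemgetter(group_key, order_key))
    out = []
    for _, members in groupby(ordered, key=itemgetter(group_key)):
        members = list(members)
        # The first member of the sorted group supplies the value.
        first_value = members[0][value_key]
        out.extend({**m, value_key: first_value} for m in members)
    return out


result = fill_first_per_group(rows)
for r in result:
    print(r)
```

As with the window version, the value comes from the lowest row in each grp, so it equals the row = 1 value only if each grp has such a row.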
