PySpark: replace DataFrame values when the value is in a list

hashed_customer  firstname    lastname    email    order_id  status    timestamp
eater 1_uuid     1_firstname  1_lastname  1_email  12345     OPTED_IN  2020-05-14 20:45:15
eater 2_uuid     2_firstname  2_lastname  2_email  23456     OPTED_IN  2020-05-14 20:29:22
eater 3_uuid     3_firstname  3_lastname  3_email  34567     OPTED_IN  2020-05-14 19:31:55
eater 4_uuid     4_firstname  4_lastname  4_email  45678     OPTED_IN  2020-05-14 17:49:27

hashed_customer  firstname    lastname    email    order_id  status    timestamp
eater 1_uuid     NaN          NaN         NaN      12345     OPTED_IN  2020-05-14 20:45:15
eater 2_uuid     2_firstname  2_lastname  2_email  23456     OPTED_IN  2020-05-14 20:29:22
eater 3_uuid     NaN          NaN         NaN      34567     OPTED_IN  2020-05-14 19:31:55
eater 4_uuid     4_firstname  4_lastname  4_email  45678     OPTED_IN  2020-05-14 17:49:27

1 answer

User

#1 · Posted on 2024-05-26 09:20:26

If I were to replicate your exact logic, we could do the following (comments inline):

from pyspark.sql import functions as F

# collect the hashed_customer values from df2 into a plain Python list (done once)
ids = [row[0] for row in df2.select("hashed_customer").collect()]
cols_to_update = ['firstname', 'lastname', 'email']  # list of cols to update
# use when + otherwise in a loop for the cols_to_update:
# null where hashed_customer is in the list, the original value otherwise
cond = [F.when(F.col('hashed_customer').isin(ids), F.lit(None))
         .otherwise(F.col(col)).alias(col)
        for col in cols_to_update]
# select the changed columns and other columns
final = df1.select(*cond, *[a for a in df1.columns if a not in cols_to_update])
# order as the original dataframe
final.select(*df1.columns).show()

+---------------+-----------+----------+-------+--------+--------+-------------------+
|hashed_customer|  firstname|  lastname|  email|order_id|  status|          timestamp|
+---------------+-----------+----------+-------+--------+--------+-------------------+
|   eater 1_uuid|       null|      null|   null|   12345|OPTED_IN|2020-05-14 20:45:15|
|   eater 2_uuid|2_firstname|2_lastname|2_email|   23456|OPTED_IN|2020-05-14 20:29:22|
|   eater 3_uuid|       null|      null|   null|   34567|OPTED_IN|2020-05-14 19:31:55|
|   eater 4_uuid|4_firstname|4_lastname|4_email|   45678|OPTED_IN|2020-05-14 17:49:27|
+---------------+-----------+----------+-------+--------+--------+-------------------+
