How do I modify a column used in a join on Spark DataFrames when the join keys are given as a list?

data1 = [
    [1, '2018-07-31', 215, 'a'],
    [2, '2018-07-30', None, 'b'],
    [3, '2017-10-28', 201, 'c'],
]
df_1 = sqlCtx.createDataFrame(data1, ['application_number', 'application_dt', 'account_id', 'var1'])

data2 = [
    [1, '2018-07-31', 215, 'aaa'],
    [2, '2018-07-30', None, 'bbb'],
    [3, '2017-10-28', 201, 'ccc'],
]
df_2 = sqlCtx.createDataFrame(data2, ['application_number', 'application_dt', 'account_id', 'var2'])

+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
|                 1|    2018-07-31|       215|   a| aaa|
|                 3|    2017-10-28|       201|   c| ccc|
|                 2|    2018-07-30|      null|   b|null|
+------------------+--------------+----------+----+----+

+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
|                 1|    2018-07-31|       215|   a| aaa|
|                 3|    2017-10-28|       201|   c| ccc|
|                 2|    2018-07-30|      null|   b| bbb|
+------------------+--------------+----------+----+----+

join_elem = "df_1.application_number == df_2.application_number|df_1.application_dt == df_2.application_dt|F.coalesce(df_1.account_id,F.lit(0)) == F.coalesce(df_2.account_id,F.lit(0))".split("|")
join_elem_column = [eval(x) for x in join_elem]

1 Answer

User

#1 · Posted on 2024-04-19 02:38:15

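I would call this solution a workaround.

The problem here is that one of the key columns in the DataFrames contains Null values, and the OP wants those rows to be joined on the remaining key columns. Why not assign an arbitrary value to the Null and then apply the join? For the rows where the key is Null, this is effectively the same as joining on the remaining two keys.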

from pyspark.sql.functions import col, when

# Replace Null with an arbitrary sentinel value that has little
# chance of occurring in the dataset, e.g. -100000.
df_1 = df_1.withColumn('account_id', when(col('account_id').isNull(), -100000).otherwise(col('account_id')))
df_2 = df_2.withColumn('account_id', when(col('account_id').isNull(), -100000).otherwise(col('account_id')))

# Do a FULL join on all three key columns.
df = df_1.join(df_2, ['application_number', 'application_dt', 'account_id'], 'full')

# Replace the sentinel value with Null again.
df = df.withColumn('account_id', when(col('account_id') == -100000, None).otherwise(col('account_id')))
df.show()
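+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
|                 1|    2018-07-31|       215|   a| aaa|
|                 2|    2018-07-30|      null|   b| bbb|
|                 3|    2017-10-28|       201|   c| ccc|
+------------------+--------------+----------+----+----+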
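
To see why the sentinel makes a difference, here is a small pure-Python sketch of the idea. It is a toy model and not Spark code: `full_join` and `swap` are hypothetical helpers written only for this illustration, and rows are plain tuples. It assumes SQL-style equality, where a Null key never matches anything, not even another Null.

```python
SENTINEL = -100000  # same arbitrary stand-in value as in the answer above


def full_join(left, right, n_keys):
    """Toy full outer join on the first n_keys fields of each row.

    Mimics SQL equality: a key field that is None never matches.
    """
    def matches(a, b):
        return all(x is not None and x == y
                   for x, y in zip(a[:n_keys], b[:n_keys]))

    out, used = [], set()
    for l in left:
        hits = [i for i, r in enumerate(right) if matches(l, r)]
        used.update(hits)
        if hits:
            out += [l + right[i][n_keys:] for i in hits]
        else:
            # Unmatched left row: pad the right-hand value columns with None.
            out.append(l + (None,) * (len(right[0]) - n_keys))
    for i, r in enumerate(right):
        if i not in used:
            # Unmatched right row: pad the left-hand value columns with None.
            out.append(r[:n_keys] + (None,) * (len(left[0]) - n_keys) + r[n_keys:])
    return out


def swap(rows, old, new, idx=2):
    """Replace `old` with `new` in column `idx` (account_id) of every row."""
    return [r[:idx] + ((new if r[idx] == old else r[idx]),) + r[idx + 1:]
            for r in rows]


left = [(1, '2018-07-31', 215, 'a'), (2, '2018-07-30', None, 'b'), (3, '2017-10-28', 201, 'c')]
right = [(1, '2018-07-31', 215, 'aaa'), (2, '2018-07-30', None, 'bbb'), (3, '2017-10-28', 201, 'ccc')]

# Without the sentinel, the Null-keyed rows do not match each other.
plain = full_join(left, right, 3)

# With the sentinel: fill, join, then restore None.
fixed = swap(full_join(swap(left, None, SENTINEL), swap(right, None, SENTINEL), 3),
             SENTINEL, None)
```

In this model `plain` contains four rows, because the two rows for application 2 stay separate, while `fixed` contains three rows with `b` and `bbb` side by side. The workaround relies on the sentinel never appearing as a real `account_id`; if it can, a different value has to be chosen.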
