Create a new column and fill it with a column from another DataFrame

Posted 2024-06-07 09:22:38

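I want to fill a new DataFrame column with values from a main source. If the ID does not match, I want to fill the entry with NEWCUSTOMER. I tried the following, but it throws an error saying the column is not iterable.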

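Task: I have "old" customers and "new" customers. My goal is to label the "new" customers in testC, whose customerID does not appear in train, as NEWCUSTOMER. If a customer's customerID does exist in train, that customer in testC should get the customerCategory value from train.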

train.show(1)
testC = testC.withColumn("customerCategory", F.when(testC.customerID.contains(train.customerID),\
                                                    F.col(train.customerCategory)).otherwise("NEWCUSTOMER"))

+-----------+----------+------------+------+----+-----+--------------+-----+----------+----------+-----------+------+------------+--------------+----------------+
|orderItemID| orderDate|deliveryDate|itemID|size|color|manufacturerID|price|customerID|salutation|dateOfBirth| state|creationDate|returnShipment|customerCategory|
+-----------+----------+------------+------+----+-----+--------------+-----+----------+----------+-----------+------+------------+--------------+----------------+
|        148|2012-04-01|  2012-04-04|   651|  xl| blue|            46| 19.9|      1121|       Mrs|          ?|Berlin|  2012-04-01|             0|           GREEN|
+-----------+----------+------------+------+----+-----+--------------+-----+----------+----------+-----------+------+------------+--------------+----------------+
TypeError: Column is not iterable
TypeError                                 Traceback (most recent call last)
<command-3715636189631646> in <module>
      1 train.show(1)
      2 testC = testC.withColumn("customerCategory", F.when(testC.customerID.contains(train.customerID),\
----> 3                                                     F.col(train.customerCategory)).otherwise("NEWCUSTOMER"))

2 answers

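The structure of the starting dataset testC is not clear, but IIUC you can use a left join and then apply the fillna method only to the column of interest.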

testC\
    .join(train, on='customerID', how='left')\
    .fillna('NEWCUSTOMER', subset=['customerCategory'])
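
To make the intended result concrete, the row-level logic of "left join, then fill the gaps" can be modelled in plain Python. This is only a sketch of the semantics, not Spark code: the sample rows, the `add_customer_category` helper, and the assumption that each customerID maps to exactly one category are all made up for illustration.

```python
# Plain-Python model of "left join, then fill missing values" (not Spark code).
# All data below is hypothetical and exists only to illustrate the logic.

# Stand-in for train: customerID -> customerCategory
# (assumes one category per customer)
train_categories = {1121: "GREEN", 2002: "RED"}

# Stand-in for testC: one dict per row
test_rows = [
    {"orderItemID": 1, "customerID": 1121},
    {"orderItemID": 2, "customerID": 9999},  # not present in train
]


def add_customer_category(rows, categories, default="NEWCUSTOMER"):
    """Return new rows with customerCategory looked up by customerID."""
    result = []
    for row in rows:
        # dict.get plays the role of the left join plus fillna:
        # a customerID with no match gets the default instead of a null
        category = categories.get(row["customerID"], default)
        result.append({**row, "customerCategory": category})
    return result


labelled = add_customer_category(test_rows, train_categories)
for row in labelled:
    print(row)
```

Every row of the test data is kept, known customers pick up their category from train, and unknown ones fall back to NEWCUSTOMER. If train holds several rows per customer, as an order-level table might, the Spark join would probably need a de-duplicated lookup first to avoid multiplying rows.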

I would try a LEFT JOIN

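import pyspark.sql.functions as F

# One row per known customer, renamed so the columns do not clash after the join
train_df = train\
    .selectExpr('customerID as train_c_id', 'customerCategory as train_category')\
    .distinct()

# Left join from testC so that every testC row is kept;
# customers missing from train get null in the train_* columns
mergedDF = testC.join(train_df, testC['customerID'] == train_df['train_c_id'], how='left')

# Use the category from train when a match exists, otherwise NEWCUSTOMER
mergedDF = mergedDF.withColumn('customerCategory',
    F.expr("case when train_c_id is not null then train_category else 'NEWCUSTOMER' end"))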
