I have a dataset that records what people watched on TV, for how long, and on which network. It has the following columns:
TV ID - string
show_nw - Array which has TV show concatenated with Network
All_nws - Array which has network concatenated with Duration
All_shows - Array which has show concatenated with duration
Sample dataset:
TV_ID : a1001 , a1002, a1003
show_nw: ["TheFactsofLife#1001","Bewitched#1001","Survivor#1000","SEALTeam#1000","WhenWhalesWalkedJourneysinDeepTime#1002","PaidProgramming#1006"], ["AllEliteWrestlingDynamite#1003","TheAdjustmentBureau#1004","Charmed#1003"], ["TMJ4Now#1005"]
all_nws : ["1000#7062","1001#602","1002#40","1006#47"], ["1003#7328","1004#46"], ["1005#1543"]
all_shows : ["Bewitched#563","Survivor#6988","SEALTeam#74","WhenWhalesWalkedJourneysinDeepTime#40","PaidProgramming#47","TheFactsofLife#39"], ["Charmed#462","AllEliteWrestlingDynamite#6866","TheAdjustmentBureau#46"], ["TMJ4Now#1543"]
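Now, when I explode the arrays in the dataset:
from pyspark.sql.functions import col, explode, split

test_df = (
    df.select("tv_id", "all_shows", "all_nws")
    # Each explode multiplies the rows, so chaining two yields every show/network combination
    .withColumn("all_shows", explode("all_shows"))
    .withColumn("all_nws", explode("all_nws"))
    .withColumn("show", split(col("all_shows"), "#").getItem(0))
    .withColumn("network", split(col("all_nws"), "#").getItem(0))
)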
My output looks like this:
tv_id all_shows all_nws show network
a1001 Bewitched#563 1000#7062 Bewitched 1000
a1001 Bewitched#563 1001#602 Bewitched 1001
a1001 Bewitched#563 1002#40 Bewitched 1002
a1001 Bewitched#563 1006#47 Bewitched 1006
a1001 Survivor#6988 1000#7062 Survivor 1000
a1001 Survivor#6988 1001#602 Survivor 1001
a1001 Survivor#6988 1002#40 Survivor 1002
a1001 Survivor#6988 1006#47 Survivor 1006
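So basically, in the parent dataset, Bewitched and Survivor were watched only on network 1000, but after the explode both shows are associated with every network that the TV ID has. How do I get the correct dataset after exploding in this case?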
I think you need to zip the arrays before exploding. From Spark 2.4, use the arrays_zip function; for Spark < 2.4, you need to use a UDF.