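I am trying to join two PySpark DataFrames without losing any data. The easiest way to explain is with an example. Ideally I would also add the two counts together and sort by the total. If the desktop or phone column is null for a query, it should be 0 in the output.
I tried: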
desktop_df.join(phone_df, on='query')\
.fillna(0).orderBy("desktop", ascending=False)\
.show(20)
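(There is no total column yet, so I am ordering by the first count column, desktop.)
But this approach does not seem to work: it shows no zeros at all.
Desktop: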
query |desktop|
----------------
query1 | 12 |
----------------
query2 | 23 |
----------------
query3 | 8 |
----------------
query4 | 11 |
----------------
query6 | 45 |
----------------
query9 | 89 |
Phone:
query | phone |
----------------
query1 | 21 |
----------------
query2 | 33 |
----------------
query4 | 11 |
----------------
query5 | 55 |
----------------
query6 | 45 |
----------------
query7 | 1234 |
----------------
query8 | 4321 |
----------------
query10| 10 |
----------------
query11| 1 |
The output I am looking for:
query | desktop| phone | total |
--------------------------------
query8 | 0 | 4321 | 4321 |
--------------------------------
query7 | 0 | 1234 | 1234 |
--------------------------------
query6 | 45 | 45 | 90 |
--------------------------------
query9 | 89 | 0 | 89 |
--------------------------------
query2 | 23 | 33 | 56 |
--------------------------------
query5 | 0 | 55 | 55 |
--------------------------------
query1 | 12 | 21 | 33 |
--------------------------------
query4 | 11 | 11 | 22 |
--------------------------------
query10| 0 | 10 | 10 |
--------------------------------
query3 | 8 | 0 | 8 |
--------------------------------
query11| 0 | 1 | 1 |
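from pyspark.sql.functions import col
# show() returns None, so keep the DataFrame in df and display it separately
df = desktop_df.join(phone_df, on=["query"], how='fullouter').fillna(0).withColumn("total", col("desktop") + col("phone"))
df.show(200)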
Or:
from pyspark.sql.functions import lit
from pyspark.sql.functions import col
from pyspark.sql.functions import max
desktop_df = df.filter("hwType == 'DESKTOP'").groupby("query").count().orderBy("count", ascending=False).withColumnRenamed('count','desktop')
phone_df = df.filter("hwType == 'PHONE'").groupby("query").count().orderBy("count", ascending=False).withColumnRenamed('count','phone')
# add missing column to each dataframe
desktop_df = desktop_df.withColumn('phone', lit(0)).select('query', 'desktop', 'phone')
phone_df = phone_df.withColumn('desktop', lit(0)).select('query', 'desktop', 'phone')
# union all and agg to select max value
phone_df.unionAll(desktop_df).groupBy('query').agg(max(col('desktop')).alias('desktop'), max(col('phone')).alias('phone')) \
    .withColumn('total', col('desktop') + col('phone')) \
    .orderBy(col('total').desc()) \
    .show()
Maybe try an inner join on the query column, and get the total by adding the column values.
You can do this with unionAll followed by groupBy.