How to drop rows matching a condition within each group in PySpark

Posted 2024-05-15 08:26:41


I have a DataFrame named df with the following contents:

accountname |   clustername |   namespace   |   cost
account1    |   cluster_1_1 |   ns_1_1      |   10
account1    |   cluster_1_1 |   ns_1_2      |   11
account1    |   cluster_1_1 |   infra       |   12
account1    |   cluster_1_2 |   infra       |   12
account2    |   cluster_2_1 |   infra       |   13
account3    |   cluster_3_1 |   ns_3_1      |   10
account3    |   cluster_3_1 |   ns_3_2      |   11
account3    |   cluster_3_1 |   infra       |   12

I need to group df by the accountname field and, within each accountname, filter on clustername as follows: when a clustername has more than one row for its accountname, drop the row where namespace = infra; when a clustername has only a single row for its accountname, keep it. The expected result is:

accountname |   clustername |   namespace   |   cost
account1    |   cluster_1_1 |   ns_1_1      |   10
account1    |   cluster_1_1 |   ns_1_2      |   11
account1    |   cluster_1_2 |   infra       |   12
account2    |   cluster_2_1 |   infra       |   13
account3    |   cluster_3_1 |   ns_3_1      |   10
account3    |   cluster_3_1 |   ns_3_2      |   11

Since cluster_1_1 has multiple rows and one of them has namespace = "infra", that row is dropped. But cluster_1_2 and cluster_2_1 each have only one row, so they are kept. My code looks like this:

from pyspark.sql import SparkSession, Row

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Build the sample DataFrame from Row objects
fields = Row('accountname','clustername','namespace','cost')
s1 = fields("account1","cluster_1_1","ns_1_1",10)
s2 = fields("account1","cluster_1_1","ns_1_2",11)
s3 = fields("account1","cluster_1_1","infra",12)
s4 = fields("account1","cluster_1_2","infra",12)
s5 = fields("account2","cluster_2_1","infra",13)
s6 = fields("account3","cluster_3_1","ns_3_1",10)
s7 = fields("account3","cluster_3_1","ns_3_2",11)
s8 = fields("account3","cluster_3_1","infra",12)

fieldsData=[s1,s2,s3,s4,s5,s6,s7,s8]
df=spark.createDataFrame(fieldsData)
df.show()

Thanks in advance.


Tags: from, fields, df, sql, namespace, pyspark, cluster, ns
1 Answer

Answered 2024-05-15 08:26:41

Check this out: you can first compute a count of rows per cluster using a window function partitioned by accountname & clustername, and then filter with the negation of the condition "count greater than 1 and namespace = infra".

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window spanning all rows of the same (accountname, clustername) pair
w = Window.partitionBy("accountname", "clustername")

# Attach the per-cluster row count, then keep every row except those where
# the cluster has more than one row AND namespace == 'infra'
df.withColumn("count", F.count("clustername").over(w))\
    .filter(~((F.col("count")>1)&(F.col("namespace")=='infra')))\
    .drop("count").orderBy(F.col("accountname")).show()

+-----------+-----------+---------+----+
|accountname|clustername|namespace|cost|
+-----------+-----------+---------+----+
|   account1|cluster_1_1|   ns_1_1|  10|
|   account1|cluster_1_1|   ns_1_2|  11|
|   account1|cluster_1_2|    infra|  12|
|   account2|cluster_2_1|    infra|  13|
|   account3|cluster_3_1|   ns_3_1|  10|
|   account3|cluster_3_1|   ns_3_2|  11|
+-----------+-----------+---------+----+
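
For comparison, the same result can also be reached without a window function by aggregating the per-cluster row count and joining it back onto df. The snippet below is a minimal sketch assuming the same df as above; the counts variable and the "count" alias are illustrative names, not part of the original question.

from pyspark.sql import functions as F

# Alternative sketch: count rows per (accountname, clustername) with a
# groupBy aggregation, join the count back, then apply the same negated filter
counts = df.groupBy("accountname", "clustername").agg(F.count("*").alias("count"))

(df.join(counts, ["accountname", "clustername"], "left")
   .filter(~((F.col("count") > 1) & (F.col("namespace") == "infra")))
   .drop("count")
   .orderBy("accountname")
   .show())

Both versions express the same rule; the window approach avoids the extra join, while the groupBy/join version can be easier to read if the per-cluster count is reused elsewhere.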
