Filter rows of a Spark dataframe based on a count condition over a specific column value [spark.sql syntax in PySpark]

I have the following Spark dataframe:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

datalake_spark_dataframe_downsampled = pd.DataFrame(
                           {'id' : ['001', '001', '001', '001', '001', '002', '002', '002'],
                            'OuterSensorConnected': [0, 0, 0, 1, 0, 0, 0, 1],
                            'OuterHumidity': [31.784826, 32.784826, 33.784826, 43.784826, 23.784826, 54.784826, 31.784826, 31.784826],
                            'EnergyConsumption': [70, 70, 70, 70, 70, 70, 70, 70],
                            'DaysDeploymentDate': [10, 20, 21, 31, 41, 11, 19, 57],
                            'label': [0, 0, 1, 1, 1, 0, 0, 1]}
                           )
datalake_spark_dataframe_downsampled = spark.createDataFrame(datalake_spark_dataframe_downsampled)

# printSchema of the datalake_spark_dataframe_downsampled (spark df):

"root
 |-- id: string (nullable = true)
 |-- OuterSensorConnected: integer (nullable = false)
 |-- OuterHumidity: float (nullable = true)
 |-- EnergyConsumption: float (nullable = true)
 |-- DaysDeploymentDate: integer (nullable = true)
 |-- label: integer (nullable = false)"

As you can see, I have 5 rows for the first id '001' and 3 rows for the second id '002'. What I want is to filter out the rows of any id whose total count of positive labels ('1') is less than 2. Since the first id '001' has 3 positive labels (three rows with label 1), while the second id '002' has only one row with a positive label, I want to filter out all rows related to id '002'. So my final df would look like this:

datalake_spark_dataframe_downsampled_filtered = pd.DataFrame( 
                           {'id' : ['001', '001', '001', '001', '001'],
                            'OuterSensorConnected': [0, 0, 0, 1, 0], 
                            'OuterHumidity':[31.784826, 32.784826, 33.784826, 43.784826, 23.784826],
                            'EnergyConsumption': [70, 70, 70, 70, 70],
                            'DaysDeploymentDate': [10, 20, 21, 31, 41],
                            'label': [0, 0, 1, 1, 1]}
                           )
datalake_spark_dataframe_downsampled_filtered = spark.createDataFrame(datalake_spark_dataframe_downsampled_filtered)

How can I achieve this through a spark.sql() query?

datalake_spark_dataframe_downsampled.createOrReplaceTempView("df_filtered")

spark_dataset_filtered = spark.sql("""SELECT *, count(label) AS counted_label FROM df_filtered GROUP BY id HAVING counted_label >= 2""")  # how to only count the positive values here?
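For the conditional count itself, one option (a sketch, not part of the original post) is conditional aggregation: summing a CASE expression counts only the rows where label equals 1. Note this aggregates to one row per id, so the qualifying ids would still need to be joined back to the original table to recover every row:

qualifying_ids = spark.sql("""
    SELECT id,
           SUM(CASE WHEN label = 1 THEN 1 ELSE 0 END) AS positive_labels
    FROM df_filtered
    GROUP BY id
    HAVING SUM(CASE WHEN label = 1 THEN 1 ELSE 0 END) >= 2
""")
# With a 0/1 label column, SUM(label) is equivalent to the CASE expression.
qualifying_ids.show()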

Tags: id, true, dataframe, df, label, filtered, spark
1 Answer

How about using a window:

datalake_spark_dataframe_downsampled.createOrReplaceTempView("df_filtered")

spark.sql("""select * from (select *, sum(label) over (partition by id) as Sum_l
                      from df_filtered) where Sum_l >= 2""").drop("Sum_l").show()

+---+--------------------+-------------+-----------------+------------------+-----+
| id|OuterSensorConnected|OuterHumidity|EnergyConsumption|DaysDeploymentDate|label|
+---+--------------------+-------------+-----------------+------------------+-----+
|001|                   0|    31.784826|               70|                10|    0|
|001|                   0|    32.784826|               70|                20|    0|
|001|                   0|    33.784826|               70|                21|    1|
|001|                   1|    43.784826|               70|                31|    1|
|001|                   0|    23.784826|               70|                41|    1|
+---+--------------------+-------------+-----------------+------------------+-----+
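The same logic can also be written with the DataFrame API instead of spark.sql(). This is a sketch of the equivalent, assuming the original datalake_spark_dataframe_downsampled is in scope:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Sum the 0/1 label column per id; since label is binary, the sum is
# exactly the count of positive labels. Keep rows where that count >= 2.
w = Window.partitionBy("id")
filtered = (datalake_spark_dataframe_downsampled
            .withColumn("Sum_l", F.sum("label").over(w))
            .filter(F.col("Sum_l") >= 2)
            .drop("Sum_l"))
filtered.show()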
