PySpark sequence counting with a Pandas UDF

+----------+------+-----------+----+-----------+------------+
|      Date|column|   column_2|co_3|   column_4|    column_5|
+----------+------+-----------+----+-----------+------------+
|2016-12-14|     0|          0|   0|         14|           0|
|2016-12-14|     0|          0|   0|         14|           0|
|2016-12-14|     0|          0|   0|         18|           0|
|2016-12-14|     0|          0|   0|         19|           0|
|2016-12-14|     0|          0|   0|         20|           0|
|2016-12-14|     0|          0|   0|         26|           0|
|2016-12-14|     0|          0|   0|         60|           0|
|2016-12-14|     0|          0|   0|         63|           0|
|2016-12-14|     0|          0|   0|         78|           0|
|2016-12-14|     0|          0|   0|         90|           0|
+----------+------+-----------+----+-----------+------------+

sdf.filter(sdf.Date == "2016-12-14").sort("Date_Count").show()

+------------+----------+------+-----------+----+-----------+------------+---------+----------+--------+----------+-----+----------+
|Date_Convert|      Date|column|column_____|col_|column_____|column______|Date_Year|Date_Month|Date_Day|Date_Epoch|count|Date_Count|
+------------+----------+------+-----------+----+-----------+------------+---------+----------+--------+----------+-----+----------+
|  2016-12-14|2016-12-14|     0|          0|   0|         14|           0|     2016|        12|      14|1481673600|14504|         0|
|  2016-12-14|2016-12-14|     0|          0|   0|         18|           0|     2016|        12|      14|1481673600|14504|         0|
|  2016-12-14|2016-12-14|     0|          0|   0|         14|           0|     2016|        12|      14|1481673600|14504|         1|
|  2016-12-14|2016-12-14|     0|          0|   0|         18|           0|     2016|        12|      14|1481673600|14504|         1|
|  2016-12-14|2016-12-14|     0|          0|   0|         18|           0|     2016|        12|      14|1481673600|14504|         2|
|  2016-12-14|2016-12-14|     0|          0|   0|         14|           0|     2016|        12|      14|1481673600|14504|         2|
|  2016-12-14|2016-12-14|     0|          0|   0|         14|           0|     2016|        12|      14|1481673600|14504|         3|
+------------+----------+------+-----------+----+-----------+------------+---------+----------+--------+----------+-----+----------+

1 Answer

User

#1 · Posted on 2024-04-25 12:31:14

Combining a Window with the row_number function should solve this. I ordered by all of the columns, since you said:

dataset that has increasing values any columns for the same month...

You can order by just one column or by as many as you want.

from pyspark.sql import functions as f
from pyspark.sql import window as w

# One window per Date, ordered by every remaining column
windowSpec = w.Window.partitionBy("Date").orderBy("column", "column_2", "co_3", "column_4", "column_5")

# row_number() numbers the rows from 1 within each partition
df.withColumn('inc_count', f.row_number().over(windowSpec)).show(truncate=False)

It should give you a new inc_count column that numbers the rows within each Date, starting at 1.
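Since the question title mentions a Pandas UDF, the same per-Date numbering can also be sketched as a grouped pandas function. This is only a minimal sketch, not the method from the answer above: `add_inc_count` and `with_inc_count` are hypothetical names, the Spark wrapper assumes Spark 3.0 or later (for `applyInPandas`), and the sample rows reuse a few `column_4` values from the question's table.

```python
import pandas as pd

# Columns used for ordering inside each Date group (taken from the answer above)
SORT_COLS = ["column", "column_2", "co_3", "column_4", "column_5"]


def add_inc_count(pdf: pd.DataFrame) -> pd.DataFrame:
    """Sort one Date group by SORT_COLS and number its rows from 1."""
    out = pdf.sort_values(SORT_COLS).reset_index(drop=True)
    out["inc_count"] = range(1, len(out) + 1)
    return out


def with_inc_count(sdf):
    """Hypothetical wrapper: run add_inc_count once per Date on a Spark DataFrame."""
    # Imported lazily so the pandas part can be tried without Spark installed
    from pyspark.sql.types import LongType, StructField, StructType

    # Output schema = input columns plus the new counter column
    schema = StructType(sdf.schema.fields + [StructField("inc_count", LongType())])
    return sdf.groupBy("Date").applyInPandas(add_inc_count, schema=schema)


# Local check with plain pandas on a single Date group
sample = pd.DataFrame({
    "Date": ["2016-12-14"] * 4,
    "column": [0] * 4,
    "column_2": [0] * 4,
    "co_3": [0] * 4,
    "column_4": [18, 14, 19, 14],
    "column_5": [0] * 4,
})
result = add_inc_count(sample)
print(result[["Date", "column_4", "inc_count"]])
```

For this particular task the Window and row_number approach is likely the simpler choice, since it stays inside Spark and avoids moving each group into pandas.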
