Counting ids in a date range in PySpark

I have a PySpark DataFrame with the columns parsed_date (dtype: date) and id (dtype: bigint), as shown below:

+-------+-----------+
|     id|parsed_date|
+-------+-----------+
|1471783| 2017-12-18|
|1471885| 2017-12-18|
|1472928| 2017-12-19|
|1476917| 2017-12-19|
|1477469| 2017-12-21|
|1478190| 2017-12-21|
|1478570| 2017-12-19|
|1481415| 2017-12-21|
|1472592| 2017-12-20|
|1474023| 2017-12-22|
|1474029| 2017-12-22|
|1474067| 2017-12-24|
+-------+-----------+
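
For reference, a minimal snippet to rebuild this sample frame (the SparkSession variable `spark` is assumed here, as is conventional; it does not appear in the original post):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
rows = [
    (1471783, "2017-12-18"), (1471885, "2017-12-18"),
    (1472928, "2017-12-19"), (1476917, "2017-12-19"),
    (1477469, "2017-12-21"), (1478190, "2017-12-21"),
    (1478570, "2017-12-19"), (1481415, "2017-12-21"),
    (1472592, "2017-12-20"), (1474023, "2017-12-22"),
    (1474029, "2017-12-22"), (1474067, "2017-12-24"),
]
df = (spark.createDataFrame(rows, ["id", "parsed_date"])
           .withColumn("parsed_date", F.to_date("parsed_date")))  # string -> date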

I have the function below. The idea is to pass a date (day) and t (a number of days). In df1, ids are counted per date within the range (day - t, day); in df2, ids are counted per date within the range (day, day + t):

from pyspark.sql import functions as F, Window

def hypo_1(df, day, t):
    # df1: per-date id counts for the t days before day, i.e. [day - t, day - 1]
    df1 = (df.filter(f"parsed_date between '{day}' - interval {t} days and '{day}' - interval 1 day")
             .withColumn('count_before', F.count('id').over(Window.partitionBy('parsed_date')))
             .orderBy('parsed_date')
          )
    # df2: per-date id counts for the t days after day, i.e. [day + 1, day + t]
    df2 = (df.filter(f"parsed_date between '{day}' + interval 1 day and '{day}' + interval {t} days")
             .withColumn('count_after', F.count('id').over(Window.partitionBy('parsed_date')))
             .orderBy('parsed_date')
          )
    return [df1, df2]

df1, df2 = hypo_1(df, '2017-12-20', 2)
df1.show()
+-------+-----------+------------+
|     id|parsed_date|count_before|
+-------+-----------+------------+
|1471783| 2017-12-18|           2|
|1471885| 2017-12-18|           2|
|1472928| 2017-12-19|           3|
|1476917| 2017-12-19|           3|
|1478570| 2017-12-19|           3|
+-------+-----------+------------+

df2.show()
+-------+-----------+-----------+
|     id|parsed_date|count_after|
+-------+-----------+-----------+
|1481415| 2017-12-21|          3|
|1478190| 2017-12-21|          3|
|1477469| 2017-12-21|          3|
|1474023| 2017-12-22|          2|
|1474029| 2017-12-22|          2|
+-------+-----------+-----------+

I would like to know how to fix this code when dates are missing from the range. Say there are no records for 2017-12-22: is it possible to fall back to the nearest available dates instead? That is, if 2017-12-22 is absent and the next date after 2017-12-21 is 2017-12-24, can the function pick up 2017-12-24 in some way?
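
To make the problem concrete, a quick sketch (my own illustration, using the sample data above): dropping the 2017-12-22 rows leaves the interval filter with one fewer date, and 2017-12-24 is never picked up because it lies outside the 2-day window:

df_missing = df.filter("parsed_date != '2017-12-22'")  # simulate a missing date
df1, df2 = hypo_1(df_missing, '2017-12-20', 2)
df2.show()  # only the 2017-12-21 rows remain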

Thanks to mck for helping create the function hypo_1(df, day, t).


1 Answer

I removed the 2017-12-22 rows for illustration. The idea is to compute a dense_rank over the dates (descending for "before", ascending for "after") and keep only rank <= 2, i.e. the two nearest dates:

from pyspark.sql import functions as F, Window

def hypo_1(df, day, t):
    # two nearest distinct dates strictly before day
    df1 = (df.filter(f"parsed_date < '{day}'")
             .withColumn('rn', F.dense_rank().over(Window.orderBy(F.desc('parsed_date'))))
             .filter('rn <= 2')
             .drop('rn')
             .withColumn('count_before', F.count('id').over(Window.partitionBy('parsed_date')))
             .orderBy('parsed_date')
          )
    # two nearest distinct dates strictly after day
    df2 = (df.filter(f"parsed_date > '{day}'")
             .withColumn('rn', F.dense_rank().over(Window.orderBy('parsed_date')))
             .filter('rn <= 2')
             .drop('rn')
             .withColumn('count_after', F.count('id').over(Window.partitionBy('parsed_date')))
             .orderBy('parsed_date')
          )
    return [df1, df2]

df1, df2 = hypo_1(df, '2017-12-20', 2)
df1.show()
+-------+-----------+------------+
|     id|parsed_date|count_before|
+-------+-----------+------------+
|1471783| 2017-12-18|           2|
|1471885| 2017-12-18|           2|
|1472928| 2017-12-19|           3|
|1476917| 2017-12-19|           3|
|1478570| 2017-12-19|           3|
+-------+-----------+------------+

df2.show()
+-------+-----------+-----------+
|     id|parsed_date|count_after|
+-------+-----------+-----------+
|1477469| 2017-12-21|          3|
|1481415| 2017-12-21|          3|
|1478190| 2017-12-21|          3|
|1474067| 2017-12-24|          1|
+-------+-----------+-----------+
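
Note that the rank cutoff in this version is hard-coded to 2, so the t argument is effectively unused. If you want the t nearest dates on each side instead, a small variation (my sketch, not part of the original answer) is to filter on the parameter, e.g. for df2:

# Variation: keep the t nearest distinct dates after day instead of a fixed 2
df2 = (df.filter(f"parsed_date > '{day}'")
         .withColumn('rn', F.dense_rank().over(Window.orderBy('parsed_date')))
         .filter(f'rn <= {t}')
         .drop('rn')
         .withColumn('count_after', F.count('id').over(Window.partitionBy('parsed_date')))
         .orderBy('parsed_date'))

One caveat: Window.orderBy without a partitionBy moves all rows into a single partition (Spark logs a warning about this), which is fine for small frames like this one but can become a bottleneck at scale.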
