Counting ids in a date range in PySpark

I have a PySpark DataFrame with the columns parsed_date (dtype: date) and id (dtype: bigint), as shown below:

+-------+-----------+
|     id|parsed_date|
+-------+-----------+
|1471783| 2017-12-18|
|1471885| 2017-12-18|
|1472928| 2017-12-19|
|1476917| 2017-12-19|
|1477469| 2017-12-21|
|1478190| 2017-12-21|
|1478570| 2017-12-19|
|1481415| 2017-12-21|
|1472592| 2017-12-20|
|1474023| 2017-12-22|
|1474029| 2017-12-22|
|1474067| 2017-12-24|
+-------+-----------+
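
For reference, a minimal snippet to rebuild this sample frame (the SparkSession variable `spark` is assumed here, as is conventional; it does not appear in the original post):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
rows = [
    (1471783, "2017-12-18"), (1471885, "2017-12-18"),
    (1472928, "2017-12-19"), (1476917, "2017-12-19"),
    (1477469, "2017-12-21"), (1478190, "2017-12-21"),
    (1478570, "2017-12-19"), (1481415, "2017-12-21"),
    (1472592, "2017-12-20"), (1474023, "2017-12-22"),
    (1474029, "2017-12-22"), (1474067, "2017-12-24"),
]
df = (spark.createDataFrame(rows, ["id", "parsed_date"])
           .withColumn("parsed_date", F.to_date("parsed_date")))  # string -> date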

I have the function below. The idea is to pass a date (day) and t (a number of days). In df1, ids are counted per date within the range (day - t, day); in df2, ids are counted per date within the range (day, day + t):

from pyspark.sql import functions as F, Window

def hypo_1(df, day, t):
    # df1: per-date id counts for the t days before day, i.e. [day - t, day - 1]
    df1 = (df.filter(f"parsed_date between '{day}' - interval {t} days and '{day}' - interval 1 day")
             .withColumn('count_before', F.count('id').over(Window.partitionBy('parsed_date')))
             .orderBy('parsed_date')
          )
    # df2: per-date id counts for the t days after day, i.e. [day + 1, day + t]
    df2 = (df.filter(f"parsed_date between '{day}' + interval 1 day and '{day}' + interval {t} days")
             .withColumn('count_after', F.count('id').over(Window.partitionBy('parsed_date')))
             .orderBy('parsed_date')
          )
    return [df1, df2]

df1, df2 = hypo_1(df, '2017-12-20', 2)
df1.show()
+-------+-----------+------------+
|     id|parsed_date|count_before|
+-------+-----------+------------+
|1471783| 2017-12-18|           2|
|1471885| 2017-12-18|           2|
|1472928| 2017-12-19|           3|
|1476917| 2017-12-19|           3|
|1478570| 2017-12-19|           3|
+-------+-----------+------------+

df2.show()
+-------+-----------+-----------+
|     id|parsed_date|count_after|
+-------+-----------+-----------+
|1481415| 2017-12-21|          3|
|1478190| 2017-12-21|          3|
|1477469| 2017-12-21|          3|
|1474023| 2017-12-22|          2|
|1474029| 2017-12-22|          2|
+-------+-----------+-----------+

I would like to know how to fix this code when dates are missing from the range. Say there are no records for 2017-12-22: is it possible to fall back to the nearest available dates instead? That is, if 2017-12-22 is absent and the next date after 2017-12-21 is 2017-12-24, can the function pick up 2017-12-24 in some way?
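
To make the problem concrete, a quick sketch (my own illustration, using the sample data above): dropping the 2017-12-22 rows leaves the interval filter with one fewer date, and 2017-12-24 is never picked up because it lies outside the 2-day window:

df_missing = df.filter("parsed_date != '2017-12-22'")  # simulate a missing date
df1, df2 = hypo_1(df_missing, '2017-12-20', 2)
df2.show()  # only the 2017-12-21 rows remain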

Thanks to mck for helping create the function hypo_1(df, day, t).


1 Answer

I removed the 2017-12-22 rows for illustration. The idea is to compute a dense_rank over the dates (descending for "before", ascending for "after") and keep only rank <= 2, i.e. the two nearest dates:

from pyspark.sql import functions as F, Window

def hypo_1(df, day, t):
    # two nearest distinct dates strictly before day
    df1 = (df.filter(f"parsed_date < '{day}'")
             .withColumn('rn', F.dense_rank().over(Window.orderBy(F.desc('parsed_date'))))
             .filter('rn <= 2')
             .drop('rn')
             .withColumn('count_before', F.count('id').over(Window.partitionBy('parsed_date')))
             .orderBy('parsed_date')
          )
    # two nearest distinct dates strictly after day
    df2 = (df.filter(f"parsed_date > '{day}'")
             .withColumn('rn', F.dense_rank().over(Window.orderBy('parsed_date')))
             .filter('rn <= 2')
             .drop('rn')
             .withColumn('count_after', F.count('id').over(Window.partitionBy('parsed_date')))
             .orderBy('parsed_date')
          )
    return [df1, df2]

df1, df2 = hypo_1(df, '2017-12-20', 2)
df1.show()
+-------+-----------+------------+
|     id|parsed_date|count_before|
+-------+-----------+------------+
|1471783| 2017-12-18|           2|
|1471885| 2017-12-18|           2|
|1472928| 2017-12-19|           3|
|1476917| 2017-12-19|           3|
|1478570| 2017-12-19|           3|
+-------+-----------+------------+

df2.show()
+-------+-----------+-----------+
|     id|parsed_date|count_after|
+-------+-----------+-----------+
|1477469| 2017-12-21|          3|
|1481415| 2017-12-21|          3|
|1478190| 2017-12-21|          3|
|1474067| 2017-12-24|          1|
+-------+-----------+-----------+
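
Note that the rank cutoff in this version is hard-coded to 2, so the t argument is effectively unused. If you want the t nearest dates on each side instead, a small variation (my sketch, not part of the original answer) is to filter on the parameter, e.g. for df2:

# Variation: keep the t nearest distinct dates after day instead of a fixed 2
df2 = (df.filter(f"parsed_date > '{day}'")
         .withColumn('rn', F.dense_rank().over(Window.orderBy('parsed_date')))
         .filter(f'rn <= {t}')
         .drop('rn')
         .withColumn('count_after', F.count('id').over(Window.partitionBy('parsed_date')))
         .orderBy('parsed_date'))

One caveat: Window.orderBy without a partitionBy moves all rows into a single partition (Spark logs a warning about this), which is fine for small frames like this one but can become a bottleneck at scale.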
