Spark window function per time period

I have a dataframe with the following structure:

|ID|Page     |User             |Timestamp          |
|1|Page 1   |Ericd            |2002-09-07 19:39:55|
|1|Page 1   |Liir             |2002-10-12 03:01:42|
|1|Page 1   |Tubby            |2002-10-12 03:02:23|
|1|Page 1   |Mojo             |2002-10-12 03:18:24|
|1|Page 1   |Kirf             |2002-10-12 03:19:03|
|2|Page 2   |The Epopt        |2001-11-28 22:27:37|
|2|Page 2   |Conversion script|2002-02-03 01:49:16|
|2|Page 2   |Bryan Derksen    |2002-02-25 16:51:15|
|2|Page 2   |Gear             |2002-10-04 12:46:06|
|2|Page 2   |Tim Starling     |2002-10-06 08:13:42|
|2|Page 2   |Tim Starling     |2002-10-07 03:00:54|
|2|Page 2   |Salsa Shark      |2003-03-18 01:45:32|

I want to find the number of users who visited these pages within a given period (for example, each month). For example, the result for October 2002 would be

|1|Page 1   |Liir             |2002-10-12 03:01:42| 
|1|Page 1   |Tubby            |2002-10-12 03:02:23|
|1|Page 1   |Mojo             |2002-10-12 03:18:24|
|1|Page 1   |Kirf             |2002-10-12 03:19:03|
|2|Page 2   |Gear             |2002-10-04 12:46:06|
|2|Page 2   |Tim Starling     |2002-10-06 08:13:42|
|2|Page 2   |Tim Starling     |2002-10-07 03:00:54|

And the counts per page:

              numberOfUsers (in October 2002)
|1|Page 1   |      4
|2|Page 2   |      3 

The question is also how to apply this logic to every month of every year. I have figured out how to find, for example, events from the last n days:

from pyspark.sql import Window, functions as func
from pyspark.sql.functions import col

days = lambda i: i * 86400  # i days in seconds

# per-page window covering the previous 30 days relative to each row
window = (Window.partitionBy(col("page"))
          .orderBy(col("timestamp").cast("timestamp").cast("long"))
          .rangeBetween(-days(30), 0))

df = df.withColumn("monthly_occurrences", func.count("user").over(window))
df.show()

I would appreciate any suggestions.


2 Answers

You can first create a column containing the year-month combination and then group by that column. A working example:

import pyspark.sql.functions as F

df = sc.parallelize([
    ('2018-06-02T00:00:00.000Z','tim', 'page 1' ),
    ('2018-07-20T00:00:00.000Z','tim', 'page 1' ),
    ('2018-07-20T00:00:00.000Z','john', 'page 2' ),
    ('2018-07-20T00:00:00.000Z','john', 'page 2' ),
    ('2018-08-20T00:00:00.000Z','john', 'page 2' )
]).toDF(("datetime","user","page" ))

df = df.withColumn('yearmonth',F.concat(F.year('datetime'),F.lit('-'),F.month('datetime')))    
df_agg = df.groupBy('yearmonth','page').count()
df_agg.show()

Output:

+---------+------+-----+
|yearmonth|  page|count|
+---------+------+-----+
|   2018-7|page 2|    2|
|   2018-6|page 1|    1|
|   2018-7|page 1|    1|
|   2018-8|page 2|    1|
+---------+------+-----+
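
Note that count() counts rows, so a user who edits the same page twice in a month is counted twice. If the goal is the number of distinct users per page (as in the question's expected output), a possible variant of the same grouping is sketched below, swapping in countDistinct and using date_format for a zero-padded year-month:

# sketch: distinct users per page and calendar month
# date_format yields a zero-padded 'yyyy-MM', e.g. '2018-07'
df_agg = df \
    .withColumn('yearmonth', F.date_format('datetime', 'yyyy-MM')) \
    .groupBy('yearmonth', 'page') \
    .agg(F.countDistinct('user').alias('numberOfUsers'))
df_agg.show()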

Hope this helps!

If you are looking for dynamic periods, first convert the dates to timestamps, subtract each timestamp from today's, and then integer-divide by the length (in seconds) of the interval you want to group by. The code below groups rows into 5-day intervals.

import pyspark.sql.functions as F
from datetime import datetime

# today's date as a Unix timestamp (in seconds)
Today = datetime.today().timestamp()
# number of seconds in one day
DAY_TIMESTAMPS = 24 * 60 * 60

df = sc.parallelize([
    ('2017-06-02 00:00:00','tim', 'page 1' ),
    ('2017-07-20 00:00:00','tim', 'page 1' ),
    ('2017-07-21 00:00:00','john', 'page 2' ),
    ('2017-07-22 00:00:00','john', 'page 2' ),
    ('2017-08-23 00:00:00','john', 'page 2' )
]).toDF(("datetime","user","page" ))

# group by five days
timeInterval = 5* DAY_TIMESTAMPS

df \
    .withColumn('timestamp', F.unix_timestamp(F.to_date('datetime', 'yyyy-MM-dd HH:mm:ss'))) \
    .withColumn('timeIntervalBefore', ((Today - F.col('timestamp')) / timeInterval).cast('integer')) \
    .groupBy('timeIntervalBefore', 'page') \
    .agg(F.count('user').alias('number of users')).show()

Result:

+------------------+------+---------------+
|timeIntervalBefore|  page|number of users|
+------------------+------+---------------+
|                70|page 2|              2|
|                80|page 1|              1|
|                70|page 1|              1|
|                64|page 2|              1|
+------------------+------+---------------+

If you need the approximate dates of each period:

df \
    .withColumn('timestamp', F.unix_timestamp(F.to_date('datetime', 'yyyy-MM-dd HH:mm:ss'))) \
    .withColumn('timeIntervalBefore', ((Today-F.col('timestamp'))/(timeInterval)).cast('integer')) \
    .groupBy('timeIntervalBefore', 'page') \
    .agg(
        F.count('user').alias('number_of_users'), 
        F.min('timestamp').alias('FirstDay'), 
        F.max('timestamp').alias('LastDay')) \
    .select(
        'page', 
        'number_of_users', 
        F.from_unixtime('FirstDay').alias('firstDay'), 
        F.from_unixtime('LastDay').alias('lastDay')).show()

Result:

+------+---------------+-------------------+-------------------+
|  page|number_of_users|           firstDay|            lastDay|
+------+---------------+-------------------+-------------------+
|page 2|              2|2017-07-21 00:00:00|2017-07-22 00:00:00|
|page 1|              1|2017-06-02 00:00:00|2017-06-02 00:00:00|
|page 1|              1|2017-07-20 00:00:00|2017-07-20 00:00:00|
|page 2|              1|2017-08-23 00:00:00|2017-08-23 00:00:00|
+------+---------------+-------------------+-------------------+
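
If you also want distinct users over the rolling 30-day window from the question's own attempt: countDistinct is not supported over a window, but collect_set is, and the size of the collected set gives the number of distinct users per page in the preceding 30 days. A minimal sketch along those lines (assuming the original columns page, user and a string timestamp):

import pyspark.sql.functions as F
from pyspark.sql import Window

days = lambda i: i * 86400  # i days in seconds

# per page, a range window covering the previous 30 days up to the current row
w = (Window.partitionBy('page')
     .orderBy(F.col('timestamp').cast('timestamp').cast('long'))
     .rangeBetween(-days(30), 0))

df = df.withColumn('users_last_30_days',
                   F.size(F.collect_set('user').over(w)))
df.show()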
