PySpark: using window functions

+--------+----+----------+
|movie_id|year|categories|
+--------+----+----------+
|     122|1990|    Comedy|
|     122|1990|   Romance|
|     185|1990|    Action|
|     185|1990|     Crime|
|     185|1990|  Thriller|
|     231|1990|    Comedy|
|     292|1990|    Action|
|     292|1990|     Drama|
|     292|1990|    Sci-Fi|
|     292|1990|  Thriller|
|     316|1990|    Action|
|     316|1990| Adventure|
|     316|1990|    Sci-Fi|
|     329|1990|    Action|
|     329|1990| Adventure|
|     329|1990|     Drama|
. . .

+-----------------------------------+
| year | category | movie_id | rank |
+-----------------------------------+
| 1990 | Comedy   | 1273     | 1    |
| 1990 | Comedy   | 6547     | 2    |
| 1990 | Comedy   | 8973     | 3    |
.
.
| 1990 | Comedy   | 7483     | 10   |
.
.
| 1990 | Drama    | 1273     | 1    |
| 1990 | Drama    | 6547     | 2    |
| 1990 | Drama    | 8973     | 3    |
.
.
| 1990 | Comedy   | 7483     | 10   |
.
.
| 2000 | Comedy   | 1273     | 1    |
| 2000 | Comedy   | 6547     | 2    |
.
.

For every decade, the top 10 movies in each category.

windowSpec = Window.partitionBy(res_agg['year']).orderBy(res_agg['categories'].desc())
final = res_agg.select(res_agg['year'], res_agg['movie_id'], res_agg['categories']).withColumn('rank', func.rank().over(windowSpec))

+----+--------+------------------+----+
|year|movie_id|        categories|rank|
+----+--------+------------------+----+
|2000|    8606|(no genres listed)|   1|
|2000|    1587|            Action|   1|
|2000|    1518|            Action|   1|
|2000|    2582|            Action|   1|
|2000|    5460|            Action|   1|
|2000|   27611|            Action|   1|
|2000|   48304|            Action|   1|
|2000|   54995|            Action|   1|
|2000|    4629|            Action|   1|
|2000|   26606|            Action|   1|
|2000|   56775|            Action|   1|
|2000|   62008|            Action|   1|

1 Answer

User

#1 · Posted on 2024-06-16 08:25:18

You are right that you need a window, but first you need an aggregation step to compute the frequencies.

First, let's compute the decade.

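from pyspark.sql.functions import col, concat, lit, substring
df_decade = df.withColumn("decade", concat(substring(col("year"), 1, 3), lit("0")))  # substring is 1-based; 1994 -> "1990"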
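As a quick sanity check on the decade logic, here is a small plain-Python sketch (no Spark needed). The two helper names are made up for illustration. It compares the string approach used above with an integer alternative; the string version assumes four-digit years and yields a string, while the arithmetic version yields a number. In PySpark the arithmetic form could presumably be written as floor(col("year") / 10) * 10, if a numeric decade column is preferred.

```python
def decade_by_string(year):
    # Mirrors concat(substring(year, 1, 3), "0"): keep three digits, append "0".
    return str(year)[:3] + "0"


def decade_by_arithmetic(year):
    # Integer alternative: drop the last digit, then scale back up.
    return (year // 10) * 10


for y in (1990, 1994, 1999, 2000, 2007):
    print(y, decade_by_string(y), decade_by_arithmetic(y))
```

Both agree for any four-digit year, so the choice mostly comes down to whether the decade column should be a string or a number.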

Then we compute the frequency per decade, category, and movie id:

from pyspark.sql.functions import col, count
agg_df = df_decade\
      .groupBy("decade", "category", "movie_id")\
      .agg(count(col("*")).alias("freq"))

Finally, we define a window partitioned by decade and category, and use the rank function to select the top 10:

from pyspark.sql import Window
from pyspark.sql.functions import col, desc, rank
w = Window.partitionBy("decade", "category").orderBy(desc("freq"))
top10 = agg_df.withColumn("r", rank().over(w)).where(col("r") <= 10)
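To see what these three steps compute without starting a Spark session, here is a minimal plain-Python sketch that mirrors them on a few made-up rows. The data and the top_n helper are hypothetical and only for illustration. It assumes one input row per event to be counted, and a column named category (the sample in the question calls it categories, so rename as needed).

```python
from collections import Counter
from itertools import groupby

# Made-up (movie_id, year, category) rows; not taken from the question's data.
rows = [
    (1, 1991, "Comedy"), (1, 1991, "Comedy"), (1, 1993, "Comedy"),
    (2, 1995, "Comedy"), (2, 1996, "Comedy"), (2, 1999, "Comedy"),
    (3, 1994, "Comedy"),
    (4, 2001, "Comedy"),
]

# Step 1: decade = first three digits of the year followed by "0".
with_decade = [(str(year)[:3] + "0", category, movie_id)
               for movie_id, year, category in rows]

# Step 2: frequency per (decade, category, movie_id), like groupBy(...).agg(count(...)).
freq = Counter(with_decade)


def top_n(freq, n):
    """Emulate rank() over (partition by decade, category; order by freq desc), keeping r <= n."""
    result = []
    ordered = sorted(freq.items(), key=lambda kv: (kv[0][0], kv[0][1], -kv[1]))
    for _, group in groupby(ordered, key=lambda kv: kv[0][:2]):
        prev_count, rank = None, 0
        for position, ((decade, category, movie_id), count) in enumerate(group, start=1):
            if count != prev_count:  # tied rows keep the rank of the first tied row
                rank, prev_count = position, count
            if rank <= n:
                result.append((decade, category, movie_id, count, rank))
    return result


top1 = top_n(freq, 1)
for row in top1:
    print(row)
```

Note that movies 1 and 2 tie at rank 1 in the 1990 partition, so asking for the top 1 returns two rows there. The same holds for rank() in Spark: filtering on r <= 10 can return more than 10 rows per partition when frequencies tie. If exactly 10 rows are wanted, row_number() is the usual substitute, at the cost of breaking ties arbitrarily.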
