How do I compute the cumulative sum of a column and store it in a new column?

Location  Month     Brand   Sector  TrueValue   PickoutValue
USA       1/1/2021  brand1  cars1   7418        30000
USA       2/1/2021  brand1  cars1   1940        2000
USA       3/1/2021  brand1  cars1   4692        2900
USA       4/1/2021  brand1  cars1
USA       1/1/2021  brand2  cars2   16383104.2  16666667
USA       2/1/2021  brand2  cars2   26812874.2  16666667
USA       3/1/2021  brand2  cars2
USA       1/1/2021  brand3  cars3   75.6%       70.0%
USA       3/1/2021  brand3  cars3   73.1%       70.0%
USA       2/1/2021  brand3  cars3   77.1%       70.0%

Location  Month     Brand   Sector  TrueValue   PickoutValue  TotalSumValue
USA       1/1/2021  brand1  cars1   7418        30000         7418
USA       2/1/2021  brand1  cars1   1940        2000          9358
USA       3/1/2021  brand1  cars1   4692        2900          14050
USA       4/1/2021  brand1  cars1                             14050
USA       1/1/2021  brand2  cars2   16383104.2  16666667      16383104.2
USA       2/1/2021  brand2  cars2   26812874.2  16666667      43195978.4
USA       3/1/2021  brand2  cars2                             43195978.4
USA       1/1/2021  brand3  cars3   75.6%       70.0%         75.6%
USA       3/1/2021  brand3  cars3   73.1%       70.0%         76.3%
USA       2/1/2021  brand3  cars3   77.1%       70.0%         75.3%

df = df.withColumn("month_in_timestamp", to_timestamp(df.Month, 'dd/MM/yyyy'))
windowval = (Window.partitionBy('Brand','Sector').orderBy('Month')
             .rangeBetween(Window.unboundedPreceding, 0))
df1 = df1.withColumn('TotalSumValue', F.sum('TrueValue').over(windowval))

1 Answer

User

#1 · Posted 2024-05-29 10:24:40

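The expected result for the % values looks like a cumulative average rather than a cumulative sum. If so, you can apply a cumulative sum to the values that do not contain % and a cumulative average to the values that do (strip the percent sign before computing). Both calculations can be combined with when/otherwise: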

import pyspark.sql.functions as F
from pyspark.sql.window import Window

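# 'd/M/yyyy' accepts one- or two-digit day and month values; on Spark 3.x the
# default parser may reject strings such as '1/1/2021' under 'dd/MM/yyyy'
df = df.withColumn("month_in_timestamp", F.to_timestamp(F.col("Month"), 'd/M/yyyy'))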

# use 'month_in_timestamp' instead of 'month' 
windowval = (Window.partitionBy('Brand','Sector').orderBy('month_in_timestamp')
             .rangeBetween(Window.unboundedPreceding, 0))

df = df.withColumn("TotalSumValue", 
                   F.when(F.col("TrueValue").contains("%"), 
                          F.concat(F.avg(F.expr("replace(TrueValue, '%', '')")).over(windowval).cast("decimal(4,1)"), F.lit("%")))
                    .otherwise(F.sum('TrueValue').over(windowval).cast("decimal(13,1)")))

df.show()

# +--------+--------+------+------+----------+------------+-------------------+-------------+
# |Location|   Month| Brand|Sector| TrueValue|PickoutValue| month_in_timestamp|TotalSumValue|
# +--------+--------+------+------+----------+------------+-------------------+-------------+
# |     USA|1/1/2021|brand1| cars1|      7418|       30000|2021-01-01 00:00:00|       7418.0|
# |     USA|2/1/2021|brand1| cars1|      1940|        2000|2021-01-02 00:00:00|       9358.0|
# |     USA|3/1/2021|brand1| cars1|      4692|        2900|2021-01-03 00:00:00|      14050.0|
# |     USA|4/1/2021|brand1| cars1|      null|        null|2021-01-04 00:00:00|      14050.0|
# |     USA|1/1/2021|brand2| cars2|16383104.2|    16666667|2021-01-01 00:00:00|   16383104.2|
# |     USA|2/1/2021|brand2| cars2|26812874.2|    16666667|2021-01-02 00:00:00|   43195978.4|
# |     USA|3/1/2021|brand2| cars2|      null|        null|2021-01-03 00:00:00|   43195978.4|
# |     USA|1/1/2021|brand3| cars3|     75.6%|       70.0%|2021-01-01 00:00:00|        75.6%|
# |     USA|2/1/2021|brand3| cars3|     77.1%|       70.0%|2021-01-02 00:00:00|        76.4%|
# |     USA|3/1/2021|brand3| cars3|     73.1%|       70.0%|2021-01-03 00:00:00|        75.3%|
# +--------+--------+------+------+----------+------------+-------------------+-------------+
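
To sanity-check the numbers in the table above without a Spark session, the two calculations can be reproduced in plain Python. This is only an illustrative sketch, not part of the original answer: the helper names `running_totals` and `running_percent_means` are made up for this example, and it assumes the rows of one Brand/Sector group are already sorted by date. `Decimal` with half-up rounding is used so that 76.35 becomes 76.4, as in the Spark output.

```python
from decimal import Decimal, ROUND_HALF_UP


def running_totals(values):
    """Cumulative sum over numeric strings; None adds nothing, so the last total carries forward.

    Simplification: a leading None yields 0 here, whereas Spark's sum would return null.
    """
    totals, total = [], Decimal("0")
    for value in values:
        if value is not None:
            total += Decimal(value)
        totals.append(total)
    return totals


def running_percent_means(values):
    """Cumulative mean of strings like '75.6%', rounded half-up to one decimal place."""
    means, total = [], Decimal("0")
    for count, value in enumerate(values, start=1):
        total += Decimal(value.rstrip("%"))  # drop the percent sign before computing
        mean = (total / count).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
        means.append(f"{mean}%")
    return means


# Hypothetical per-group inputs, copied from the brand1 and brand3 rows in date order
brand1 = running_totals(["7418", "1940", "4692", None])
brand3 = running_percent_means(["75.6%", "77.1%", "73.1%"])

print([str(total) for total in brand1])
print(brand3)
```

The brand1 list ends with 14050 twice because the empty row contributes nothing to the sum, which mirrors how the null row keeps the previous total in the Spark result.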
