How to add a row id to a pySpark DataFrame

from pyspark.sql import Row

# sc and sqlContext are the objects provided by the pyspark shell (Spark 1.x)
debug_csv_rdd = (sc.textFile("debug.csv")
    .filter(lambda x: x.find('header') == -1)   # drop the header line
    .map(lambda x: x.replace("NULL", "0"))
    .map(lambda p: p.split(','))
    .map(lambda x: Row(c1=int(x[0]), c2=int(x[1]), c3=int(x[2]), c4=int(x[3]))))

debug_csv_df = sqlContext.createDataFrame(debug_csv_rdd)
debug_csv_df.registerTempTable("debug_csv_table")
sqlContext.cacheTable("debug_csv_table")

r0 = sqlContext.sql("SELECT c2 FROM debug_csv_table WHERE c1 = 'str'")
r0.registerTempTable("r0_table")

# pair every value with its position, then turn the pairs back into Rows
r0_1 = (r0.flatMap(lambda x: x)
    .zipWithIndex()
    .map(lambda x: Row(c1=x[0], id=int(x[1]))))

r0_df = sqlContext.createDataFrame(r0_1)
r0_df.show(10)

1 Answer

User

#1 · Posted on 2024-05-19 00:40:44

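You can also use a function from the sql package. It generates a unique id for every row, but the ids are not consecutive, because they depend on the number of partitions. I believe it is available in Spark 1.5+.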

from pyspark.sql.functions import monotonicallyIncreasingId

# This will return a new DF with all the columns + id
res = df.withColumn("id", monotonicallyIncreasingId())
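To see why the ids are unique but not consecutive, here is a plain-Python sketch of the layout the Spark documentation describes for this function: the partition index goes in the upper 31 bits and the row's position within its partition in the lower 33 bits. The names `simulated_id` and `simulated_ids` are made up for this illustration and are not part of PySpark; no Spark installation is needed to run it.

```python
# Illustration only: mimics the documented id layout, it does not call Spark.
RECORD_BITS = 33  # lower 33 bits hold the row position inside a partition


def simulated_id(partition_index, row_in_partition):
    """Compose an id from a partition index and a row position."""
    return (partition_index << RECORD_BITS) + row_in_partition


def simulated_ids(partition_sizes):
    """Return the ids for partitions with the given row counts, in order."""
    return [
        simulated_id(p, r)
        for p, size in enumerate(partition_sizes)
        for r in range(size)
    ]


# Two partitions holding 2 and 3 rows respectively
ids = simulated_ids([2, 3])
print(ids)  # [0, 1, 8589934592, 8589934593, 8589934594]
```

The jump from 1 to 8589934592 (1 << 33) marks the start of the second partition, so the values increase and never collide, but they only form a gap-free sequence when all the data sits in a single partition.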

Edit: 19/1/2017

As commented by @Sean:

Use monotonically_increasing_id() instead for Spark 1.6 and later.
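If one script has to run against both the older and the newer name, one option is to look the function up by name at runtime. This is only a sketch: `resolve_id_function` is a hypothetical helper written for this answer, not something PySpark ships, and it simply checks which of the two names mentioned above the module exposes.

```python
def resolve_id_function(functions_module):
    """Return whichever id function the given module exposes.

    Prefers the newer snake_case name and falls back to the older camelCase one.
    """
    for name in ("monotonically_increasing_id", "monotonicallyIncreasingId"):
        fn = getattr(functions_module, name, None)
        if fn is not None:
            return fn
    raise AttributeError("no monotonically increasing id function found")


# Intended usage (assumes pyspark is installed and df is an existing DataFrame):
#   import pyspark.sql.functions as F
#   res = df.withColumn("id", resolve_id_function(F)())
```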
