PySpark sampleBy using multiple columns

Posted 2024-05-15 02:05:02


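I want to do stratified sampling on a PySpark DataFrame. There is a sampleBy(col, fractions, seed=None) function, but it seems to accept only a single column as the stratum. Is there a way to use multiple columns as strata?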


1 answer
User
#1 · posted 2024-05-15 02:05:02

This is based on the answer here.

After converting it to Python, I think the answer could look like this:

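from pyspark.sql import SparkSession

# get (or create) a session; sc is its SparkContext
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# create a dataframe to use
df = sc.parallelize([(1, 1234, 282), (1, 1396, 179), (2, 8620, 178), (3, 1620, 191), (3, 8820, 828)]).toDF(["ID", "X", "Y"])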

# we are going to use the first two columns (ID, X) as our key (strata)
# assign a sampling fraction to each distinct key; every key gets 0.3 here,
# but you could assign a different fraction to each one
fractions = df.rdd.map(lambda x: (x[0], x[1])).distinct().map(lambda x: (x, 0.3)).collectAsMap()

# key every row of the dataframe by the (ID, X) tuple
kb = df.rdd.keyBy(lambda x: (x[0], x[1]))

# sample from the keyed rdd without replacement, drop the keys again,
# and turn the remaining rows back into a dataframe
# note: if the sample did not return any rows you'll get a `ValueError: RDD is empty` error
sampleddf = kb.sampleByKey(False, fractions).map(lambda x: x[1]).toDF(df.columns)
sampleddf.show()
+---+----+---+
| ID|   X|  Y|
+---+----+---+
|  1|1234|282|
|  1|1396|179|
|  3|1620|191|
+---+----+---+
# other examples
kb.sampleByKey(False, fractions).map(lambda x: x[1]).toDF(df.columns).show()
+---+----+---+
| ID|   X|  Y|
+---+----+---+
|  2|8620|178|
+---+----+---+


kb.sampleByKey(False, fractions).map(lambda x: x[1]).toDF(df.columns).show()
+---+----+---+
| ID|   X|  Y|
+---+----+---+
|  1|1234|282|
|  1|1396|179|
+---+----+---+
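
The three runs above return different rows, and different numbers of rows, because no seed was passed. As far as I understand it, sampling without replacement keeps each row independently with the probability assigned to its key, so the fractions are expected rates rather than exact counts. Here is a minimal pure-Python sketch of that idea with no Spark involved; sample_by_key and key_fn are hypothetical names used only for illustration and are not part of PySpark.

```python
import random


def sample_by_key(rows, key_fn, fractions, seed=None):
    """Keep each row independently with the probability of its stratum.

    Illustrative only: this mimics per-key Bernoulli sampling, it is not
    Spark's implementation. Keys missing from `fractions` are never kept.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(key_fn(row), 0.0)]


# Same rows as the DataFrame above; the stratum is the (ID, X) tuple.
rows = [(1, 1234, 282), (1, 1396, 179), (2, 8620, 178), (3, 1620, 191), (3, 8820, 828)]


def key_fn(row):
    return (row[0], row[1])


fractions = {key_fn(row): 0.3 for row in rows}

# Different seeds generally give different subsets; a fixed seed is repeatable.
for seed in (0, 1, 2):
    print(seed, sample_by_key(rows, key_fn, fractions, seed=seed))
```

Passing a seed to sampleByKey should make the Spark runs repeatable in the same way.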

Is this the kind of thing you are looking for?
