Posted 2024-05-15 02:05:02
User
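I want to do stratified sampling on a DataFrame in PySpark. There is a sampleBy(col, fractions, seed=None) function, but it seems to use only a single column as the stratum. Is there a way to use multiple columns as strata?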
Based on the answer here,
after converting it to Python, I think the answer might look like this:
#create a dataframe to use
df = sc.parallelize([
    (1,1234,282),(1,1396,179),(2,8620,178),(3,1620,191),(3,8820,828)
]).toDF(["ID","X","Y"])

#we are going to use the first two columns as our key (strata)
#assign sampling percentages to each key
# you could do something cooler here
fractions = df.rdd.map(lambda x: (x[0],x[1])).distinct().map(lambda x: (x,0.3)).collectAsMap()

#setup how we want to key the dataframe
kb = df.rdd.keyBy(lambda x: (x[0],x[1]))

#create a dataframe after sampling from our newly keyed rdd
#note, if the sample did not return any values you'll get a `ValueError: RDD is empty` error
sampleddf = kb.sampleByKey(False,fractions).map(lambda x: x[1]).toDF(df.columns)
sampleddf.show()
+---+----+---+
| ID|   X|  Y|
+---+----+---+
|  1|1234|282|
|  1|1396|179|
|  3|1620|191|
+---+----+---+

#other examples
kb.sampleByKey(False,fractions).map(lambda x: x[1]).toDF(df.columns).show()
+---+----+---+
| ID|   X|  Y|
+---+----+---+
|  2|8620|178|
+---+----+---+

kb.sampleByKey(False,fractions).map(lambda x: x[1]).toDF(df.columns).show()
+---+----+---+
| ID|   X|  Y|
+---+----+---+
|  1|1234|282|
|  1|1396|179|
+---+----+---+
Is this the kind of thing you are looking for?
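The three runs above return different rows, and one of them could just as well return none. A rough way to see why: sampling by key without replacement keeps each row independently with the probability assigned to its key, so the per-stratum sample size is only approximate. The sketch below imitates that idea in plain Python, without Spark. It is an illustration under that assumption, not Spark's actual implementation, and the names `stratified_sample` and `strata_key` are made up for this example.

```python
import random


def strata_key(row):
    # Hypothetical helper: the stratum is the pair of the first two columns (ID, X).
    return (row[0], row[1])


def stratified_sample(rows, key_func, fractions, seed=None):
    # Keep each row independently with probability fractions[key_func(row)].
    # This mimics sampling without replacement per key; sizes are not exact.
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions[key_func(row)]]


rows = [
    (1, 1234, 282),
    (1, 1396, 179),
    (2, 8620, 178),
    (3, 1620, 191),
    (3, 8820, 828),
]

# Same 0.3 fraction for every (ID, X) stratum, as in the answer above.
fractions = {strata_key(row): 0.3 for row in rows}

sample = stratified_sample(rows, strata_key, fractions, seed=42)
print(sample)
```

With only one row per (ID, X) pair and a fraction of 0.3, an empty result is fairly likely, which is the situation the `ValueError: RDD is empty` note in the answer warns about. Passing a fixed seed makes a run repeatable.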