# Generate a 13 x 10 array and create an RDD with 13 records,
# each record a list of 10 elements
rdd = sc.parallelize([range(10) for i in range(13)])
def make_selector(cols):
    """Use a closure to configure the select_cols function.

    :param cols: list - column indexes to select from every record
    """
    def select_cols(record):
        return [record[c] for c in cols]
    return select_cols
>>> s = make_selector([1, 2])
>>> s([0, 1, 2])
[1, 2]
rdd.map(make_selector([0, 3, 9])).take(5)
It sounds like you need to filter columns, not records. To do that, use Spark's map function to transform each row of the array represented as an RDD. See my example above.
Result:
[[0, 3, 9], [0, 3, 9], [0, 3, 9], [0, 3, 9], [0, 3, 9]]
An alternative, essentially the same as @vvladymyrov's answer but without a closure, is to inline the column list in a lambda passed to map.
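The closure-free variant might be sketched as follows. Since the original snippet was lost, this is a reconstruction under the same assumptions as the answer above (a SparkContext `sc` and 13 records of `range(10)`); the selection logic is also shown against a plain Python list so it runs without Spark:

```python
# Same 13 x 10 data as in the answer above
data = [list(range(10)) for _ in range(13)]

# With Spark, the closure-free version would be (assuming `sc` exists):
#   rdd = sc.parallelize(data)
#   rdd.map(lambda record: [record[c] for c in (0, 3, 9)]).take(5)

# Equivalent local computation: select columns 0, 3, 9 from each record,
# then keep the first 5 results (mirroring take(5))
selected = [[record[c] for c in (0, 3, 9)] for record in data][:5]
print(selected)  # [[0, 3, 9], [0, 3, 9], [0, 3, 9], [0, 3, 9], [0, 3, 9]]
```

The trade-off versus `make_selector` is that the column indexes are hard-coded in the lambda rather than configurable through a closure.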