How to remove duplicate values from an RDD [PySpark]

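User

#1 · edited 2024-05-01 22:05:57

If you want to remove all duplicates with respect to a particular column or set of columns, that is, perform a distinct over a subset of columns, PySpark provides the dropDuplicates function, which accepts the list of columns to deduplicate on.

For example: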

df.dropDuplicates(['value']).show()
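
To make the semantics concrete, here is a minimal plain-Python sketch (not Spark code) of what deduplicating on a subset of columns means. The helper name `drop_duplicates_by` and the sample rows are hypothetical and only serve the illustration. Spark does not promise which of the duplicate rows survives, whereas this sketch simply keeps the first one it encounters.

```python
def drop_duplicates_by(rows, columns):
    """Keep one row per distinct combination of values in `columns`.

    Hypothetical helper for illustration only; rows are plain dicts.
    """
    seen = set()
    kept = []
    for row in rows:
        # The key is built only from the columns we deduplicate on
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"id": 1, "value": "y"},
    {"id": 2, "value": "y"},
    {"id": 3, "value": "n"},
]

# Deduplicate on the "value" column only, similar in spirit to
# df.dropDuplicates(['value'])
deduped = drop_duplicates_by(rows, ["value"])
print(deduped)
```

When every column is listed, the result is the same as a full-row distinct. Note also that dropDuplicates is a DataFrame method, so an RDD would first have to be converted to a DataFrame (for example with toDF()) before it can be used.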

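User

#2 · edited 2024-05-01 22:05:57

I'm afraid I don't know Python at all, so all the references and code in this answer are Java-based. Translating it into Python should not be difficult, though.

Take a look at the official Spark documentation, which lists all the transformations and actions that Spark supports.

If I'm not mistaken, the best approach in your case is the distinct() transformation, which, as that documentation puts it, returns a new dataset containing the distinct elements of the source dataset. In Java it would look something like this: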

JavaPairRDD<Integer,String> myDataSet = //already obtained somewhere else
JavaPairRDD<Integer,String> distinctSet = myDataSet.distinct();

For example:

Partition 1:

1-y | 1-y | 1-y | 2-y
2-y | 2-n | 1-n | 1-n

Partition 2:

2-g | 1-y | 2-y | 2-n
1-y | 2-n | 1-n | 1-n

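would become something like this (which partition each element ends up in depends on the partitioner, so the layout below is only illustrative):

Partition 1:

1-y | 2-y | 2-g

Partition 2:

1-n | 2-n

Note that distinct() removes duplicates across the whole RDD, not within each partition separately: the data is shuffled so that equal elements meet in the same partition, and each distinct element appears exactly once in the result.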
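
As a rough plain-Python sketch (not Spark code) of how a distributed distinct can work: every element is routed to a target partition chosen from its hash, so equal elements always meet in the same partition, and each target partition then holds every value at most once. The function name `distinct_across_partitions` and the use of Python's built-in `hash` are assumptions made for illustration; Spark's actual partitioner and element placement differ.

```python
def distinct_across_partitions(partitions, num_partitions):
    """Sketch of a shuffle-based distinct over a list of partitions."""
    # "Shuffle": route each element to a target partition based on its
    # hash, so identical elements always land in the same place.
    shuffled = [set() for _ in range(num_partitions)]
    for partition in partitions:
        for element in partition:
            shuffled[hash(element) % num_partitions].add(element)
    # Sets already hold each value once; sort only for stable printing.
    return [sorted(p) for p in shuffled]


# The two input partitions from the example above
partition_1 = ["1-y", "1-y", "1-y", "2-y", "2-y", "2-n", "1-n", "1-n"]
partition_2 = ["2-g", "1-y", "2-y", "2-n", "1-y", "2-n", "1-n", "1-n"]

result = distinct_across_partitions([partition_1, partition_2], 2)

# Flatten the partitions to check that no value survives twice
all_elements = sorted(e for p in result for e in p)
print(all_elements)
```

Which partition a given element lands in can change from run to run here, because Python randomizes string hashes per process, but the flattened result always contains each of the five distinct values exactly once.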

User

#3 · edited 2024-05-01 22:05:57

This is easy to solve with the distinct operation from Apache Spark's pyspark library.

from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    # Set up a SparkContext for local testing
    conf = SparkConf().set("spark.driver.host", "localhost")
    sc = SparkContext(appName="distinctTuples", conf=conf)

    # Define the dataset
    dataset = [('1', 'y'), ('1', 'y'), ('1', 'y'), ('1', 'n'),
               ('1', 'n'), ('2', 'y'), ('2', 'n'), ('2', 'n')]

    # Parallelize and partition the dataset
    # so that the partitions can be operated
    # upon via multiple worker processes.
    allTuplesRdd = sc.parallelize(dataset, 4)

    # Filter out duplicates
    distinctTuplesRdd = allTuplesRdd.distinct()

    # Merge the results from all of the workers
    # into the driver process.
    distinctTuples = distinctTuplesRdd.collect()

    print('Output: %s' % distinctTuples)

    # Release the SparkContext once the job is done
    sc.stop()

This prints something like the following (the order of the tuples may vary):

Output: [('1', 'y'), ('1', 'n'), ('2', 'y'), ('2', 'n')]
