Removing duplicates from a dataframe in PySpark

# loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:\myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()

# loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()

# dropping duplicates from the dataframe
df1.dropDuplicates().show()

2 answers

User

#1 · Edited 2024-05-15 22:02:40

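The problem is a simple one: .dropDuplicates() is being called on the wrong object. sqlContext.createDataFrame(rdd1, ...) returns a pyspark.sql.dataframe.DataFrame, but once .collect() is applied the result is a plain Python list, and lists do not provide a dropDuplicates method. What you want is something like this: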

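 # build the DataFrame and deduplicate it before collecting anything
 df1 = (sqlContext
     .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
     .dropDuplicates())

 # collect only at the end, once the duplicates are gone
 df1.collect()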
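
To make the failure mode concrete without needing a Spark installation, here is a small plain-Python sketch. It assumes a made-up list of tuples standing in for what .collect() hands back (the sample values are hypothetical and not from the question). It shows that such a list has no dropDuplicates method, and that whole-row duplicates in an already collected list can be removed with dict.fromkeys. This is only an illustration of the idea, not how Spark deduplicates internally.

```python
# Hypothetical rows standing in for the result of .collect(): a plain Python list.
rows = [
    ("a", "1", "x", "10"),
    ("b", "2", "y", "20"),
    ("a", "1", "x", "10"),  # exact duplicate of the first row
]

# A list has no dropDuplicates method, which is what the question ran into.
has_method = hasattr(rows, "dropDuplicates")

# dict.fromkeys keeps the first occurrence of each row and preserves order.
unique_rows = list(dict.fromkeys(rows))

print(has_method)
print(unique_rows)
```

For data that does not fit on the driver this is not an option, so deduplicating on the DataFrame before collecting, as above, remains the usual approach.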

User

#2 · Edited 2024-05-15 22:02:40

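If you have a dataframe and want to remove all duplicates with respect to a specific column (called "colName"):

Count before deduplication:

df.count()

Perform the deduplication (casting the column to be deduplicated to string type first):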

from pyspark.sql.functions import col
df = df.withColumn('colName',col('colName').cast('string'))

# keep the deduplicated result so later checks run against it
df = df.drop_duplicates(subset=['colName'])
df.count()

You can check that the duplicates have been removed with a sorted groupby:

df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
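
The idea behind the steps above can be sketched in plain Python, with no Spark needed. The helper drop_duplicates_by_key and the sample rows are hypothetical names and data made up for illustration. The sketch keeps the first row seen for each key, which is an assumption of this example; as far as I know, Spark makes no promise about which of the duplicate rows survives. The Counter check plays the role of the groupBy count: after deduplication every key should appear exactly once.

```python
from collections import Counter


def drop_duplicates_by_key(rows, key_index):
    """Keep the first row seen for each value found at key_index."""
    seen = set()
    kept = []
    for row in rows:
        key = str(row[key_index])  # mirrors the cast to string in the answer
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical sample data: the first element of each tuple plays the role of colName.
rows = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (2, "e")]

deduped = drop_duplicates_by_key(rows, key_index=0)

# Occurrences per key before and after, analogous to groupBy('colName').count().
counts_before = Counter(str(r[0]) for r in rows)
counts_after = Counter(str(r[0]) for r in deduped)

print(len(rows), len(deduped))
print(counts_after.most_common())
```

If any key in counts_after has a count above 1, the deduplication did not take effect, which is the same signal the sorted groupby gives.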
