How do I remove duplicates when using write.partitionBy on a PySpark DataFrame?

Posted 2024-06-16 11:35:54

Male | A programmer who enjoys writing Python code.

I have a DataFrame that looks like this:

|------------|-----------|---------------|---------------|
|    Name    |   Type    |  Attribute 1  |  Attribute 2  |
|------------|-----------|---------------|---------------|
|   Roger    |     A     |     X         |       Y       |
|------------|-----------|---------------|---------------|
|   Roger    |     A     |     X         |       Y       |
|------------|-----------|---------------|---------------|
|   Roger    |     A     |     X         |       Y       |
|------------|-----------|---------------|---------------|
|   Rafael   |     A     |     G         |       H       |
|------------|-----------|---------------|---------------|
|   Rafael   |     A     |     G         |       H       |
|------------|-----------|---------------|---------------|
|   Rafael   |     B     |     G         |       H       |
|------------|-----------|---------------|---------------|

I want to partition this DataFrame by Name and Type and save it to disk.

The line of code currently looks like this:

df.write.partitionBy("Name", "Type").mode("append").csv("output/", header=True)

The output is saved correctly, but each folder contains duplicate rows, as shown below.

/output/Roger/A

|---------------|---------------|
|  Attribute 1  |  Attribute 2  |
|---------------|---------------|
|     X         |       Y       |
|---------------|---------------|
|     X         |       Y       |
|---------------|---------------|
|     X         |       Y       |
|---------------|---------------|

/output/Rafael/A

|---------------|---------------|
|  Attribute 1  |  Attribute 2  |
|---------------|---------------|
|     G         |       H       |
|---------------|---------------|
|     G         |       H       |
|---------------|---------------|

/output/Rafael/B

|---------------|---------------|
|  Attribute 1  |  Attribute 2  |
|---------------|---------------|
|     G         |       H       |
|---------------|---------------|

As you can see, these CSV files contain duplicates. How can I remove them when using write.partitionBy?

Tags: csv, data code, name, name type, df, output

1 answer

User

#1 · Posted 2024-06-16 11:35:54

Call .distinct() on the DataFrame before writing, so that fully identical rows are dropped:

df.distinct().write.partitionBy("Name", "Type").mode("append").csv("output/", header=True)
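For anyone who wants this as a reusable piece, here is a minimal sketch of the same idea wrapped in a helper. The names `write_deduplicated` and `demo` are made up for illustration and are not part of PySpark; the sketch uses `dropDuplicates()` with no arguments, which compares all columns just like `distinct()`. The `demo` function assumes a local PySpark installation and rebuilds the sample data from the question.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Sequence

if TYPE_CHECKING:  # only needed for type hints
    from pyspark.sql import DataFrame


def write_deduplicated(
    df: DataFrame,
    path: str,
    partition_cols: Sequence[str],
    mode: str = "append",
) -> None:
    """Hypothetical helper: drop fully identical rows, then write partitioned CSV."""
    (
        df.dropDuplicates()  # no subset given, so all columns are compared
        .write.partitionBy(*partition_cols)
        .mode(mode)
        .csv(path, header=True)
    )


def demo() -> None:
    """Rebuild the question's sample data and write it without duplicates."""
    from pyspark.sql import SparkSession  # assumes PySpark is installed locally

    spark = (
        SparkSession.builder.master("local[1]").appName("dedup-demo").getOrCreate()
    )
    rows = [
        ("Roger", "A", "X", "Y"),
        ("Roger", "A", "X", "Y"),
        ("Roger", "A", "X", "Y"),
        ("Rafael", "A", "G", "H"),
        ("Rafael", "A", "G", "H"),
        ("Rafael", "B", "G", "H"),
    ]
    df = spark.createDataFrame(rows, ["Name", "Type", "Attribute 1", "Attribute 2"])
    write_deduplicated(df, "output/", ["Name", "Type"])
    spark.stop()
```

Two things to keep in mind. First, deduplication only applies to the DataFrame being written: with mode("append"), rows already sitting in output/ from an earlier run are not checked, so running the same job twice still leaves duplicates on disk. Second, Spark names partition directories after the column and value, for example output/Name=Roger/Type=A. If rows should count as duplicates based on only some columns, `dropDuplicates` also accepts a list of column names.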
