Difference between DISTRIBUTE BY and a shuffle in Spark SQL

1 answer

User

#1 · Posted 2024-05-16 11:31:12

Let me try to answer each part of your question:

As per my understanding, the Spark SQL optimizer will distribute the datasets of both tables participating in the join based on the join keys (the shuffle phase), to co-locate the same keys in the same partition. If that is the case, then by using DISTRIBUTE BY in the SQL we are doing the same thing.

Yes, that is correct.
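To make the co-location idea concrete, here is a minimal sketch in plain Python (no Spark required). It assumes a stand-in hash function (`zlib.crc32`), whereas Spark uses its own hash internally; the table contents and the helper names `partition_id` and `hash_partition` are made up for illustration. The point is only that hashing both tables on the join key with the same partition count puts matching keys into the same partition.

```python
import zlib
from collections import defaultdict


def partition_id(key, num_partitions):
    # Stand-in for a hash partitioner: a stable hash of the key,
    # taken modulo the number of partitions. (Not Spark's actual hash.)
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def hash_partition(rows, key_index, num_partitions):
    # Group rows into partitions according to the hash of their key column.
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_id(row[key_index], num_partitions)].append(row)
    return dict(partitions)


# Hypothetical tables, both keyed by user_id in column 0.
orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
users = [(1, "Ann"), (2, "Bob"), (3, "Cy")]

# Same key and same partition count on both sides, which is what
# a shuffle on the join key (or DISTRIBUTE BY user_id) arranges.
orders_parts = hash_partition(orders, 0, 4)
users_parts = hash_partition(users, 0, 4)

for pid in sorted(orders_parts):
    print(pid, orders_parts[pid], users_parts.get(pid, []))
```

Each printed line shows one partition: every order sits next to the user row with the same user_id, so the join can be evaluated partition by partition.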

So in what way can DISTRIBUTE BY be used to improve join performance?

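Sometimes one of your tables is already distributed, for example because it was bucketed or because its data was aggregated by the same key before the join. In that case, if you explicitly repartition the second table by that key as well (DISTRIBUTE BY), both branches of the join end up with the same partitioning, and Spark will not introduce another shuffle in the first branch. This is sometimes called a one-side shuffle-free join, because the shuffle happens in only one branch of the join: the one where you call repartition / DISTRIBUTE BY. If, on the other hand, you do not explicitly repartition the other table, Spark sees that the two branches have different partitioning and shuffles both of them. So in some special cases, calling repartition (DISTRIBUTE BY) can save you one shuffle.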

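Note that for this to work you also need the same number of partitions in both branches. So if you have two tables that you want to join on the key user_id, and the first table is bucketed into 10 buckets by this key, then you need to repartition the other table into 10 partitions by the same key as well. The join will then have only one shuffle (in the physical plan you will see an Exchange operator in only one branch of the join).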
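The case described above could look roughly like this with the PySpark DataFrame API. This is a sketch under assumptions: the helper name `join_with_one_sided_shuffle` and the table names in the usage comments are hypothetical, and the function only relies on the standard `DataFrame.repartition(numPartitions, *cols)` and `DataFrame.join` calls.

```python
def join_with_one_sided_shuffle(bucketed_df, other_df, key="user_id", num_buckets=10):
    """Join a table already distributed by `key` to another DataFrame,
    explicitly repartitioning only the second one."""
    # Repartition the non-bucketed side by the join key into exactly as many
    # partitions as the first side has buckets, so both sides line up.
    other_repartitioned = other_df.repartition(num_buckets, key)
    # The bucketed side is deliberately left untouched.
    return bucketed_df.join(other_repartitioned, on=key, how="inner")


# Usage sketch (needs a running SparkSession named `spark`; table names are made up):
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)  # avoid a broadcast join in the demo
# events = spark.table("events_bucketed")  # assumed bucketed by user_id into 10 buckets
# users = spark.table("users")             # assumed not bucketed
# join_with_one_sided_shuffle(events, users).explain()
```

In the plan printed by `explain()`, the thing to look for is whether an Exchange operator shows up in only one branch of the join. The exact plan depends on the Spark version and settings, so treat this as something to verify rather than a guarantee.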

Or is it that it is better to use distribute by while writing the data to disk by the load process, so that subsequent queries using this data will benefit from it by not having to shuffle it ?

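Well, this is actually called bucketing (CLUSTERED BY in the table definition). It lets you pre-shuffle the data once; after that, every time you read the data and join it (or aggregate it) by the same key it was bucketed by, it will not be shuffled again. So yes, this is a very common technique: you pay the cost only once, when saving the data, and then benefit from it every time you read the data.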
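Writing a bucketed table from PySpark might be sketched as follows. The helper name `write_bucketed` and the table name in the usage comment are hypothetical; the calls themselves (`bucketBy`, `sortBy`, `mode`, `saveAsTable` on `DataFrameWriter`) are the standard ones, and bucketing requires saving as a table rather than to a plain path.

```python
def write_bucketed(df, table_name, key="user_id", num_buckets=10):
    """Save `df` as a table bucketed (and sorted within buckets) by `key`."""
    (
        df.write
        .bucketBy(num_buckets, key)   # pre-shuffle: hash rows by key into a fixed number of buckets
        .sortBy(key)                  # optional: keep each bucket sorted by the key
        .mode("overwrite")
        .saveAsTable(table_name)      # bucketing metadata is stored with the table
    )


# Usage sketch (needs a running SparkSession named `spark`; names are made up):
# write_bucketed(spark.table("events_raw"), "events_bucketed")
# Later reads via spark.table("events_bucketed") can reuse the bucketing
# when joining or aggregating on user_id.
```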
