Spark offers two transformations to resize (increase/decrease) the # of partitions of an RDD/DataFrame/Dataset:
- repartition
- coalesce
repartition
always results in a full shuffle; can be used to either increase or decrease the # of partitions (see the sketch below)
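For illustration, a minimal runnable sketch of repartition (the session setup and the sample data are assumptions made for the demo, not part of the original notes):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("repartition-demo")
  .master("local[4]")
  .getOrCreate()

// sample data starting with 8 partitions
val df = spark.range(0, 1000000L).repartition(8)

// increase: a full shuffle spreads rows across 16 partitions
println(df.repartition(16).rdd.getNumPartitions) // 16

// decrease: repartition still performs a full shuffle
println(df.repartition(2).rdd.getNumPartitions)  // 2
```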
coalesce
may or may not result in a shuffle, depending directly on the requested # of partitions
if asked to increase the partition count, it stays at the current # of partitions, as if nothing happened,
but if asked to decrease, coalesce avoids a full shuffle by merging local partitions, i.e. partitions within the same executor (see the sketch below)
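The mirror-image sketch for coalesce, reusing the df defined in the snippet above:

```scala
// decrease: partitions on the same executor are merged, no full shuffle
println(df.coalesce(2).rdd.getNumPartitions)  // 2

// "increase": coalesce cannot add partitions, the count stays at 8
println(df.coalesce(32).rdd.getNumPartitions) // 8
```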
We have 2 types of coalesce:
- coalesce
- drastic coalesce
Sample example: dataframe.coalesce(n), where n is the target # of partitions.
if n is smaller than the # of nodes holding the data (e.g. coalesce(1)), a drastic coalesce takes place: the computation collapses onto fewer nodes than you may want. With the RDD API you can pass shuffle = true to add a shuffle step so the upstream partitions still run in parallel; otherwise coalesce performs no shuffle.
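A sketch of the drastic case via the RDD API, whose coalesce takes an optional shuffle flag (rdd is derived from the hypothetical df above):

```scala
// drop to the RDD API, where coalesce has an optional shuffle flag
val rdd = df.rdd

// drastic coalesce without shuffle: no shuffle step, but all upstream
// work runs in the single task that produces the one partition
val single = rdd.coalesce(1)
println(single.getNumPartitions)         // 1

// shuffle = true adds a shuffle step, so the upstream partitions still
// execute in parallel before being merged down to 1
val singleParallel = rdd.coalesce(1, shuffle = true)
println(singleParallel.getNumPartitions) // 1
```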
Best Practice:
if partition sizes are imbalanced (skewed), it's better to re-distribute the data using repartition with a different partitioning logic, as in the sketch below
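A sketch of re-balancing, again reusing the hypothetical df from above; in recent Spark versions a plain repartition(n) uses round-robin partitioning, which evens out partition sizes regardless of the original skew:

```scala
// hypothetical data squeezed into 3 (likely uneven) partitions
val skewed = df.coalesce(3)

// plain repartition(n) round-robins the rows, redistributing them into
// evenly sized partitions (at the cost of a shuffle)
val balanced = skewed.repartition(8)

// inspect the resulting partition sizes
balanced.rdd
  .mapPartitions(it => Iterator(it.size))
  .collect()
  .foreach(println)
```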