Spark offers two transformations to resize (increase/decrease) the # of partitions of an RDD/DataFrame/Dataset:
- repartition
- coalesce
repartition
always results in a full shuffle; can be used to either increase or decrease the # of partitions (see the sketch below)
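For illustration, a minimal runnable sketch of repartition (the session setup and the sample data are assumptions made for the demo, not part of the original notes):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("repartition-demo")
  .master("local[4]")
  .getOrCreate()

// sample data starting with 8 partitions
val df = spark.range(0, 1000000L).repartition(8)

// increase: a full shuffle spreads rows across 16 partitions
println(df.repartition(16).rdd.getNumPartitions) // 16

// decrease: repartition still performs a full shuffle
println(df.repartition(2).rdd.getNumPartitions)  // 2
```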
coalesce
may or may not result in a shuffle, depending directly on the requested # of partitions
if asked to increase the partition count, it stays at the current # of partitions, as if nothing happened,
but if asked to decrease, coalesce avoids a full shuffle by merging local partitions, i.e. partitions within the same executor (see the sketch below)
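The mirror-image sketch for coalesce, reusing the df defined in the snippet above:

```scala
// decrease: partitions on the same executor are merged, no full shuffle
println(df.coalesce(2).rdd.getNumPartitions)  // 2

// "increase": coalesce cannot add partitions, the count stays at 8
println(df.coalesce(32).rdd.getNumPartitions) // 8
```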
We have 2 types of coalesce:
- coalesce
- drastic coalesce
Sample example: dataframe.coalesce(n), where n is the target # of partitions.
if n is smaller than the # of nodes holding the data (e.g. coalesce(1)), a drastic coalesce takes place: the computation collapses onto fewer nodes than you may want. With the RDD API you can pass shuffle = true to add a shuffle step so the upstream partitions still run in parallel; otherwise coalesce performs no shuffle.
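A sketch of the drastic case via the RDD API, whose coalesce takes an optional shuffle flag (rdd is derived from the hypothetical df above):

```scala
// drop to the RDD API, where coalesce has an optional shuffle flag
val rdd = df.rdd

// drastic coalesce without shuffle: no shuffle step, but all upstream
// work runs in the single task that produces the one partition
val single = rdd.coalesce(1)
println(single.getNumPartitions)         // 1

// shuffle = true adds a shuffle step, so the upstream partitions still
// execute in parallel before being merged down to 1
val singleParallel = rdd.coalesce(1, shuffle = true)
println(singleParallel.getNumPartitions) // 1
```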
Best Practice:
if partition sizes are imbalanced (skewed), it's better to re-distribute the data using repartition with a different partitioning logic, as in the sketch below
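A sketch of re-balancing, again reusing the hypothetical df from above; in recent Spark versions a plain repartition(n) uses round-robin partitioning, which evens out partition sizes regardless of the original skew:

```scala
// hypothetical data squeezed into 3 (likely uneven) partitions
val skewed = df.coalesce(3)

// plain repartition(n) round-robins the rows, redistributing them into
// evenly sized partitions (at the cost of a shuffle)
val balanced = skewed.repartition(8)

// inspect the resulting partition sizes
balanced.rdd
  .mapPartitions(it => Iterator(it.size))
  .collect()
  .foreach(println)
```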