Coalesce in Spark —

Spark offers two transformations to resize (increase/decrease) the number of partitions of an RDD/DataFrame/Dataset:

  1. repartition
  2. coalesce

repartition always results in a shuffle and can be used to increase or decrease the number of partitions.
coalesce may or may not result in a shuffle, depending on the target number of partitions.

When asked to increase the partition count, coalesce stays at the current number of partitions, as if nothing happened,
but when decreasing, coalesce optimizes the data movement by merging local partitions, i.e. partitions within the same executor. A quick comparison is sketched below.
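A minimal PySpark sketch of the difference, assuming a local SparkSession; the DataFrame and the partition counts in the comments are illustrative only:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-demo").getOrCreate()

df = spark.range(1_000_000)            # example DataFrame with Spark's default partitioning
print(df.rdd.getNumPartitions())       # e.g. 8, depends on the cluster defaults

# repartition always shuffles; it can both increase and decrease the partition count
df_up = df.repartition(16)
print(df_up.rdd.getNumPartitions())    # 16

# coalesce only merges existing partitions; asking for more has no effect
df_more = df.coalesce(32)
print(df_more.rdd.getNumPartitions())  # unchanged from the original count

# decreasing with coalesce merges partitions locally, avoiding a full shuffle
df_down = df.coalesce(4)
print(df_down.rdd.getNumPartitions())  # 4
```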

There are two types of coalesce:

  1. coalesce
  2. drastic coalesce

Example:

dataframe.coalesce(n), where n is the target number of partitions.
If n is smaller than the number of data nodes, a drastic coalesce takes place, which forces data to move across nodes; otherwise the merge stays local and no shuffle takes place.
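A hedged sketch of a drastic coalesce, reusing the illustrative `df` from the previous snippet:

```python
# Drastic coalesce: merging down to far fewer partitions than there are nodes
# (here a single partition), so the remaining work lands on very few nodes.
single = df.coalesce(1)
print(single.rdd.getNumPartitions())             # 1

# If a full redistribution is actually wanted, repartition(1) on the DataFrame,
# or the RDD-level coalesce with shuffle=True, does that explicitly.
single_shuffled = df.repartition(1)              # DataFrame API: always shuffles
rdd_shuffled = df.rdd.coalesce(1, shuffle=True)  # RDD API: opt-in shuffle
```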


Best practice:
If the partitions are imbalanced in size, it is better to redistribute the data using a different partitioning logic (e.g. repartition on a suitable key), as sketched below.
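A minimal sketch of such a redistribution; `df_skewed` and the column name `customer_id` are assumptions for illustration, not part of the original note:

```python
from pyspark.sql import functions as F

# Hash-repartition on a column to spread rows evenly across 16 new partitions
balanced = df_skewed.repartition(16, "customer_id")

# If one key value dominates, a salted key can break it apart before repartitioning
salted = (
    df_skewed
    .withColumn("salt", (F.rand() * 16).cast("int"))  # random salt in [0, 15]
    .repartition(16, "salt")
)
```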

Reference:

  1. https://stackoverflow.com/a/67915928/6842300
  2. https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.coalesce