Central to the concept of Dataset is the Encoder framework, which gives Datasets storage and execution efficiency gains compared to RDDs.
An encoder of a particular type encodes either a Java object or a data record into a binary format backed by raw memory, and decodes it back again.
Encoders are part of Spark’s Tungsten framework.
Because the encoded data is backed by raw memory, updating or querying the relevant information in the encoded binary format is done via Java Unsafe APIs.
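To make the idea concrete, here is a minimal sketch of encoding an object into a fixed binary layout and then reading or updating a single field in place, without decoding the whole object. It is not Spark's actual encoder or row format: the `Person` class, the field offsets, and the class name `BinaryRowSketch` are all hypothetical, and an off-heap `java.nio.ByteBuffer` stands in for the Unsafe-backed raw memory that Tungsten uses.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BinaryRowSketch {

    // Hypothetical object type, used only for this illustration.
    static final class Person {
        final int id;
        final long salary;
        final String name;

        Person(int id, long salary, String name) {
            this.id = id;
            this.salary = salary;
            this.name = name;
        }
    }

    // Assumed layout: [id: 4 bytes][salary: 8 bytes][name length: 4 bytes][name bytes]
    static final int ID_OFFSET = 0;
    static final int SALARY_OFFSET = 4;
    static final int NAME_LEN_OFFSET = 12;
    static final int NAME_OFFSET = 16;

    // Object -> binary format, stored in off-heap (direct) memory.
    static ByteBuffer encode(Person p) {
        byte[] name = p.name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer row = ByteBuffer.allocateDirect(NAME_OFFSET + name.length);
        row.putInt(ID_OFFSET, p.id);
        row.putLong(SALARY_OFFSET, p.salary);
        row.putInt(NAME_LEN_OFFSET, name.length);
        for (int i = 0; i < name.length; i++) {
            row.put(NAME_OFFSET + i, name[i]);
        }
        return row;
    }

    // Binary format -> object (the "vice versa" direction).
    static Person decode(ByteBuffer row) {
        byte[] name = new byte[row.getInt(NAME_LEN_OFFSET)];
        for (int i = 0; i < name.length; i++) {
            name[i] = row.get(NAME_OFFSET + i);
        }
        return new Person(row.getInt(ID_OFFSET), row.getLong(SALARY_OFFSET),
                new String(name, StandardCharsets.UTF_8));
    }

    // Query one field directly at its offset; no Person object is created.
    static long readSalary(ByteBuffer row) {
        return row.getLong(SALARY_OFFSET);
    }

    // Update one field in place at its offset.
    static void writeSalary(ByteBuffer row, long salary) {
        row.putLong(SALARY_OFFSET, salary);
    }
}
```

For example, after `ByteBuffer row = BinaryRowSketch.encode(new BinaryRowSketch.Person(7, 50000L, "Ada"))`, calling `readSalary(row)` touches only the eight bytes at the salary offset, and `writeSalary(row, 60000L)` overwrites them without rebuilding the object. Knowing where each field sits in the binary layout is what lets an engine operate on encoded data directly.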
Encoder outputs - Binary Format:
Encoders provide three broad benefits that have made Datasets what they are today:
- Storage efficiency:
- Query efficiency:
- Shuffle efficiency:
Reference: