
[FEA] Introduce kudo shuffle format. #2496

Open
liurenjie1024 opened this issue Oct 11, 2024 · 2 comments

liurenjie1024 commented Oct 11, 2024

Design

Kudo is a serialization format optimized for the columnar batch serialization used during Spark shuffle, and it significantly improves serialization/deserialization time compared to the jcudf serialization format. The improvements are based on two observations:

  1. During Spark shuffle the runtime already provides a lot of context, such as the table schema, which means we can simplify the header. Kudo's header contains only the necessary fields: the row offset, the number of rows, the data lengths, and one byte per column indicating whether that column has a validity buffer.
  2. A GPU columnar batch is typically much larger than the batches of a CPU vectorized execution engine, which means we almost always need to concatenate batches at shuffle read time. When serializing a slice of a larger columnar batch, unlike jcudf, which computes exact validity and offset buffers, we only record the row offset and number of rows in the header and copy the necessary bytes, since the exact buffers can be reconstructed at read time during concatenation. This saves a lot of computation during serialization (see the sketch after this list).

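To make these two observations concrete, below is a minimal, hypothetical sketch in Java of what a kudo-style writer might do for one sliced column batch. The class name, field names, and on-wire layout are illustrative assumptions rather than the actual spark-rapids-jni implementation; the point is that the header carries only a row offset, a row count, buffer lengths, and a per-column validity flag, and that the validity bitmap is copied as whole bytes covering the sliced row range instead of being repacked bit by bit.

```java
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch of a kudo-style header and validity-buffer slicing.
// Names and layout are assumptions for illustration, not the real implementation.
final class KudoSliceSketch {
    final long rowOffset;     // first row of this slice within the original batch
    final int numRows;        // number of rows in this slice
    final long[] bufferLens;  // serialized length of each buffer that follows the header
    final byte[] hasValidity; // one byte per column: 1 if that column ships a validity buffer

    KudoSliceSketch(long rowOffset, int numRows, long[] bufferLens, byte[] hasValidity) {
        this.rowOffset = rowOffset;
        this.numRows = numRows;
        this.bufferLens = bufferLens;
        this.hasValidity = hasValidity;
    }

    // Observation 1: no schema in the header, because the Spark runtime already knows it.
    void writeHeader(DataOutputStream out) throws IOException {
        out.writeLong(rowOffset);
        out.writeInt(numRows);
        out.writeInt(bufferLens.length);
        for (long len : bufferLens) {
            out.writeLong(len);
        }
        out.write(hasValidity);
    }

    // Observation 2: copy the validity bitmap as whole bytes covering the row range.
    // The reader re-aligns the bits when it concatenates batches, so the writer does
    // not need to shift the bitmap so that the slice starts at bit 0.
    static void writeValiditySlice(byte[] validity, long rowOffset, int numRows,
                                   DataOutputStream out) throws IOException {
        int firstByte = (int) (rowOffset / 8);
        int lastByte = (int) ((rowOffset + numRows - 1) / 8);
        out.write(validity, firstByte, lastByte - firstByte + 1);
    }
}
```

In this sketch the reader would use `rowOffset % 8` to find where the slice's bits start inside the first copied byte, which is exactly the work that is deferred from serialization to the concatenation step at read time.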
Performance

We have observed serialization time improvements of 30%-4000%, deserialization time improvements of up to 200%, and comparable batch concatenation performance.


jlowe commented Oct 24, 2024

@liurenjie1024 please add details to this. As-is, it's just a headline with nothing else to go on.

@liurenjie1024

> @liurenjie1024 please add details to this. As-is, it's just a headline with nothing else to go on.

Fixed.

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Oct 30, 2024
This is the first PR of [a larger one](NVIDIA/spark-rapids-jni#2532) to introduce a new serialization format. It makes `ai.rapids.cudf.HostMemoryBuffer#copyFromStream` public. For more background, see NVIDIA/spark-rapids-jni#2496.

Authors:
  - Renjie Liu (https://github.com/liurenjie1024)
  - Jason Lowe (https://github.com/jlowe)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Alessandro Bellina (https://github.com/abellina)

URL: #17179
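
For context on the cudf change referenced above, here is a rough usage sketch of reading serialized bytes from a shuffle stream directly into a `HostMemoryBuffer` via the newly public `copyFromStream`. The `(destOffset, InputStream, byteLength)` signature shown here is an assumption based on the description above and should be checked against the actual cudf Java API.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

import ai.rapids.cudf.HostMemoryBuffer;

public class CopyFromStreamSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the bytes of one serialized kudo table arriving from shuffle.
        byte[] shuffleBytes = new byte[]{1, 2, 3, 4, 5, 6, 7, 8};
        InputStream in = new ByteArrayInputStream(shuffleBytes);

        // Copy straight from the stream into host memory, avoiding an extra
        // on-heap byte[] copy before the data is handed to the GPU.
        try (HostMemoryBuffer buf = HostMemoryBuffer.allocate(shuffleBytes.length)) {
            buf.copyFromStream(0, in, shuffleBytes.length); // assumed signature
            System.out.println("first byte: " + buf.getByte(0));
        }
    }
}
```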