
[FEA] Introduce kudo shuffle format. #2496

Open
liurenjie1024 opened this issue Oct 11, 2024 · 2 comments

liurenjie1024 commented Oct 11, 2024

Design

Kudo is a serialization format optimized for the columnar batch serialization used during Spark shuffle, and it significantly improves serialization/deserialization time compared to the jcudf serialization format. The improvements are based on two observations:

  1. During Spark shuffle the runtime already provides a lot of context, such as the table schema, which means we can simplify the header. Kudo's header contains only the necessary fields: the row offset, the number of rows, the data lengths, and one byte per column indicating whether that column has a validity buffer.
  2. A GPU columnar batch is typically much larger than the batches of a CPU vectorized execution engine, which means we almost always need to concatenate batches at shuffle read time. When serializing a slice of a larger columnar batch, unlike jcudf, which computes exact validity and offset buffers, we only record the row offset and number of rows in the header and copy the necessary bytes, since the exact buffers can be reconstructed at read time during concatenation. This saves a lot of computation during serialization (see the sketch after this list).

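To make these two observations concrete, below is a minimal, hypothetical sketch in Java of what a kudo-style writer might do for one sliced column batch. The class name, field names, and on-wire layout are illustrative assumptions rather than the actual spark-rapids-jni implementation; the point is that the header carries only a row offset, a row count, buffer lengths, and a per-column validity flag, and that the validity bitmap is copied as whole bytes covering the sliced row range instead of being repacked bit by bit.

```java
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch of a kudo-style header and validity-buffer slicing.
// Names and layout are assumptions for illustration, not the real implementation.
final class KudoSliceSketch {
    final long rowOffset;     // first row of this slice within the original batch
    final int numRows;        // number of rows in this slice
    final long[] bufferLens;  // serialized length of each buffer that follows the header
    final byte[] hasValidity; // one byte per column: 1 if that column ships a validity buffer

    KudoSliceSketch(long rowOffset, int numRows, long[] bufferLens, byte[] hasValidity) {
        this.rowOffset = rowOffset;
        this.numRows = numRows;
        this.bufferLens = bufferLens;
        this.hasValidity = hasValidity;
    }

    // Observation 1: no schema in the header, because the Spark runtime already knows it.
    void writeHeader(DataOutputStream out) throws IOException {
        out.writeLong(rowOffset);
        out.writeInt(numRows);
        out.writeInt(bufferLens.length);
        for (long len : bufferLens) {
            out.writeLong(len);
        }
        out.write(hasValidity);
    }

    // Observation 2: copy the validity bitmap as whole bytes covering the row range.
    // The reader re-aligns the bits when it concatenates batches, so the writer does
    // not need to shift the bitmap so that the slice starts at bit 0.
    static void writeValiditySlice(byte[] validity, long rowOffset, int numRows,
                                   DataOutputStream out) throws IOException {
        int firstByte = (int) (rowOffset / 8);
        int lastByte = (int) ((rowOffset + numRows - 1) / 8);
        out.write(validity, firstByte, lastByte - firstByte + 1);
    }
}
```

In this sketch the reader would use `rowOffset % 8` to find where the slice's bits start inside the first copied byte, which is exactly the work that is deferred from serialization to the concatenation step at read time.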
Performance

We have observed serialization time improvements of 30%-4000%, deserialization time improvements of up to 200%, and comparable batch concatenation performance.


jlowe commented Oct 24, 2024

@liurenjie1024 please add details to this. As-is, it's just a headline with nothing else to go on.

@liurenjie1024

> @liurenjie1024 please add details to this. As-is, it's just a headline with nothing else to go on.

Fixed.

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Oct 30, 2024
This is the first PR of [a larger one](NVIDIA/spark-rapids-jni#2532) to introduce a new serialization format. It makes `ai.rapids.cudf.HostMemoryBuffer#copyFromStream` public. For more background, see NVIDIA/spark-rapids-jni#2496.

Authors:
  - Renjie Liu (https://github.com/liurenjie1024)
  - Jason Lowe (https://github.com/jlowe)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Alessandro Bellina (https://github.com/abellina)

URL: #17179
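
For context on the cudf change referenced above, here is a rough usage sketch of reading serialized bytes from a shuffle stream directly into a `HostMemoryBuffer` via the newly public `copyFromStream`. The `(destOffset, InputStream, byteLength)` signature shown here is an assumption based on the description above and should be checked against the actual cudf Java API.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

import ai.rapids.cudf.HostMemoryBuffer;

public class CopyFromStreamSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the bytes of one serialized kudo table arriving from shuffle.
        byte[] shuffleBytes = new byte[]{1, 2, 3, 4, 5, 6, 7, 8};
        InputStream in = new ByteArrayInputStream(shuffleBytes);

        // Copy straight from the stream into host memory, avoiding an extra
        // on-heap byte[] copy before the data is handed to the GPU.
        try (HostMemoryBuffer buf = HostMemoryBuffer.allocate(shuffleBytes.length)) {
            buf.copyFromStream(0, in, shuffleBytes.length); // assumed signature
            System.out.println("first byte: " + buf.getByte(0));
        }
    }
}
```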