
[BUG] Parquet scans producing more batches than required for the target batch size #11701

Open
jlowe opened this issue Nov 6, 2024 · 3 comments
Labels: ? - Needs Triage (Need team to review and classify), bug (Something isn't working), performance (A performance related task/issue)

Comments

@jlowe (Member) commented on Nov 6, 2024

While investigating some NDS queries, I noticed that Parquet scans sometimes produce more batches than required for the target batch size. For example, the screenshot below of a query plan snippet from query23a running on Dataproc shows a Parquet scan producing 447 batches, averaging over two batches per task, followed by a filter that removes no rows, and then a coalesce that reduces the batch count from 447 to 156. That implies many tasks produced more batches during the scan than necessary, and it's likely we would get sub-linear scaling (i.e., lower total cost) if we processed each task's Parquet data in one shot rather than in many batches.

[Screenshot: query23a plan snippet on Dataproc — Parquet scan producing 447 batches, followed by a filter and a coalesce down to 156 batches]
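As a point of reference, here is a minimal sketch of driving a scan while pinning the target batch size, assuming it is the spark.rapids.sql.batchSizeBytes setting; the 1 GiB value, input path, and filter below are illustrative, not the actual Dataproc setup:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the config value, input path, and filter are illustrative assumptions.
val spark = SparkSession.builder()
  .appName("parquet-target-batch-size-check")
  .getOrCreate()

// Target size (in bytes) of batches produced by GPU operators, including the Parquet scan.
spark.conf.set("spark.rapids.sql.batchSizeBytes", (1L << 30).toString)

val df = spark.read.parquet("/data/nds/store_sales")   // hypothetical path
df.filter("ss_sold_date_sk IS NOT NULL").count()       // stand-in for query23a
// The SQL tab of the Spark UI then shows how many batches the scan produced per task.
```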

@jlowe added the ? - Needs Triage, bug, and performance labels on Nov 6, 2024
@jlowe (Member, Author) commented on Nov 6, 2024

The first suspect was the Parquet chunked batch reader, but when the query was re-run with spark.rapids.sql.reader.chunked=false, the extra batches in the scan persisted.
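For completeness, a sketch of that rerun with the chunked reader disabled; only the config key comes from the comment above, the session setup is illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: rerun the query with the Parquet chunked batch reader disabled.
val spark = SparkSession.builder()
  .appName("q23a-no-chunked-reader")
  .config("spark.rapids.sql.reader.chunked", "false") // disable chunked reading
  .getOrCreate()

// ... run query23a as before; the scan still averaged over two batches per task.
```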

@revans2 (Collaborator) commented on Nov 7, 2024

At this point I think it likely has something to do with the different reader types. This is happening in the cloud, and it looks like either the multi-threaded combining reader or the plain multi-threaded reader is the cause.
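A hedged sketch of one way to confirm which reader is responsible: pin the Parquet reader type for separate runs and compare batches per task. The config key and values are assumed from the plugin's reader-type setting, not quoted from this thread:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: force one Parquet reader implementation at a time and compare batch counts.
// Config key/values are assumed (spark.rapids.sql.format.parquet.reader.type).
def runWithReaderType(readerType: String): Unit = {
  val spark = SparkSession.builder()
    .appName(s"q23a-reader-$readerType")
    .config("spark.rapids.sql.format.parquet.reader.type", readerType)
    .getOrCreate()
  // ... run query23a here and record batches-per-task from the SQL UI ...
  spark.stop() // stop so the next run picks up a fresh session with its own config
}

Seq("PERFILE", "MULTITHREADED", "COALESCING").foreach(runWithReaderType)
```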

@jlowe (Member, Author) commented on Nov 7, 2024

This is definitely the multithreaded coalescing reader, as @tgravescs confirmed. There are two configs that relate to this; see https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala#L1119-L1139. I suspect the code does not account for the fact that more data could arrive while it is waiting for the GPU semaphore, so it could be more efficient with everything it has by the time it finally wakes up with the semaphore held.
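For illustration, a sketch of tuning the combining reader; the two config keys below are assumptions based on the linked RapidsConf section in branch-24.12, so verify them against the source before use:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: adjust how aggressively the multithreaded combining reader coalesces data.
// Both keys below are assumed, not quoted from this thread; check RapidsConf.scala.
val spark = SparkSession.builder()
  .appName("q23a-combine-tuning")
  // Assumed key: bytes the reader tries to accumulate into one batch before combining.
  .config("spark.rapids.sql.reader.multithreaded.combine.sizeBytes", (256L * 1024 * 1024).toString)
  // Assumed key: how long (ms) the reader waits for more data before combining what it has.
  .config("spark.rapids.sql.reader.multithreaded.combine.waitTime", "200")
  .getOrCreate()
```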
