
[BUG] Parquet scans producing more batches than required for the target batch size #11701

Open
jlowe opened this issue Nov 6, 2024 · 3 comments
Labels: ? - Needs Triage (Need team to review and classify), bug (Something isn't working), performance (A performance related task/issue)

Comments

@jlowe (Member) commented on Nov 6, 2024

While investigating some NDS queries, I noticed that Parquet scans sometimes produce more batches than required for the target batch size. For example, the screenshot below of a query plan snippet from query23a running on Dataproc shows a Parquet scan producing 447 batches, averaging over two batches per task, followed by a filter that removes no rows, and then a coalesce that reduces the batch count from 447 to 156. That implies many tasks produced more batches during the scan than necessary, and it's likely we would get sub-linear scaling (i.e., lower total cost) if we processed each task's Parquet data in one shot rather than in many batches.

[Screenshot: query23a plan snippet on Dataproc — Parquet scan producing 447 batches, followed by a filter and a coalesce down to 156 batches]
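As a point of reference, here is a minimal sketch of driving a scan while pinning the target batch size, assuming it is the spark.rapids.sql.batchSizeBytes setting; the 1 GiB value, input path, and filter below are illustrative, not the actual Dataproc setup:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the config value, input path, and filter are illustrative assumptions.
val spark = SparkSession.builder()
  .appName("parquet-target-batch-size-check")
  .getOrCreate()

// Target size (in bytes) of batches produced by GPU operators, including the Parquet scan.
spark.conf.set("spark.rapids.sql.batchSizeBytes", (1L << 30).toString)

val df = spark.read.parquet("/data/nds/store_sales")   // hypothetical path
df.filter("ss_sold_date_sk IS NOT NULL").count()       // stand-in for query23a
// The SQL tab of the Spark UI then shows how many batches the scan produced per task.
```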

@jlowe added the ? - Needs Triage, bug, and performance labels on Nov 6, 2024
@jlowe (Member, Author) commented on Nov 6, 2024

The first suspect was the Parquet chunked batch reader, but when the query was re-run with spark.rapids.sql.reader.chunked=false, the extra batches in the scan persisted.
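For completeness, a sketch of that rerun with the chunked reader disabled; only the config key comes from the comment above, the session setup is illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: rerun the query with the Parquet chunked batch reader disabled.
val spark = SparkSession.builder()
  .appName("q23a-no-chunked-reader")
  .config("spark.rapids.sql.reader.chunked", "false") // disable chunked reading
  .getOrCreate()

// ... run query23a as before; the scan still averaged over two batches per task.
```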

@revans2 (Collaborator) commented on Nov 7, 2024

At this point I think it likely has something to do with the different reader types. This is happening in the cloud, and it looks like either the multi-threaded combining reader or the plain multi-threaded reader is the cause.
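A hedged sketch of one way to confirm which reader is responsible: pin the Parquet reader type for separate runs and compare batches per task. The config key and values are assumed from the plugin's reader-type setting, not quoted from this thread:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: force one Parquet reader implementation at a time and compare batch counts.
// Config key/values are assumed (spark.rapids.sql.format.parquet.reader.type).
def runWithReaderType(readerType: String): Unit = {
  val spark = SparkSession.builder()
    .appName(s"q23a-reader-$readerType")
    .config("spark.rapids.sql.format.parquet.reader.type", readerType)
    .getOrCreate()
  // ... run query23a here and record batches-per-task from the SQL UI ...
  spark.stop() // stop so the next run picks up a fresh session with its own config
}

Seq("PERFILE", "MULTITHREADED", "COALESCING").foreach(runWithReaderType)
```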

@jlowe (Member, Author) commented on Nov 7, 2024

This is definitely the multithreaded coalescing reader, as @tgravescs confirmed. There are two configs that relate to this; see https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala#L1119-L1139. I suspect the code does not account for the fact that more data could arrive while it is waiting for the GPU semaphore, so it could be more efficient with everything it has by the time it finally wakes up with the semaphore held.
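For illustration, a sketch of tuning the combining reader; the two config keys below are assumptions based on the linked RapidsConf section in branch-24.12, so verify them against the source before use:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: adjust how aggressively the multithreaded combining reader coalesces data.
// Both keys below are assumed, not quoted from this thread; check RapidsConf.scala.
val spark = SparkSession.builder()
  .appName("q23a-combine-tuning")
  // Assumed key: bytes the reader tries to accumulate into one batch before combining.
  .config("spark.rapids.sql.reader.multithreaded.combine.sizeBytes", (256L * 1024 * 1024).toString)
  // Assumed key: how long (ms) the reader waits for more data before combining what it has.
  .config("spark.rapids.sql.reader.multithreaded.combine.waitTime", "200")
  .getOrCreate()
```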
