[BUG] Parquet scans producing more batches than required for the target batch size #11701
Labels: ? - Needs Triage, bug, performance
While investigating some NDS queries, I noticed that Parquet scans sometimes produce more batches than the target batch size requires. For example, this screenshot of a query plan snippet from query23a running on Dataproc shows a Parquet scan producing 447 batches, averaging over two batches per task, followed by a filter that removes no rows, and then a coalesce that reduces the batch count from 447 to 156. That implies many tasks produced more batches during the scan than necessary, and we would likely get sub-linear scaling if we processed the Parquet data in one shot rather than across many batches.
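To make "more batches than required" concrete, here is a minimal sketch of the arithmetic. It is not taken from the plugin code: the function names, the per-batch sizes, and the 1 GiB target are all hypothetical, chosen only to illustrate how a task's scan output can be compared against the fewest batches the target size would allow.

```python
def min_batches(task_bytes: int, target_batch_bytes: int) -> int:
    """Fewest batches needed if each batch may hold up to target_batch_bytes."""
    if task_bytes <= 0:
        return 0
    # Integer ceiling division, avoids floating-point rounding.
    return -(-task_bytes // target_batch_bytes)


def excess_batches(batch_sizes: list[int], target_batch_bytes: int) -> int:
    """Batches one task emitted beyond the minimum for its total output size."""
    return len(batch_sizes) - min_batches(sum(batch_sizes), target_batch_bytes)


MIB = 1024 * 1024
TARGET = 1024 * MIB  # hypothetical target batch size (1 GiB)

# Hypothetical task: the scan emits three 300 MiB batches that would fit in one.
print(excess_batches([300 * MIB] * 3, TARGET))

# Overall reduction observed in the plan: 447 scan batches coalesced to 156.
print(round(447 / 156, 2))
```

A non-zero result from `excess_batches` for a task corresponds to the situation described above, where the downstream coalesce has to merge batches that the scan could have emitted as one.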