Native Presto vs Vanilla Presto performance comparison #8305

xumingming · 2024-01-09T12:03:03Z

xumingming
Jan 9, 2024

I ran a performance test comparing Native Presto against Vanilla Presto. The basic setup is:

1 coordinator: 4 cores, 16 GB memory
2 workers: 16 cores, 64 GB memory each
Test data: TPC-H 100 GB Parquet data on Alibaba Cloud OSS

The results are as follows:

== Elapsed Time ==
┏━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Query ┃ vanilla_tpch_4 ┃ native_tpch_4 ┃ Speedup ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ q1    │ 29s 924ms      │ 4s 162ms      │ 7.19x   │
│ q2    │ 11s 24ms       │ 6s 335ms      │ 1.74x   │
│ q3    │ 26s 885ms      │ 24s 512ms     │ 1.10x   │
│ q4    │ 17s 214ms      │ 7s 580ms      │ 2.27x   │
│ q5    │ 35s 387ms      │ 44s 938ms     │ 0.79x   │
│ q6    │ 12s 516ms      │ 2s 13ms       │ 6.22x   │
│ q7    │ 36s 600ms      │ 16s 500ms     │ 2.22x   │
│ q8    │ 31s 858ms      │ 25s 992ms     │ 1.23x   │
│ q9    │ 45s 545ms      │ 37s 843ms     │ 1.20x   │
│ q10   │ 28s 781ms      │ 15s 107ms     │ 1.91x   │
│ q11   │ 27s 697ms      │ 21s 555ms     │ 1.28x   │
│ q12   │ 20s 147ms      │ 6s 20ms       │ 3.35x   │
│ q13   │ 21s 654ms      │ 10s 203ms     │ 2.12x   │
│ q14   │ 16s 555ms      │ 3s 328ms      │ 4.97x   │
│ q15   │ 24s 317ms      │ 6s 626ms      │ 3.67x   │
│ q16   │ 11s 221ms      │ 2s 853ms      │ 3.93x   │
│ q17   │ 49s 484ms      │ 42s 199ms     │ 1.17x   │
│ q18   │ 47s 166ms      │ 42s 858ms     │ 1.10x   │
│ q19   │ 24s 386ms      │ 4s 309ms      │ 5.66x   │
│ q20   │ 24s 151ms      │ 11s 734ms     │ 2.06x   │
│ q21   │ 38s 506ms      │ 25s 834ms     │ 1.49x   │
│ q22   │ 8s 16ms        │ 3s 909ms      │ 2.05x   │
│ Total │ 9m 49s 34ms    │ 6m 6s 410ms   │ 1.61x   │
└───────┴────────────────┴───────────────┴─────────┘

== CPU Time ==
┏━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Query ┃ vanilla_tpch_4  ┃ native_tpch_4 ┃ Speedup ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ q1    │ 13m 43s 800ms   │ 1m 28s 800ms  │ 9.28x   │
│ q2    │ 2m 53s 400ms    │ 51s 540ms     │ 3.36x   │
│ q3    │ 9m 58s 800ms    │ 2m 33s        │ 3.91x   │
│ q4    │ 7m 6s           │ 2m 27s        │ 2.90x   │
│ q5    │ 12m 9s 600ms    │ 6m 8s 400ms   │ 1.98x   │
│ q6    │ 4m 53s 400ms    │ 44s 630ms     │ 6.57x   │
│ q7    │ 16m 2s 400ms    │ 3m 30s        │ 4.58x   │
│ q8    │ 12m 42s         │ 3m 14s 400ms  │ 3.92x   │
│ q9    │ 19m 3s          │ 4m 55s 800ms  │ 3.86x   │
│ q10   │ 9m 53s 400ms    │ 2m 19s 800ms  │ 4.24x   │
│ q11   │ 2m 41s 400ms    │ 31s 830ms     │ 5.07x   │
│ q12   │ 8m 32s 400ms    │ 1m 12s        │ 7.12x   │
│ q13   │ 7m 58s 200ms    │ 3m 7s 800ms   │ 2.55x   │
│ q14   │ 5m 55s 200ms    │ 1m 14s 400ms  │ 4.77x   │
│ q15   │ 11m 39s 600ms   │ 2m 29s 400ms  │ 4.68x   │
│ q16   │ 1m 52s 800ms    │ 30s 860ms     │ 3.66x   │
│ q17   │ 20m 28s 200ms   │ 5m 15s        │ 3.90x   │
│ q18   │ 19m 37s 800ms   │ 6m 42s        │ 2.93x   │
│ q19   │ 10m 7s 200ms    │ 1m 23s 400ms  │ 7.28x   │
│ q20   │ 9m 44s 400ms    │ 2m 9s         │ 4.53x   │
│ q21   │ 14m 58s 200ms   │ 5m 25s 800ms  │ 2.76x   │
│ q22   │ 2m 4s 200ms     │ 38s 280ms     │ 3.24x   │
│ Total │ 3h 44m 5s 400ms │ 58m 53s 140ms │ 3.81x   │
└───────┴─────────────────┴───────────────┴─────────┘

== Peak Memory ==
┏━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Query ┃ vanilla_tpch_4 ┃ native_tpch_4 ┃ Speedup ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ q1    │ 327.5 MB       │ 128.0 MB      │ 2.56x   │
│ q2    │ 4.5 GB         │ 5.0 GB        │ 0.89x   │
│ q3    │ 17.2 GB        │ 20.8 GB       │ 0.83x   │
│ q4    │ 6.6 GB         │ 9.6 GB        │ 0.69x   │
│ q5    │ 33.9 GB        │ 37.5 GB       │ 0.91x   │
│ q6    │ 2.0 GB         │ 94.0 MB       │ 21.68x  │
│ q7    │ 6.8 GB         │ 8.5 GB        │ 0.80x   │
│ q8    │ 2.8 GB         │ 3.4 GB        │ 0.84x   │
│ q9    │ 9.0 GB         │ 11.1 GB       │ 0.81x   │
│ q10   │ 8.5 GB         │ 10.0 GB       │ 0.85x   │
│ q11   │ 797.2 MB       │ 418.0 MB      │ 1.91x   │
│ q12   │ 670.2 MB       │ 518.0 MB      │ 1.29x   │
│ q13   │ 1.7 GB         │ 1.9 GB        │ 0.90x   │
│ q14   │ 1.4 GB         │ 1.9 GB        │ 0.75x   │
│ q15   │ 1.1 GB         │ 1.6 GB        │ 0.67x   │
│ q16   │ 1.6 GB         │ 2.0 GB        │ 0.80x   │
│ q17   │ 3.9 GB         │ 1.2 GB        │ 3.16x   │
│ q18   │ 25.1 GB        │ 28.8 GB       │ 0.87x   │
│ q19   │ 1.9 GB         │ 1.8 GB        │ 1.03x   │
│ q20   │ 5.4 GB         │ 2.4 GB        │ 2.21x   │
│ q21   │ 35.9 GB        │ 39.8 GB       │ 0.90x   │
│ q22   │ 1.4 GB         │ 859.0 MB      │ 1.66x   │
│ Total │ 172.4 GB       │ 189.4 GB      │ 0.91x   │
└───────┴────────────────┴───────────────┴─────────┘

== Top Operators (CPU Time) ==
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Operator             ┃ vanilla_tpch_4  ┃ native_tpch_4 ┃ Speedup ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ ScanFilterAndProject │ 2h 11m 8s 724ms │ 26m 30s 196ms │ 4.00x   │
│ LookupJoin           │ 27m 37s 185ms   │ 6m 55s 359ms  │ 3.00x   │
│ PartitionedOutput    │ 26m 54s 705ms   │ 7m 27s 385ms  │ 3.00x   │
│ Aggregation          │ 24m 46s 606ms   │ 7m 47s 4ms    │ 3.00x   │
│ HashBuilder          │ 6m 42s 143ms    │ 5m 50s 642ms  │ 1.00x   │
│ LocalExchangeSink    │ 5m 22s 39ms     │ 2m 4s 126ms   │ 2.00x   │
│ Exchange             │ 1m 4s 959ms     │ 1m 51s 137ms  │ 0.00x   │
│ LocalExchangeSource  │ 16s 596ms       │ 24s 887ms     │ 0.00x   │
│ OrderBy              │ 3s 392ms        │ 141ms         │ 24.00x  │
│ LocalMergeSource     │ 1s 364ms        │ 97ms          │ 14.00x  │
│ TaskOutput           │ 594ms           │ 0s            │ 0.00x   │
│ Merge                │ 509ms           │ 85ms          │ 5.00x   │
│ TopN                 │ 434ms           │ 93ms          │ 4.00x   │
│ AssignUniqueId       │ 10ms            │               │ 0.00x   │
│ EnforceSingleRow     │                 │               │ 0.00x   │
│ StreamingAggregation │                 │ 0s            │ 0.00x   │
│ CallbackSink         │ 0s              │               │ 0.00x   │
└──────────────────────┴─────────────────┴───────────────┴─────────┘
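The Speedup column in the operator table looks truncated to whole numbers (for example 0.00x for Exchange, where native is slower but not infinitely so). A minimal sketch for recomputing the ratios from the durations as printed, assuming speedup means vanilla time divided by native time, as the per-query tables suggest. The helper names `parse_duration` and `speedup` are made up for this illustration and are not part of any Presto tooling.

```python
import re

# Seconds per unit for the duration format used in the tables above.
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_duration(text: str) -> float:
    """Convert a string such as '2h 11m 8s 724ms' to seconds."""
    total = 0.0
    # 'ms' is listed first so that '724ms' is not read as minutes.
    for value, unit in re.findall(r"(\d+)(ms|h|m|s)\b", text):
        total += int(value) * _UNIT_SECONDS[unit]
    return total


def speedup(vanilla: str, native: str) -> float:
    """Vanilla time divided by native time; values above 1 favor native."""
    return parse_duration(vanilla) / parse_duration(native)


# Three rows copied from the Top Operators table.
rows = {
    "ScanFilterAndProject": ("2h 11m 8s 724ms", "26m 30s 196ms"),
    "LookupJoin": ("27m 37s 185ms", "6m 55s 359ms"),
    "Exchange": ("1m 4s 959ms", "1m 51s 137ms"),
}

for name, (vanilla, native) in rows.items():
    print(f"{name}: {speedup(vanilla, native):.2f}x")
```

With these inputs the ratios come out at roughly 4.9x for ScanFilterAndProject, 4.0x for LookupJoin and 0.58x for Exchange, so the whole-number column understates several operators.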

I have some questions here:

Do these performance numbers make sense? The improvement seems huge. I have not done much tuning on either Native or Vanilla Presto, so I am not sure whether the improvement comes from an unoptimized setup or whether Native is really that good. Do we have official TPC-H test results?
The ScanFilterAndProject operator improvement is huge. Is this expected?
Are LookupJoin's 7x and Aggregation's 3x speedups expected?

I normalized some operator names to make an apples-to-apples operator comparison between Vanilla Presto and Native Presto possible:

FilterAndProject, FilterProject and TableScan are all normalized to ScanFilterAndProject

NestedLoopJoin, LookupOuter, HashSemiJoin and HashProbe are normalized to LookupJoin

PartialAggregation and HashAggregation are normalized to Aggregation
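The normalization above can be sketched as a lookup table applied before summing CPU time per operator. This is only an illustration of the idea, not the script used for the tables. The function name `aggregate_cpu_seconds` and the sample numbers are hypothetical.

```python
from collections import defaultdict

# Mapping taken from the normalization rules listed above.
# Operator names that are not listed pass through unchanged.
NORMALIZED_NAMES = {
    "FilterAndProject": "ScanFilterAndProject",
    "FilterProject": "ScanFilterAndProject",
    "TableScan": "ScanFilterAndProject",
    "NestedLoopJoin": "LookupJoin",
    "LookupOuter": "LookupJoin",
    "HashSemiJoin": "LookupJoin",
    "HashProbe": "LookupJoin",
    "PartialAggregation": "Aggregation",
    "HashAggregation": "Aggregation",
}


def aggregate_cpu_seconds(operator_stats):
    """Sum CPU seconds per normalized operator name.

    operator_stats is an iterable of (operator_name, cpu_seconds) pairs.
    """
    totals = defaultdict(float)
    for name, cpu_seconds in operator_stats:
        totals[NORMALIZED_NAMES.get(name, name)] += cpu_seconds
    return dict(totals)


# Made-up sample input, only to show the shape of the data.
sample = [
    ("TableScan", 10.0),
    ("FilterProject", 2.5),
    ("HashProbe", 4.0),
    ("HashAggregation", 3.0),
    ("PartialAggregation", 1.0),
    ("OrderBy", 0.5),
]
print(aggregate_cpu_seconds(sample))
```

Applying the same mapping to both engines before comparing is what makes the per-operator rows line up.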

Answered by yingsu00

Jan 21, 2024


xumingming · 2024-01-21T13:31:12Z

xumingming
Jan 21, 2024
Author

One finding on Q1: the costly stage is the table scan stage, which contains both a large table scan and an Aggregation.

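Vanilla

┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Operator    ┃ WallTime ┃ Driver Count ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ TableScan   │ 24m      │ 357          │
│ Aggregation │ 6.39m    │ 357          │
└─────────────┴──────────┴──────────────┘

Native

┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Operator    ┃ WallTime ┃ Driver Count ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ TableScan   │ 1m       │ 32           │
│ Aggregation │ 26.7s    │ 32           │
└─────────────┴──────────┴──────────────┘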

The table scan is on the lineitem table, which consists of 32 files of about 500 MB each. I downloaded one of the files; it consists of 4 row groups, so each row group is about 125 MB. These are not small files or row groups.

Vanilla Presto split the table scan into 357 drivers, while Native Presto split it into 32 drivers. I am not sure whether this is the determining factor.
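A quick back-of-the-envelope check on the Q1 numbers above. It assumes the reported WallTime is the total summed over all drivers of the operator, which is not confirmed here; if it is per-driver time instead, the division below does not apply. The function name `seconds_per_driver` is hypothetical.

```python
def seconds_per_driver(total_wall_seconds: float, drivers: int) -> float:
    """Average wall time per driver, assuming WallTime is summed over drivers."""
    return total_wall_seconds / drivers


# (total TableScan wall time in seconds, driver count) from the Q1 numbers above.
q1_scan = {
    "vanilla": (24 * 60, 357),
    "native": (1 * 60, 32),
}
for engine, (wall_seconds, drivers) in q1_scan.items():
    per_driver = seconds_per_driver(wall_seconds, drivers)
    print(f"{engine}: {per_driver:.2f}s per driver")

# Layout of the lineitem table as described above.
files, row_groups_per_file, file_mb = 32, 4, 500
total_row_groups = files * row_groups_per_file
mb_per_row_group = file_mb / row_groups_per_file
print(f"{total_row_groups} row groups of about {mb_per_row_group:.0f} MB each")
```

Under that assumption the average is roughly 4 s per driver for Vanilla and under 2 s for Native, even though each Native driver covers a whole 500 MB file. That would suggest the driver count alone does not explain the gap.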

0 replies

yingsu00 · 2024-01-21T14:34:08Z

yingsu00
Jan 21, 2024
Collaborator

The result seems reasonable. However, the native scan has actually regressed a bit since Feb 2023. We saw even larger scan improvements in Feb 2023 and observed some regression later in May, but haven't had time to root-cause it since then. You can cross-reference our results from last year:
Prestissimo Progress and Results.pptx

1 reply

xumingming Jan 22, 2024
Author

@yingsu00 Thanks for the information, it is very helpful, especially the highlights of why TableScan (Parquet) is faster:

Fully vectorized decoders
RowGroup elimination based on column filters
Filter pushdown to column readers on decoded data from input streams
Filter reordering based on run time stats
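To illustrate the second item, row group elimination based on column filters, here is a minimal sketch using per-row-group min/max statistics, which Parquet can store in its metadata. This is not the Velox reader code. The `RowGroupStats` class, the function name and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Min/max of one column within one row group (hypothetical structure)."""
    index: int
    min_value: int
    max_value: int


def row_groups_to_read(stats, lower, upper):
    """Return indexes of row groups that may contain values in [lower, upper].

    A row group is skipped when its [min, max] range cannot overlap the filter.
    """
    return [
        s.index
        for s in stats
        if s.max_value >= lower and s.min_value <= upper
    ]


# Made-up statistics for a file with 4 row groups.
stats = [
    RowGroupStats(0, 1, 100),
    RowGroupStats(1, 101, 200),
    RowGroupStats(2, 201, 300),
    RowGroupStats(3, 301, 400),
]

# Filter: column BETWEEN 150 AND 250
print(row_groups_to_read(stats, 150, 250))  # [1, 2]
```

Row groups 0 and 3 are never decoded in this example, so the saving grows with how well the data is clustered on the filtered column.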

mbasmanova · 2024-01-22T22:13:19Z

mbasmanova
Jan 22, 2024
Collaborator

CC: @FelixYBW

0 replies

mbasmanova · 2024-01-22T22:17:10Z

mbasmanova
Jan 22, 2024
Collaborator

@xumingming James, we are seeing something like this.

Both C++ and Java are dominated by TableScan and FilterProject, which together use more than 50% of the CPU time. We combine these because we cannot cleanly separate CPU time spent on Scan vs. FilterProject (Scan often includes pushdown filter, and Java doesn't provide a breakdown of CPU time between Scan and the immediately following Project). We observe that Scan + FilterProject in C++ is 3.5x more efficient than in Java.

Aggregation is the 2nd top operator in Java using 14%, but 4th in C++ with 11%. Aggregation in C++ is 4.5x faster than in Java.

TableWriter is the 3rd top operator both in Java and C++ using 13%. C++ version is 3.7 times faster.

PartitionedOutput is the 4th top operator in Java using 7%. It is 2nd in C++ using 13%. The C++ implementation is 2x faster. It is still an open question why the PartitionedOutput operator in C++ uses such a large portion of the CPU.

Join is using only 4% in Java and 1.5% in C++. It is 10x faster in C++.

We saw that Aggregation in C++ can be a lot faster because it uses array-based aggregation adaptively. Join can be a lot faster because it uses dynamic filter pushdown adaptively.
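A rough sketch of what array-based aggregation chosen adaptively can look like: when the grouping keys are integers in a small range, `key - min` is used directly as an array index instead of probing a hash table. This only illustrates the general idea and is not the Velox implementation. The function names and the `max_slots` threshold are assumptions made for the example.

```python
from collections import defaultdict


def hash_sum(keys, values):
    """Baseline: group-by sum using a hash table."""
    totals = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += value
    return dict(totals)


def array_sum(keys, values, key_min, key_max):
    """Group-by sum using a dense array indexed by (key - key_min)."""
    slots = [0] * (key_max - key_min + 1)
    seen = [False] * len(slots)
    for key, value in zip(keys, values):
        i = key - key_min
        slots[i] += value
        seen[i] = True
    return {key_min + i: total for i, total in enumerate(slots) if seen[i]}


def adaptive_sum(keys, values, max_slots=1 << 16):
    """Pick the array path when the key range is small enough (hypothetical threshold)."""
    if not keys:
        return {}
    key_min, key_max = min(keys), max(keys)
    if key_max - key_min + 1 <= max_slots:
        return array_sum(keys, values, key_min, key_max)
    return hash_sum(keys, values)


print(adaptive_sum([3, 5, 3, 4], [10, 20, 30, 40]))  # {3: 40, 4: 40, 5: 20}
```

The array path avoids hashing and key comparison on every row, which is where the speedup would come from. The sketch finds the key range with a full pass over the input, whereas a real engine would more likely track it as data arrives.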

1 reply

xumingming Jan 23, 2024
Author

@mbasmanova Thanks for the information and detailed explanation, appreciated!

FelixYBW · 2024-01-22T22:28:04Z

FelixYBW
Jan 22, 2024

FYI, latest time breakdown from Gluten:

0 replies

hackeryang · 2024-07-02T07:12:53Z

hackeryang
Jul 2, 2024

Presto 2.0 blog: https://prestodb.io/blog/2024/06/24/diving-into-the-presto-native-c-query-engine-presto-2-0/

0 replies
