Native Presto vs Vanilla Presto performance comparison #8305
I ran a performance test comparing Native Presto and Vanilla Presto. The basic setup is:
The results are as follows:
I have some questions:
One finding on Q1: the most expensive stage is the table-scan stage, which contains both a large table scan and an aggregation.
(Screenshots: Vanilla, Native)
The table scan reads the lineitem table, which consists of 32 files of about 500 MB each. I downloaded one of the files: it consists of 4 row groups, so each row group is about 125 MB, which means this is not a small-file or small-row-group problem. Vanilla Presto split the table across 357 drivers, while Native Presto used only 32 drivers; I'm not sure whether that is the determining factor.
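The sizing argument in this comment can be made concrete with a little arithmetic. The Python sketch below uses the approximate figures quoted here (32 files of about 500 MB, 4 row groups per file) and hypothetically assumes the data is spread evenly across drivers; the name `scan_sizing` is illustrative only.

```python
# Back-of-the-envelope sizing of the Q1 lineitem scan, using the approximate
# figures quoted above. Assumption: data is spread evenly across drivers,
# which neither engine is guaranteed to do.
FILES = 32
FILE_MB = 500
ROW_GROUPS_PER_FILE = 4


def scan_sizing(drivers):
    """Average data volume per driver for a given driver count (hypothetical helper)."""
    total_mb = FILES * FILE_MB
    total_row_groups = FILES * ROW_GROUPS_PER_FILE
    return {
        "total_mb": total_mb,
        "row_group_mb": FILE_MB / ROW_GROUPS_PER_FILE,
        "mb_per_driver": total_mb / drivers,
        "row_groups_per_driver": total_row_groups / drivers,
    }


for name, drivers in [("Vanilla", 357), ("Native", 32)]:
    s = scan_sizing(drivers)
    print(f"{name}: {drivers} drivers, ~{s['mb_per_driver']:.1f} MB and "
          f"~{s['row_groups_per_driver']:.2f} row groups per driver")
```

Under that assumption each Native driver handles roughly one whole file (about 500 MB, 4 row groups), while Vanilla's 357 drivers outnumber the 128 row groups and average roughly 45 MB each. This only compares average data volume; it says nothing about how either engine maps splits to row groups or drivers to threads.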
The result seems reasonable. However, the native scan has actually regressed a bit since February 2023. We saw even larger scan improvements in February 2023, observed some regression later in May, and haven't had time to root-cause it since then. You can cross-reference our results from last year:
Prestissimo Progress and Results.pptx
CC: @FelixYBW
@xumingming James, we are seeing something similar. In both C++ and Java, TableScan and FilterProject dominate, together using more than 50% of the CPU time. We combine these two because we cannot cleanly separate the CPU time spent on Scan from that spent on FilterProject: Scan often includes pushed-down filters, and Java doesn't break down CPU time between Scan and the immediately following Project. We observe that Scan + FilterProject in C++ is 3.5x more efficient than in Java.

Aggregation is the 2nd-largest operator in Java at 14% of CPU time, but 4th in C++ at 11%; Aggregation in C++ is 4.5x faster than in Java. TableWriter is the 3rd-largest operator in both Java and C++ at 13%; the C++ version is 3.7x faster. PartitionedOutput is 4th in Java at 7% but 2nd in C++ at 13%; the C++ implementation is 2x faster. It is still an open question why the PartitionedOutput operator in C++ uses such a large share of the CPU. Join uses only 4% in Java and 1.5% in C++, and is 10x faster in C++.

We saw that Aggregation in C++ can be much faster because it adaptively uses array-based aggregation, and Join can be much faster because it adaptively uses dynamic filter pushdown.
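The two adaptive techniques named in the last paragraph can be illustrated with a toy Python sketch. It is not Velox or Presto code: the function names, the 1024-entry threshold, and the one-time min/max check that picks the aggregation mode are all assumptions for illustration (a real engine adapts as data arrives).

```python
from collections import defaultdict

# Assumed threshold: use array mode only if the integer key range is this small.
MAX_ARRAY_RANGE = 1024


def sum_by_key(keys, values):
    """GROUP BY key, SUM(value) over integer keys (hypothetical helper).

    Array mode: if the keys span a small range, index a dense array by
    key - min(key), so no hashing is needed. Otherwise fall back to a
    hash table (a dict). The mode is chosen once from min/max, a simplification.
    """
    if not keys:
        return {}
    lo, hi = min(keys), max(keys)
    if hi - lo < MAX_ARRAY_RANGE:
        sums = [0] * (hi - lo + 1)
        seen = [False] * (hi - lo + 1)
        for k, v in zip(keys, values):
            sums[k - lo] += v
            seen[k - lo] = True
        return {lo + i: s for i, s in enumerate(sums) if seen[i]}
    # Hash mode for sparse or wide key ranges.
    sums = defaultdict(int)
    for k, v in zip(keys, values):
        sums[k] += v
    return dict(sums)


def hash_join_with_dynamic_filter(build_rows, probe_rows, key):
    """Inner-join two lists of dicts on `key` (hypothetical helper).

    After the build-side hash table is complete, derive a filter (min/max
    range plus exact key membership) and apply it to probe rows before the
    join lookup, mimicking a filter pushed down into the probe-side scan.
    Returns (joined_rows, number_of_probe_rows_skipped).
    """
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    if not table:
        # Empty build side: the filter rejects every probe row.
        return [], len(probe_rows)
    lo, hi = min(table), max(table)

    def dynamic_filter(row):
        k = row[key]
        return lo <= k <= hi and k in table

    # In a real engine this filter runs inside the scan, so rejected rows are
    # never materialized; here we just count how many are skipped.
    survivors = [r for r in probe_rows if dynamic_filter(r)]
    skipped = len(probe_rows) - len(survivors)
    joined = [{**b, **p} for p in survivors for b in table[p[key]]]
    return joined, skipped
```

For example, `sum_by_key([3, 1, 3, 2], [10, 1, 5, 2])` takes the array path because the keys span only 1..3, while keys such as `[0, 10**6]` fall back to the dict. The dynamic filter pays off when the build side is selective, since most probe rows are discarded before any hash lookup.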
Presto 2.0 blog: https://prestodb.io/blog/2024/06/24/diving-into-the-presto-native-c-query-engine-presto-2-0/