
Benchmarks to compare Spark SQL API and Spark raw SQL #11

Open
arnaudframmery opened this issue Jan 5, 2022 · 6 comments


arnaudframmery commented Jan 5, 2022

The benchmarks are done with JMH, with 3 warmup iterations followed by 10 measurement iterations in single-shot time mode.

The dataset used is this one: https://www.kaggle.com/sobhanmoosavi/us-accidents
It is about 570 MB and 1.5 million rows.

The hardware is, for now, a laptop with a Ryzen 4500H and 20 GB of RAM.
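
For context, a minimal sketch of what such a harness could look like (assuming the sbt-jmh plugin and a local copy of the CSV; the class name, path, and placeholder queries are hypothetical, not the repository's actual code):

```scala
import java.util.concurrent.TimeUnit

import org.apache.spark.sql.SparkSession
import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.SingleShotTime))
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 3)
@Measurement(iterations = 10)
class ListQueryBenchmark {

  var spark: SparkSession = _

  @Setup
  def setUp(): Unit = {
    spark = SparkSession.builder().master("local[*]").appName("benchmarks").getOrCreate()
    // Hypothetical path to the Kaggle US Accidents CSV (~570 MB, ~1.5 million rows).
    spark.read
      .option("header", "true")
      .csv("data/US_Accidents.csv")
      .createOrReplaceTempView("us_accidents")
  }

  @Benchmark
  def rawSql(): Unit =
    spark.sql("SELECT * FROM us_accidents").count() // placeholder query; count() forces execution

  @Benchmark
  def sqlApi(): Unit =
    spark.table("us_accidents").count() // placeholder query; count() forces execution

  @TearDown
  def tearDown(): Unit = spark.stop()
}
```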

List query with limit and offset (limit = 100 000, offset = 0)

| Method | Time |
| --- | --- |
| SQL API | 3469.058 ± 24.771 ms |
| raw SQL | 302.646 ± 9.981 ms |

List query with limit and offset (limit = 100 000, offset = 100 000)

| Method | Time |
| --- | --- |
| SQL API | 2992.163 ± 19.237 ms |
| raw SQL | 521.539 ± 15.745 ms |

List query with limit and offset (limit = -1, offset = 100 000)

| Method | Time |
| --- | --- |
| SQL API | 4103.849 ± 123.673 ms |
| raw SQL | 4084.750 ± 72.480 ms |

List query with conditions

| Method | Time |
| --- | --- |
| SQL API | 9688.635 ± 64.259 ms |
| raw SQL | 4376.306 ± 65.468 ms |

Aggregation query + show

| Method | Time |
| --- | --- |
| SQL API | 9527.319 ± 82.806 ms |
| raw SQL | 3556.624 ± 36.587 ms |
arnaudframmery commented:

List query: comparison as a function of the limit value

We can notice a strange behaviour depending on whether limit equals -1 or not. This is explained by the fact that, depending on the limit value, our code does not go through the same SQL API calls: we use between() when limit != -1 and geq() otherwise.
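
For illustration, the two code paths could look roughly like this (a sketch; the monotonically increasing id column used for pagination and the view name are assumptions):

```scala
import org.apache.spark.sql.functions.col

val offset = 100000L
val limit  = 100000L
val df     = spark.table("us_accidents")

// limit != -1: keep the rows whose index falls in [offset, offset + limit - 1]
val page = df.where(col("id").between(offset, offset + limit - 1))

// limit == -1: keep everything from offset onwards
val tail = df.where(col("id").geq(offset))
```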

Here is the flamegraph of the SQL API call with limit = 100 000:
[Flamegraph: sql_api_limit_100000_offset_100000]

Here is the flamegraph of the SQL API call with limit = -1:
[Flamegraph: sql_api_limit_-1_offset_100000]

We can see that the function calls are not the same in these two graphs.

In contrast, the flamegraph of the raw SQL stays the same whatever the limit value:
[Flamegraph: raw_sql_limit_100000_offset_100000]

OPeyrusse commented:

You could also try to access the Spark UI to look at the Spark query plans. It can help in understanding the plans and what the differences are.
I found this doc about the Spark UI: https://spark.apache.org/docs/3.0.0-preview/web-ui.html but I haven't found a way to access it from your tests yet.
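
As a complementary, lighter-weight option (standard Spark, not specific to this repository), the plans can also be printed directly from the tests with Dataset.explain; the query and id column here are hypothetical:

```scala
import org.apache.spark.sql.functions.col

// Prints the parsed, analyzed, optimized, and physical plans to stdout.
spark.sql("SELECT * FROM us_accidents WHERE id >= 100000").explain(extended = true)

// The same works for the DataFrame / SQL API flavour of the query.
spark.table("us_accidents").where(col("id").geq(100000L)).explain(extended = true)
```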

arnaudframmery commented Jan 12, 2022

With the Databricks cluster:

List query with limit and offset (limit = 100 000, offset = 0)

| Method | Time |
| --- | --- |
| SQL API | 833.819 ± 151.347 ms |
| raw SQL | 956.818 ± 203.762 ms |

List query with limit and offset (limit = 100 000, offset = 100 000)

| Method | Time |
| --- | --- |
| SQL API | 833.815 ± 165.537 ms |
| raw SQL | 1759.694 ± 924.455 ms |

List query with limit and offset (limit = -1, offset = 100 000)

| Method | Time |
| --- | --- |
| SQL API | 2745.061 ± 1392.481 ms |
| raw SQL | 5848.527 ± 5638.261 ms |

Take into consideration that the error margins are quite large!
A better idea would be to take the times provided by Spark UI, which cover the query execution only.
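
One way to capture execution-only timings programmatically, instead of reading them off the UI, would be Spark's standard QueryExecutionListener; a sketch (the logging is an assumption):

```scala
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

spark.listenerManager.register(new QueryExecutionListener {
  // Called once per action with the execution time of the query, in nanoseconds.
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"$funcName finished in ${durationNs / 1e6} ms")

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"$funcName failed: ${exception.getMessage}")
})
```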

arnaudframmery commented:

When we look at the times given by Spark UI (on the Databricks cluster):

List query with limit and offset (limit = 100 000, offset = 0)

| Method | Time |
| --- | --- |
| SQL API | 0.45 s |
| raw SQL | 0.31 s |

List query with limit and offset (limit = 100 000, offset = 100 000)

| Method | Time |
| --- | --- |
| SQL API | 0.47 s |
| raw SQL | 0.22 s |

List query with limit and offset (limit = -1, offset = 100 000)

| Method | Time |
| --- | --- |
| SQL API | 0.5 s |
| raw SQL | 0.5 s |

arnaudframmery commented:

Here are the query plans generated by Spark

SQL API (same for all sets of parameters)

[Query plan: Databricks Shell - Details for Query 25]

raw SQL (limit = 100 000, offset = 0)

[Query plan: Databricks Shell - Details for Query 105]

raw SQL (limit = 100 000, offset = 100 000)

[Query plan: Databricks Shell - Details for Query 131]

raw SQL (limit = -1, offset = 100 000)

[Query plan: Databricks Shell - Details for Query 153]

arnaudframmery commented Jan 26, 2022

Here are the new values for the implementation of conditions and aggregations with Scala operators.

Running on the Databricks cluster (1 node or 3 nodes); the figures are given by Spark UI.

  • 1 node = 4 cores & 14 GB
  • 3 warmup iterations and 10 measurement iterations

The query is the following:

```sql
SELECT Severity, COUNT(*) AS severity_count FROM us_accidents WHERE Crossing = "true" GROUP BY Severity
```
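
For comparison, the SQL API counterpart of this query with Scala operators could look like this (a sketch reusing the us_accidents view; not necessarily the repository's exact code):

```scala
import org.apache.spark.sql.functions.{col, count}

val severityCounts = spark.table("us_accidents")
  .where(col("Crossing") === "true")
  .groupBy(col("Severity"))
  .agg(count("*").as("severity_count"))

severityCounts.show()
```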

Dataset with 1.5 million rows (47 columns)

| Method | 1 node | 3 nodes | Speed-up |
| --- | --- | --- | --- |
| SQL API | 0.57 s | 0.51 s | 1.12 |
| raw SQL | 0.56 s | 0.48 s | 1.17 |

Dataset with 15 million rows (47 columns)

| Method | 1 node | 3 nodes | Speed-up |
| --- | --- | --- | --- |
| SQL API | 5 s | 3 s | 1.67 |
| raw SQL | 5 s | 2.7 s | 1.85 |

Dataset with 150 million rows (47 columns)

| Method | 1 node | 3 nodes | Speed-up |
| --- | --- | --- | --- |
| SQL API | 13.9 s | 5 s | 2.78 |
| raw SQL | 14 s | 4.8 s | 2.92 |
