
Benchmarks to compare Spark SQL API and Spark raw SQL #11

Open
arnaudframmery opened this issue Jan 5, 2022 · 6 comments


arnaudframmery commented Jan 5, 2022

The benchmarks are done with JMH, with 3 warmup iterations followed by 10 measurement iterations in single-shot time mode.

The dataset used is this one: https://www.kaggle.com/sobhanmoosavi/us-accidents
It is about 570 MB and 1.5 million rows.

The hardware is, for now, a laptop with a Ryzen 4500H and 20 GB of RAM.
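
For context, a minimal sketch of what such a harness could look like (assuming the sbt-jmh plugin and a local copy of the CSV; the class name, path, and placeholder queries are hypothetical, not the repository's actual code):

```scala
import java.util.concurrent.TimeUnit

import org.apache.spark.sql.SparkSession
import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.SingleShotTime))
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 3)
@Measurement(iterations = 10)
class ListQueryBenchmark {

  var spark: SparkSession = _

  @Setup
  def setUp(): Unit = {
    spark = SparkSession.builder().master("local[*]").appName("benchmarks").getOrCreate()
    // Hypothetical path to the Kaggle US Accidents CSV (~570 MB, ~1.5 million rows).
    spark.read
      .option("header", "true")
      .csv("data/US_Accidents.csv")
      .createOrReplaceTempView("us_accidents")
  }

  @Benchmark
  def rawSql(): Unit =
    spark.sql("SELECT * FROM us_accidents").count() // placeholder query; count() forces execution

  @Benchmark
  def sqlApi(): Unit =
    spark.table("us_accidents").count() // placeholder query; count() forces execution

  @TearDown
  def tearDown(): Unit = spark.stop()
}
```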

List query with limit and offset (limit = 100 000, offset = 0)

| Method | Time |
| --- | --- |
| SQL API | 3469.058 ± 24.771 ms |
| raw SQL | 302.646 ± 9.981 ms |

List query with limit and offset (limit = 100 000, offset = 100 000)

| Method | Time |
| --- | --- |
| SQL API | 2992.163 ± 19.237 ms |
| raw SQL | 521.539 ± 15.745 ms |

List query with limit and offset (limit = -1, offset = 100 000)

| Method | Time |
| --- | --- |
| SQL API | 4103.849 ± 123.673 ms |
| raw SQL | 4084.750 ± 72.480 ms |

List query with conditions

| Method | Time |
| --- | --- |
| SQL API | 9688.635 ± 64.259 ms |
| raw SQL | 4376.306 ± 65.468 ms |

Aggregation query + show

| Method | Time |
| --- | --- |
| SQL API | 9527.319 ± 82.806 ms |
| raw SQL | 3556.624 ± 36.587 ms |
arnaudframmery commented:

List query: comparison as a function of the limit value

We can notice a strange behaviour depending on whether limit equals -1 or not. This is explained by the fact that, depending on the limit value, our code does not go through the same SQL API calls: we use between() when limit != -1 and geq() otherwise.
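
For illustration, the two code paths could look roughly like this (a sketch; the monotonically increasing id column used for pagination and the view name are assumptions):

```scala
import org.apache.spark.sql.functions.col

val offset = 100000L
val limit  = 100000L
val df     = spark.table("us_accidents")

// limit != -1: keep the rows whose index falls in [offset, offset + limit - 1]
val page = df.where(col("id").between(offset, offset + limit - 1))

// limit == -1: keep everything from offset onwards
val tail = df.where(col("id").geq(offset))
```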

Here is the flamegraph of the SQL API call with limit = 100 000:
[Flamegraph: sql_api_limit_100000_offset_100000]

Here is the flamegraph of the SQL API call with limit = -1:
[Flamegraph: sql_api_limit_-1_offset_100000]

We can see that the function calls are not the same in these two graphs.

In contrast, the flamegraph of the raw SQL stays the same whatever the limit value:
[Flamegraph: raw_sql_limit_100000_offset_100000]

OPeyrusse commented:

You could also try to access the Spark UI to look at the Spark query plans. It can help in understanding the plans and what the differences are.
I found this doc about the Spark UI: https://spark.apache.org/docs/3.0.0-preview/web-ui.html but I haven't found a way to access it from your tests yet.
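
As a complementary, lighter-weight option (standard Spark, not specific to this repository), the plans can also be printed directly from the tests with Dataset.explain; the query and id column here are hypothetical:

```scala
import org.apache.spark.sql.functions.col

// Prints the parsed, analyzed, optimized, and physical plans to stdout.
spark.sql("SELECT * FROM us_accidents WHERE id >= 100000").explain(extended = true)

// The same works for the DataFrame / SQL API flavour of the query.
spark.table("us_accidents").where(col("id").geq(100000L)).explain(extended = true)
```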

arnaudframmery commented Jan 12, 2022

With the Databricks cluster:

List query with limit and offset (limit = 100 000, offset = 0)

| Method | Time |
| --- | --- |
| SQL API | 833.819 ± 151.347 ms |
| raw SQL | 956.818 ± 203.762 ms |

List query with limit and offset (limit = 100 000, offset = 100 000)

| Method | Time |
| --- | --- |
| SQL API | 833.815 ± 165.537 ms |
| raw SQL | 1759.694 ± 924.455 ms |

List query with limit and offset (limit = -1, offset = 100 000)

| Method | Time |
| --- | --- |
| SQL API | 2745.061 ± 1392.481 ms |
| raw SQL | 5848.527 ± 5638.261 ms |

Take into consideration that the error margins are quite large!
A better idea would be to take the times provided by Spark UI, which cover the query execution only.
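
One way to capture execution-only timings programmatically, instead of reading them off the UI, would be Spark's standard QueryExecutionListener; a sketch (the logging is an assumption):

```scala
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

spark.listenerManager.register(new QueryExecutionListener {
  // Called once per action with the execution time of the query, in nanoseconds.
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"$funcName finished in ${durationNs / 1e6} ms")

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"$funcName failed: ${exception.getMessage}")
})
```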

arnaudframmery commented:

When we look at the times given by Spark UI (on the Databricks cluster):

List query with limit and offset (limit = 100 000, offset = 0)

| Method | Time |
| --- | --- |
| SQL API | 0.45 s |
| raw SQL | 0.31 s |

List query with limit and offset (limit = 100 000, offset = 100 000)

| Method | Time |
| --- | --- |
| SQL API | 0.47 s |
| raw SQL | 0.22 s |

List query with limit and offset (limit = -1, offset = 100 000)

| Method | Time |
| --- | --- |
| SQL API | 0.5 s |
| raw SQL | 0.5 s |

arnaudframmery commented:

Here are the query plans generated by Spark

SQL API (same for all sets of parameters)

[Query plan: Databricks Shell - Details for Query 25]

raw SQL (limit = 100 000, offset = 0)

[Query plan: Databricks Shell - Details for Query 105]

raw SQL (limit = 100 000, offset = 100 000)

[Query plan: Databricks Shell - Details for Query 131]

raw SQL (limit = -1, offset = 100 000)

[Query plan: Databricks Shell - Details for Query 153]

arnaudframmery commented Jan 26, 2022

Here are the new values for the implementation of conditions and aggregations with Scala operators.

Running on the Databricks cluster (1 node or 3 nodes); the figures are given by Spark UI.

  • 1 node = 4 cores & 14 GB
  • 3 warmup iterations and 10 measurement iterations

The query is the following:

```sql
SELECT Severity, COUNT(*) AS severity_count FROM us_accidents WHERE Crossing = "true" GROUP BY Severity
```
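
For comparison, the SQL API counterpart of this query with Scala operators could look like this (a sketch reusing the us_accidents view; not necessarily the repository's exact code):

```scala
import org.apache.spark.sql.functions.{col, count}

val severityCounts = spark.table("us_accidents")
  .where(col("Crossing") === "true")
  .groupBy(col("Severity"))
  .agg(count("*").as("severity_count"))

severityCounts.show()
```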

Dataset with 1.5 million rows (47 columns)

| Method | 1 node | 3 nodes | Speed-up |
| --- | --- | --- | --- |
| SQL API | 0.57 s | 0.51 s | 1.12 |
| raw SQL | 0.56 s | 0.48 s | 1.17 |

Dataset with 15 million rows (47 columns)

| Method | 1 node | 3 nodes | Speed-up |
| --- | --- | --- | --- |
| SQL API | 5 s | 3 s | 1.67 |
| raw SQL | 5 s | 2.7 s | 1.85 |

Dataset with 150 million rows (47 columns)

| Method | 1 node | 3 nodes | Speed-up |
| --- | --- | --- | --- |
| SQL API | 13.9 s | 5 s | 2.78 |
| raw SQL | 14 s | 4.8 s | 2.92 |
