Benchmarks to compare Spark SQL API and Spark raw SQL #11
Comments
You could also open the Spark UI to look at the Spark query plans. Comparing the plans can help explain where the differences come from.
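The plans shown in the Spark UI can also be printed from code with `Dataset.explain`. Here is a minimal sketch, assuming a local session and a tiny hypothetical stand-in for the accidents dataset; the column names `State` and `Severity`, the view name `accidents` and the helper `physicalPlan` are only for illustration and are not taken from the benchmark code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ComparePlans {
  // Returns the physical plan Spark has prepared for this DataFrame, as text.
  def physicalPlan(df: DataFrame): String =
    df.queryExecution.executedPlan.toString

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compare-plans")
      .master("local[*]") // assumption: local run; not needed on a cluster
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in data; the real benchmark uses the US accidents dataset.
    val accidents = Seq(("CA", 3), ("TX", 2), ("CA", 1)).toDF("State", "Severity")
    accidents.createOrReplaceTempView("accidents")

    // The same query, once with the Dataset API and once as raw SQL.
    val viaApi = accidents.filter(col("Severity") > 1).groupBy("State").count()
    val viaSql = spark.sql(
      "SELECT State, count(*) AS count FROM accidents WHERE Severity > 1 GROUP BY State")

    // explain(true) prints the parsed, analyzed, optimized and physical plans.
    viaApi.explain(true)
    viaSql.explain(true)

    // The physical plans can also be captured as strings and compared by eye.
    println(physicalPlan(viaApi))
    println(physicalPlan(viaSql))

    spark.stop()
  }
}
```

The two plan strings will usually differ in their expression IDs even when the plans are equivalent, so it is more useful to compare the operators than to test the strings for equality.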
With the Databricks cluster:
List query with limit and offset (limit = 100 000, offset = 0)
List query with limit and offset (limit = 100 000, offset = 100 000)
List query with limit and offset (limit = -1, offset = 100 000)
Take into consideration that the error margins are quite large!
Looking at the times given by the Spark UI (on the Databricks cluster):
List query with limit and offset (limit = 100 000, offset = 0)
List query with limit and offset (limit = 100 000, offset = 100 000)
List query with limit and offset (limit = -1, offset = 100 000)
Here are the new values for the implementation of conditions and aggregations with Scala operators. The queries run on the Databricks cluster (1 node or 3 nodes) and the figures are given by the Spark UI.
The query looks like the following:
Dataset with 1.5 million lines (47 columns)
Dataset with 15 million lines (47 columns)
Dataset with 150 million lines (47 columns)
The benchmarks are run with JMH, using 3 warmup iterations followed by 10 measurement iterations in single-shot time mode.
The dataset used is this one: https://www.kaggle.com/sobhanmoosavi/us-accidents
It is about 570 MB and 1.5 million lines.
The hardware is, for now, a laptop with a Ryzen 4500H and 20 GB of RAM.
List query with limit and offset (limit = 100 000, offset = 0)
List query with limit and offset (limit = 100 000, offset = 100 000)
List query with limit and offset (limit = -1, offset = 100 000)
List query with conditions
Aggregation query + show
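For reference, the JMH setup described above (3 warmup iterations, 10 measurement iterations, single-shot time mode) could be declared roughly as follows. This is only a sketch under assumptions: the class name, the method names, the CSV path, the view name `accidents` and the single fork are hypothetical, and only the offset = 0 list query is shown, since how the offset is implemented is not stated here.

```scala
import java.util.concurrent.TimeUnit

import org.apache.spark.sql.SparkSession
import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.SingleShotTime)) // each iteration times one invocation
@Warmup(iterations = 3)
@Measurement(iterations = 10)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(1) // assumption: a single forked JVM
class ListQueryBenchmark {
  var spark: SparkSession = _

  @Setup(Level.Trial)
  def setUp(): Unit = {
    spark = SparkSession.builder()
      .appName("spark-sql-benchmarks")
      .master("local[*]") // assumption: local run on the laptop
      .getOrCreate()
    // Hypothetical path to the downloaded Kaggle CSV file.
    spark.read.option("header", "true").csv("US_Accidents.csv")
      .createOrReplaceTempView("accidents")
  }

  @TearDown(Level.Trial)
  def tearDown(): Unit = spark.stop()

  // List query through the Dataset API (limit = 100 000, offset = 0).
  @Benchmark
  def listWithApi(): Int =
    spark.table("accidents").limit(100000).collect().length

  // The same list query written as raw SQL.
  @Benchmark
  def listWithRawSql(): Int =
    spark.sql("SELECT * FROM accidents LIMIT 100000").collect().length
}
```

Returning the row count from each benchmark method gives JMH a result to consume, and calling `collect()` forces Spark to actually execute the query instead of only building the plan.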