Fixes spark-home param, updates docs accordingly
This commit fixes a bug in which spark-home was not being picked up from the configuration file, forcing users to set the environment variable SPARK_HOME. The docs are also updated to include a guide to the parameters in spark-submit-config, including spark-home.
Showing 10 changed files with 195 additions and 12 deletions.
---
layout: page
title: Spark-Submit Configuration
---

Under the hood, Spark-Bench converts users' configuration files into a series of spark-submit scripts. The spark-submit-config section of the configuration file allows users to change the parameters of those spark-submits. The `class` and `jar` parameters are set by the spark-bench infrastructure.
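As a rough sketch of that conversion, a section like the following (the installation path is hypothetical):

```hocon
spark-bench = {
  spark-submit-config = [{
    spark-home = "/opt/spark"   // hypothetical installation path
    spark-args = {
      master = "local[2]"
    }
    workload-suites = []        // suites omitted for brevity
  }]
}
```

would be launched with something along the lines of `/opt/spark/bin/spark-submit --master local[2] ...`, with the `--class` and jar arguments filled in by spark-bench itself.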
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [Parameters](#parameters)
- [spark-home](#spark-home)
- [spark-args](#spark-args)
- [conf](#conf)
- [suites-parallel](#suites-parallel)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->
## Parameters

| Name | Required | Description |
| ---- | -------- | ----------- |
| spark-home | no | Path to the top level of your Spark installation |
| spark-args | no | Includes master, executor-memory, and other spark-submit arguments |
| conf | no | A series of configuration options for Spark |
| suites-parallel | no | Whether the workload-suites within this spark-submit should run serially or in parallel. Defaults to `false`. |

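Putting the table together, a spark-submit-config entry using all four parameters might look like this sketch (all values are illustrative, not recommendations):

```hocon
spark-submit-config = [{
  spark-home = "/opt/spark"   // hypothetical path; $SPARK_HOME is used if this is omitted
  spark-args = {
    master = "yarn"
  }
  conf = {
    "spark.dynamicAllocation.enabled" = "false"
  }
  suites-parallel = false
  workload-suites = []        // suites omitted for brevity
}]
```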
## spark-home

`spark-home` points to the top-level directory of your Spark installation.

This path must be set either by the `spark-home` parameter in the configuration file or by the environment variable `SPARK_HOME`. If both are set, the configuration file wins. This allows users to benchmark more than one installation of Spark within one configuration file.

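For example, benchmarking two different Spark installations can be sketched as two spark-submit-config entries (both paths are hypothetical):

```hocon
spark-bench = {
  spark-submit-config = [
    {
      spark-home = "/opt/spark-2.1.1"   // hypothetical first installation
      workload-suites = []              // suites omitted for brevity
    },
    {
      spark-home = "/opt/spark-2.2.0"   // hypothetical second installation
      workload-suites = []
    }
  ]
}
```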
## spark-args

`spark-args` contains a series of key-value pairs that reflect arguments users would normally set in their spark-submit scripts.

To read more about these options in the context of Spark, see the official documentation: <https://spark.apache.org/docs/latest/submitting-applications.html>

Probably the most important of these is `master`. Just like in a spark-submit script, users can set master to local, to an IP address and port for a standalone cluster, or to yarn or mesos.

```hocon
spark-args = {
  master = "local[4]" // or local[*], local[2], etc.
}
```
```hocon
spark-args = {
  master = "spark://207.184.161.138:7077" // standalone
}
```
```hocon
spark-args = {
  master = "yarn"
  deploy-mode = "cluster"
}
```
```hocon
spark-args = {
  master = "mesos://207.184.161.138:7077"
}
```

`master` is the only spark-arg that can also be set in an environment variable. If `SPARK_MASTER_HOST` and `spark-args = { master = ... }` are both set, the configuration file option wins.

Other spark-args include `deploy-mode`, `executor-memory`, `num-executors`, and `total-executor-cores`, among others.

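Putting several of these together, a spark-args block might look like the following sketch (the resource sizes are hypothetical, not tuning advice):

```hocon
spark-args = {
  master = "yarn"
  deploy-mode = "cluster"
  executor-memory = "4G"   // hypothetical sizing
  num-executors = 10       // hypothetical sizing
}
```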
## conf

The `conf` parameter contains a series of pairs of strings representing configuration options for Spark. These are the settings that would otherwise be passed to spark-submit with the `--conf` option.

Example:
```hocon
conf = {
  "spark.dynamicAllocation.enabled" = "false"
  "spark.shuffle.service.enabled" = "false"
}
```

Notice that `conf` is a SERIES of options. To provide more than one set of conf options, users make a list of objects like so:
```hocon
conf = [
  {
    "spark.dynamicAllocation.enabled" = "false"
    "spark.shuffle.service.enabled" = "false"
  },
  {
    "spark.dynamicAllocation.enabled" = "true"
    "spark.shuffle.service.enabled" = "true"
  }
]
```
This will create two spark-submit scripts that have the same workload-suites and other parameters, but the first will have `"spark.dynamicAllocation.enabled" = "false"` and `"spark.shuffle.service.enabled" = "false"`, and the second will have `"spark.dynamicAllocation.enabled" = "true"` and `"spark.shuffle.service.enabled" = "true"`.

## suites-parallel

`suites-parallel` controls whether the workload-suites within this spark-submit run serially or in parallel. This option defaults to `false`, meaning the suites will run serially.
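
Running two suites at the same time can be sketched as follows (the suite contents are placeholders, not a working benchmark):

```hocon
spark-submit-config = [{
  suites-parallel = true   // both suites below launch at the same time
  workload-suites = [
    {
      descr = "first suite"    // hypothetical suite
      workloads = []
    },
    {
      descr = "second suite"   // hypothetical suite
      workloads = []
    }
  ]
}]
```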
---
layout: page
title: User's Guide
permalink: /users-guide/
---

<ul>
  {% for page in site.users-guide %}
    <li>
      <h3>
        <a class="page-link" href="{{ page.url | relative_url }}">{{ page.title | escape }}</a>
      </h3>
    </li>
  {% endfor %}
</ul>
spark-launch/src/test/resources/etc/specific-spark-home.conf (45 additions, 0 deletions)
```hocon
spark-bench = {
  spark-submit-config = [{
    spark-home = "/usr/iop/current/spark2-client/"
    spark-args = {
      master = "yarn"
    }
    suites-parallel = false
    workload-suites = [
      {
        descr = "Generate a dataset, then take that same dataset and write it out to Parquet format"
        benchmark-output = "/home/dev-user/emily/results-data-gen.csv"
        // We need to generate the dataset first through the data generator, then we take that dataset and convert it to Parquet.
        parallel = false
        workloads = [
          {
            name = "data-generation-kmeans"
            rows = 100000000
            cols = 24
            output = "hdfs:///tmp/kmeans-data.csv"
          },
          {
            name = "sql"
            queryStr = "select * from input"
            input = "hdfs:///tmp/kmeans-data.csv"
            output = "hdfs:///tmp/kmeans-data.parquet"
          }
        ]
      },
      {
        descr = "Run two different SQL queries over the dataset in two different formats"
        benchmark-output = "/home/dev-user/emily/results-sql.csv"
        parallel = false
        repeat = 10
        workloads = [
          {
            name = "sql"
            input = ["hdfs:///tmp/kmeans-data.csv", "hdfs:///tmp/kmeans-data.parquet"]
            queryStr = ["select * from input", "select `0`, `22` from input where `0` < -0.9"]
            cache = false
          }
        ]
      }
    ]
  }]
}
```