diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 67b05ecf7a858..e5af5ae4561c7 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -99,6 +99,7 @@
+
+
+`spark-submit` can be directly used to submit a Spark application to a Kubernetes cluster.
+The submission mechanism works as follows:
+
+* Spark creates a Spark driver running within a [Kubernetes pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/).
+* The driver creates executors, which also run within Kubernetes pods, connects to them, and executes application code.
+* When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists
+logs and remains in "completed" state in the Kubernetes API until it's eventually garbage collected or manually cleaned up.
+
+Note that in the completed state, the driver pod does *not* use any computational or memory resources.
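+
+For example, a minimal sketch for inspecting and manually cleaning up a finished driver pod with `kubectl` (the
+`spark-role` label used below is assumed to be one of the bookkeeping labels Spark adds; pod names depend on your
+application):
+
+```bash
+# List driver pods, including ones in the completed state
+kubectl get pods -l spark-role=driver
+
+# Read a completed driver's logs, then delete the pod to remove its record
+kubectl logs <driver-pod-name>
+kubectl delete pod <driver-pod-name>
+```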
+
+The driver and executor pod scheduling is handled by Kubernetes. It will be possible to affect Kubernetes scheduling
+decisions for driver and executor pods using advanced primitives like
+[node selectors](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector)
+and [node/pod affinities](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity)
+in a future release.
+
+# Submitting Applications to Kubernetes
+
+## Docker Images
+
+Kubernetes requires users to supply images that can be deployed into containers within pods. The images are built to
+be run in a container runtime environment that Kubernetes supports. Docker is a container runtime environment that is
+frequently used with Kubernetes. With Spark 2.3, there are Dockerfiles provided in the runnable distribution that can be customized
+and built for your usage.
+
+You may build these Docker images from sources. There is a script, `sbin/build-push-docker-images.sh`, that you can
+use to build and push customized Spark distribution images consisting of all the above components.
+
+Example usage is:
+
+    ./sbin/build-push-docker-images.sh -r <repo> -t my-tag build
+    ./sbin/build-push-docker-images.sh -r <repo> -t my-tag push
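+
+To launch an application such as Spark Pi in cluster mode, a submission looks roughly like the following sketch (the
+API server address, the container images, and the jar path inside the image are placeholders to substitute for your
+environment):
+
+```bash
+bin/spark-submit \
+    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
+    --deploy-mode cluster \
+    --name spark-pi \
+    --class org.apache.spark.examples.SparkPi \
+    --conf spark.executor.instances=5 \
+    --conf spark.kubernetes.driver.container.image=<driver-image> \
+    --conf spark.kubernetes.executor.container.image=<executor-image> \
+    local:///path/to/examples.jar
+```
+
+If you have a Kubernetes cluster set up, one way to discover the apiserver URL is by executing `kubectl cluster-info`;
+suppose it reports that the master is running at `http://127.0.0.1:6443`.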
+That specific Kubernetes cluster can be used with spark-submit by specifying
+`--master k8s://http://127.0.0.1:6443` as an argument to spark-submit. Additionally, it is also possible to use the
+authenticating proxy, `kubectl proxy`, to communicate to the Kubernetes API.
+
+The local proxy can be started by:
+
+```bash
+kubectl proxy
+```
+
+If the local proxy is running at localhost:8001, `--master k8s://http://127.0.0.1:8001` can be used as the argument to
+spark-submit. Finally, notice that in the above example we specify a jar with a URI whose scheme is `local://`.
+This URI is the location of the example jar that is already in the Docker image.
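+
+Putting the proxy and submission together, a sketch (the `127.0.0.1:8001` address is the default `kubectl proxy`
+listen address; images and jar path are placeholders as before):
+
+```bash
+# Start the authenticating proxy in the background
+kubectl proxy &
+
+# Point spark-submit at the proxy instead of the API server directly
+bin/spark-submit \
+    --master k8s://http://127.0.0.1:8001 \
+    --deploy-mode cluster \
+    --name spark-pi \
+    --class org.apache.spark.examples.SparkPi \
+    --conf spark.kubernetes.driver.container.image=<driver-image> \
+    --conf spark.kubernetes.executor.container.image=<executor-image> \
+    local:///path/to/examples.jar
+```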
+
+## Dependency Management
+
+If your application's dependencies are all hosted in remote locations like HDFS or HTTP servers, they may be referred to
+by their appropriate remote URIs. Also, application dependencies can be pre-mounted into custom-built Docker images.
+Those dependencies can be added to the classpath by referencing them with `local://` URIs and/or setting the
+`SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles.
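+
+For example, a sketch mixing remote and pre-mounted dependencies (all URIs and paths here are illustrative, not
+locations that exist in the stock images):
+
+```bash
+bin/spark-submit \
+    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
+    --deploy-mode cluster \
+    --jars hdfs://namenode:8020/libs/my-dep.jar,local:///opt/spark/extra/pre-mounted-dep.jar \
+    --files https://example.com/configs/app.properties \
+    --class com.example.MyApp \
+    local:///opt/spark/app/my-app.jar
+```
+
+Jars referenced with `local://` URIs must already exist at those paths inside the driver and executor images.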
+
+## Introspection and Debugging
+
+These are the different ways in which you can investigate a running/completed Spark application, monitor progress, and
+take actions.
+
+### Accessing Logs
+
+Logs can be accessed using the Kubernetes API and the `kubectl` CLI. When a Spark application is running, it's possible
+to stream logs from the application using:
+
+```bash
+kubectl -n=<namespace> logs -f <driver-pod-name>
+```
+
+The same logs can also be accessed through the Kubernetes dashboard if installed on the cluster.
+
+# Configuration
+
+See the [configuration page](configuration.html) for information on Spark configurations. The following
+configurations are specific to Spark on Kubernetes.
+
+#### Spark Properties
+
+<table class="table">
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+<tr>
+  <td><code>spark.kubernetes.namespace</code></td>
+  <td><code>default</code></td>
+  <td>
+    The namespace that will be used for running the driver and executor pods.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.driver.container.image</code></td>
+  <td>(none)</td>
+  <td>
+    Container image to use for the driver. This is usually of the form
+    <code>example.com/repo/spark-driver:v1.0.0</code>. This configuration is required and must be provided by the user.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.executor.container.image</code></td>
+  <td>(none)</td>
+  <td>
+    Container image to use for the executors. This is usually of the form
+    <code>example.com/repo/spark-executor:v1.0.0</code>. This configuration is required and must be provided by the user.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.container.image.pullPolicy</code></td>
+  <td><code>IfNotPresent</code></td>
+  <td>
+    Container image pull policy used when pulling images within Kubernetes.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.allocation.batch.size</code></td>
+  <td><code>5</code></td>
+  <td>
+    Number of pods to launch at once in each round of executor pod allocation.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.allocation.batch.delay</code></td>
+  <td><code>1s</code></td>
+  <td>
+    Time to wait between each round of executor pod allocation. Specifying values less than 1 second may lead to
+    excessive CPU usage on the Spark driver.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.authenticate.submission.caCertFile</code></td>
+  <td>(none)</td>
+  <td>
+    Path to the CA cert file for connecting to the Kubernetes API server over TLS when starting the driver. This file
+    must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not provide
+    a scheme).
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.authenticate.submission.clientKeyFile</code></td>
+  <td>(none)</td>
+  <td>
+    Path to the client key file for authenticating against the Kubernetes API server when starting the driver. This
+    file must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not
+    provide a scheme).
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.authenticate.submission.clientCertFile</code></td>
+  <td>(none)</td>
+  <td>
+    Path to the client cert file for authenticating against the Kubernetes API server when starting the driver. This
+    file must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not
+    provide a scheme).
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.authenticate.submission.oauthToken</code></td>
+  <td>(none)</td>
+  <td>
+    OAuth token to use when authenticating against the Kubernetes API server when starting the driver. Note that
+    unlike the other authentication options, this is expected to be the exact string value of the token to use for
+    the authentication.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.authenticate.submission.oauthTokenFile</code></td>
+  <td>(none)</td>
+  <td>
+    Path to the OAuth token file containing the token to use when authenticating against the Kubernetes API server
+    when starting the driver. This file must be located on the submitting machine's disk. Specify this as a path as
+    opposed to a URI (i.e. do not provide a scheme).
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.authenticate.driver.caCertFile</code></td>
+  <td>(none)</td>
+  <td>
+    Path to the CA cert file for connecting to the Kubernetes API server over TLS from the driver pod when requesting
+    executors. This file must be located on the submitting machine's disk, and will be uploaded to the driver pod.
+    Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.authenticate.driver.clientKeyFile</code></td>
+  <td>(none)</td>
+  <td>
+    Path to the client key file for authenticating against the Kubernetes API server from the driver pod when
+    requesting executors. This file must be located on the submitting machine's disk, and will be uploaded to the
+    driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme). If this is specified, it is
+    highly recommended to set up TLS for the driver submission server, as this value is sensitive information that
+    would be passed to the driver pod in plaintext otherwise.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.authenticate.driver.clientCertFile</code></td>
+  <td>(none)</td>
+  <td>
+    Path to the client cert file for authenticating against the Kubernetes API server from the driver pod when
+    requesting executors. This file must be located on the submitting machine's disk, and will be uploaded to the
+    driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.authenticate.driver.oauthToken</code></td>
+  <td>(none)</td>
+  <td>
+    OAuth token to use when authenticating against the Kubernetes API server from the driver pod when requesting
+    executors. Note that unlike the other authentication options, this must be the exact string value of the token to
+    use for the authentication. This token value is uploaded to the driver pod. If this is specified, it is highly
+    recommended to set up TLS for the driver submission server, as this value is sensitive information that would be
+    passed to the driver pod in plaintext otherwise.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.authenticate.driver.oauthTokenFile</code></td>
+  <td>(none)</td>
+  <td>
+    Path to the OAuth token file containing the token to use when authenticating against the Kubernetes API server
+    from the driver pod when requesting executors. Note that unlike the other authentication options, this file must
+    contain the exact string value of the token to use for the authentication. This token value is uploaded to the
+    driver pod. If this is specified, it is highly recommended to set up TLS for the driver submission server, as this
+    value is sensitive information that would be passed to the driver pod in plaintext otherwise.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.authenticate.driver.mounted.caCertFile</code></td>
+  <td>(none)</td>
+  <td>
+    Path to the CA cert file for connecting to the Kubernetes API server over TLS from the driver pod when requesting
+    executors. This path must be accessible from the driver pod. Specify this as a path as opposed to a URI (i.e. do
+    not provide a scheme).
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.authenticate.driver.mounted.clientKeyFile</code></td>
+  <td>(none)</td>
+  <td>
+    Path to the client key file for authenticating against the Kubernetes API server from the driver pod when
+    requesting executors. This path must be accessible from the driver pod. Specify this as a path as opposed to a URI
+    (i.e. do not provide a scheme).
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.authenticate.driver.mounted.clientCertFile</code></td>
+  <td>(none)</td>
+  <td>
+    Path to the client cert file for authenticating against the Kubernetes API server from the driver pod when
+    requesting executors. This path must be accessible from the driver pod. Specify this as a path as opposed to a URI
+    (i.e. do not provide a scheme).
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.authenticate.driver.mounted.oauthTokenFile</code></td>
+  <td>(none)</td>
+  <td>
+    Path to the file containing the OAuth token to use when authenticating against the Kubernetes API server from the
+    driver pod when requesting executors. This path must be accessible from the driver pod. Note that unlike the other
+    authentication options, this file must contain the exact string value of the token to use for the authentication.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.authenticate.driver.serviceAccountName</code></td>
+  <td><code>default</code></td>
+  <td>
+    Service account that is used when running the driver pod. The driver pod uses this service account when requesting
+    executor pods from the API server. Note that this cannot be specified alongside a CA cert file, client key file,
+    client cert file, and/or OAuth token.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.driver.label.[LabelName]</code></td>
+  <td>(none)</td>
+  <td>
+    Add the label specified by <code>LabelName</code> to the driver pod. For example,
+    <code>spark.kubernetes.driver.label.something=true</code>. Note that Spark also adds its own labels to the driver
+    pod for bookkeeping purposes.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.driver.annotation.[AnnotationName]</code></td>
+  <td>(none)</td>
+  <td>
+    Add the annotation specified by <code>AnnotationName</code> to the driver pod. For example,
+    <code>spark.kubernetes.driver.annotation.something=true</code>.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.executor.label.[LabelName]</code></td>
+  <td>(none)</td>
+  <td>
+    Add the label specified by <code>LabelName</code> to the executor pods. For example,
+    <code>spark.kubernetes.executor.label.something=true</code>. Note that Spark also adds its own labels to the
+    executor pods for bookkeeping purposes.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.executor.annotation.[AnnotationName]</code></td>
+  <td>(none)</td>
+  <td>
+    Add the annotation specified by <code>AnnotationName</code> to the executor pods. For example,
+    <code>spark.kubernetes.executor.annotation.something=true</code>.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.driver.pod.name</code></td>
+  <td>(none)</td>
+  <td>
+    Name of the driver pod. If not set, the driver pod name is set to <code>spark.app.name</code> suffixed by the
+    current timestamp to avoid name conflicts.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.executor.podNamePrefix</code></td>
+  <td>(none)</td>
+  <td>
+    Prefix for naming the executor pods. If not set, the executor pod name is set to the driver pod name suffixed by
+    an integer.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.executor.lostCheck.maxAttempts</code></td>
+  <td><code>10</code></td>
+  <td>
+    Number of times that the driver will try to ascertain the loss reason for a specific executor. The loss reason is
+    used to determine whether the executor failure is due to a framework or an application error, which in turn
+    decides whether the executor is removed and replaced, or placed into a failed state for debugging.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.submission.waitAppCompletion</code></td>
+  <td><code>true</code></td>
+  <td>
+    In cluster mode, whether to wait for the application to finish before exiting the launcher process. When changed
+    to false, the launcher has a "fire-and-forget" behavior when launching the Spark job.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.report.interval</code></td>
+  <td><code>1s</code></td>
+  <td>
+    Interval between reports of the current Spark job status in cluster mode.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.driver.limit.cores</code></td>
+  <td>(none)</td>
+  <td>
+    Specify the hard CPU <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container">limit</a>
+    for the driver pod.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.executor.limit.cores</code></td>
+  <td>(none)</td>
+  <td>
+    Specify the hard CPU <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container">limit</a>
+    for each executor pod launched for the Spark Application.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.node.selector.[labelKey]</code></td>
+  <td>(none)</td>
+  <td>
+    Adds to the node selector of the driver pod and executor pods, with key <code>labelKey</code> and the value as the
+    configuration's value. For example, setting <code>spark.kubernetes.node.selector.identifier</code> to
+    <code>myIdentifier</code> will result in the driver pod and executors having a node selector with key
+    <code>identifier</code> and value <code>myIdentifier</code>. Multiple node selector keys can be added by setting
+    multiple configurations with this prefix.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.driverEnv.[EnvironmentVariableName]</code></td>
+  <td>(none)</td>
+  <td>
+    Add the environment variable specified by <code>EnvironmentVariableName</code> to the driver process. The user can
+    specify multiple of these to set multiple environment variables.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.mountDependencies.jarsDownloadDir</code></td>
+  <td><code>/var/spark-data/spark-jars</code></td>
+  <td>
+    Location to download jars to in the driver and executors. This directory must be empty and will be mounted as an
+    empty directory volume on the driver and executor pods.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.mountDependencies.filesDownloadDir</code></td>
+  <td><code>/var/spark-data/spark-files</code></td>
+  <td>
+    Location to download files to in the driver and executors. This directory must be empty and will be mounted as an
+    empty directory volume on the driver and executor pods.
+  </td>
+</tr>
+</table>
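+
+As an illustration of several of these properties together (the namespace, service account, node selector value, and
+images below are examples, not defaults):
+
+```bash
+bin/spark-submit \
+    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
+    --deploy-mode cluster \
+    --conf spark.kubernetes.namespace=spark-jobs \
+    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
+    --conf spark.kubernetes.driver.container.image=<driver-image> \
+    --conf spark.kubernetes.executor.container.image=<executor-image> \
+    --conf spark.kubernetes.node.selector.disktype=ssd \
+    --class com.example.MyApp \
+    local:///opt/spark/app/my-app.jar
+```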
         <code>client</code> or <code>cluster</code> mode depending on the value of <code>--deploy-mode</code>.
         The cluster location will be found based on the <code>HADOOP_CONF_DIR</code> or <code>YARN_CONF_DIR</code> variable.
 </td></tr>
+<tr><td> <code>k8s://HOST:PORT</code> </td><td> Connect to a <a href="running-on-kubernetes.html">Kubernetes</a> cluster in
+        <code>cluster</code> mode. Client mode is currently unsupported and will be supported in future releases.
+        The <code>HOST</code> and <code>PORT</code> refer to the [Kubernetes API Server](https://kubernetes.io/docs/reference/generated/kube-apiserver/).
+        It connects using TLS by default. In order to force it to use an unsecured connection, you can use
+        <code>k8s://http://HOST:PORT</code>.
+</td></tr>