Spark on Kubernetes, based on the official Spark 2.4 documentation. Requirements:
- Make (GNU Make)
- Docker (17+)
- Kubernetes 1.8+
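If in doubt, a quick sanity check of these prerequisites (plus kubectl, which the GKE steps below rely on) might look like this:

```sh
# Verify the prerequisite tooling is available; kubectl is assumed to be
# installed and configured against the target cluster.
make --version
docker --version
kubectl version
```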
To build a base Docker image for launching Spark on Kubernetes, type:

```sh
make sparknetes-build spark-image
```
NOTE: This process may take several minutes (~20 min; under the hood there is a Maven packaging task running). Take a look at the Makefile to view default values and other variables.
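Since make accepts variable overrides on the command line, those defaults can be changed without editing the file. The variable name below is hypothetical; check the Makefile for the real ones:

```sh
# SPARK_VERSION is a hypothetical variable name, used only to illustrate
# make's standard command-line override mechanism.
make sparknetes-build spark-image SPARK_VERSION=2.4.0
```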
This Docker image is also available on Docker Hub under hypnosapos.
Examples will be tested on the GKE service; here you have instructions to create a Kubernetes cluster.
Once our Kubernetes cluster is ready (for instance, with the GKE_CLUSTER_NAME=spark variable exported), we have to run a minimal bootstrap operation:
```sh
export GKE_CLUSTER_NAME=spark
make gke-spark-bootstrap
```
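The official Spark on Kubernetes docs recommend giving the driver a service account with the edit role; a minimal sketch of that part of the bootstrap follows (the service account name spark matches the driver logs shown further down, though the actual gke-spark-bootstrap target may do more):

```sh
# Create a service account for Spark drivers and grant it enough RBAC
# permissions to create executor pods, per the Spark on Kubernetes docs.
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role \
  --clusterrole=edit \
  --serviceaccount=default:spark \
  --namespace=default
```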
As the picture above shows, spark-submit commands will be launched from a pod of a Kubernetes job.
The first example is the well-known SparkPi:

```sh
make spark-basic-example
```
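Under the hood, the job pod runs a spark-submit against the Kubernetes API server. A sketch of what that command looks like for SparkPi, following the official Spark 2.4 docs (the API server address and the exact examples jar path inside the image are assumptions):

```sh
# Illustrative spark-submit; <k8s-apiserver-host> and the jar path are
# assumptions, the rest follows the Spark 2.4 Kubernetes documentation.
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=hypnosapos/spark:2.4 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
```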
Logs of jobs may be tracked this way:

```sh
JOB_NAME=<job_name> make gke-job-logs
```
NOTE: <job_name> is the name of the example with the suffix '-job' instead of '-example' (i.e. "spark-basic-job" instead of "spark-basic-example").
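For the first example that would be:

```sh
JOB_NAME=spark-basic-job make gke-job-logs
# Equivalent raw kubectl: pods created by a Kubernetes job carry a
# job-name label added automatically by the Job controller.
kubectl logs -l job-name=spark-basic-job
```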
If it runs successfully, the spark-submit command should output something like this:
```
2018-05-27 14:00:16 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
	 pod name: spark-pi-63ba1a53bc663d728936c24c91fb339b-driver
	 namespace: default
	 labels: spark-app-selector -> spark-2a6817ac76a248ba8a9cef7f3b988d82, spark-role -> driver
	 pod uid: 4698a7b8-61b6-11e8-b653-42010a840124
	 creation time: 2018-05-27T14:00:13Z
	 service account name: spark
	 volumes: spark-token-92jw7
	 node name: gke-spark-default-pool-ba0e670d-w989
	 start time: 2018-05-27T14:00:13Z
	 container images: hypnosapos/spark:2.4
	 phase: Succeeded
2018-05-27 14:00:16 INFO LoggingPodStatusWatcherImpl:54 - Container final statuses:
	 Container name: spark-kubernetes-driver
	 Container image: hypnosapos/spark:2.4
	 Container state: Terminated
	 Exit code: 0
2018-05-27 14:00:16 INFO Client:54 - Application spark-pi finished.
```
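The driver pod can also be inspected directly with kubectl, using the labels and pod name visible in the output above:

```sh
# List Spark driver pods via the spark-role label shown in the logs above.
kubectl get pods -l spark-role=driver
# Full driver log for the run above (pod name taken from the output).
kubectl logs spark-pi-63ba1a53bc663d728936c24c91fb339b-driver
```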
The second example is a linear regression; let's launch the log watcher inline too:

```sh
JOB_NAME=spark-ml-job make spark-ml-example gke-job-logs
```
This example uses a remote dependency for the GCS connector and the GCP credentials to authenticate against the internal metadata server. We've used a private jar and class (provide your values directly in the Makefile, quoted by the marks < >), but essentially you only need to update your code to use the gs:// scheme instead of the typical hdfs:// one for data input/output.
```sh
JOB_NAME=spark-gcs-job make spark-gcs-example gke-job-logs
```
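A sketch of the kind of submit this example performs, with the GCS connector fetched as a remote dependency (the connector URL points at Google's public Hadoop 2 build; the class, jar, and bucket values are placeholders you must provide, just as in the Makefile):

```sh
# Illustrative only; replace every <> placeholder with your own values.
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:443 \
  --deploy-mode cluster \
  --name spark-gcs \
  --class <your.main.Class> \
  --jars https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=hypnosapos/spark:2.4 \
  local:///opt/spark/jars/<your-app>.jar \
  gs://<your-bucket>/input gs://<your-bucket>/output
```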
In order to view the driver UI through a public load balancer service:

```sh
export SPARK_APP_NAME=spark-gcs
make gke-spark-expose-ui
make gke-spark-open-ui
```
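Alternatively, while the driver pod is alive its UI can be reached without a public load balancer by port-forwarding, as described in the official docs:

```sh
# Forward the driver's Spark UI port (4040) to localhost:4040.
kubectl port-forward <driver-pod-name> 4040:4040
```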
A few months ago the Google community published the k8s-spark-operator, so it's time to check it out:

```sh
make gke-spark-operator-install
make gke-spark-operator-example
```
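With the operator, applications are described declaratively as SparkApplication resources instead of spark-submit invocations. A minimal sketch, assuming the sparkoperator.k8s.io API group of the GoogleCloudPlatform spark-on-k8s-operator (the apiVersion and available fields vary across operator releases):

```sh
# Hedged sketch of a SparkApplication manifest; check the installed
# operator's CRD version before using apiVersion/field names verbatim.
cat <<'EOF' | kubectl apply -f -
apiVersion: sparkoperator.k8s.io/v1beta1
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  type: Scala
  mode: cluster
  image: hypnosapos/spark:2.4
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
  sparkVersion: "2.4.0"
  driver:
    serviceAccount: spark
  executor:
    instances: 2
EOF
```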
Remove all Spark resources from the Kubernetes cluster:

```sh
make gke-spark-clean
```
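If anything is left behind, Spark pods can also be swept by label, reusing the spark-role labels visible in the driver logs above:

```sh
# Remove any leftover Spark driver and executor pods by label.
kubectl delete pods -l spark-role=driver
kubectl delete pods -l spark-role=executor
```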