-
Hi, I'm taking a look at your Spark Operator as a possible alternative to the old "GoogleCloud" one. We have existing images that include the Spark app binaries and deps in the images. What is the best way to get them working with this operator? I had a look/play and I found this example:

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-image
  namespace: default
spec:
  image: docker.stackable.tech/stackable/ny-tlc-report:0.1.0
  sparkImage:
    productVersion: 3.5.1
  mode: cluster
  mainApplicationFile: local:///stackable/spark/jobs/ny_tlc_report.py
  args:
    - "--input 's3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'"
  deps:
    requirements:
      - tabulate==0.8.9
  sparkConf:
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
  job:
    config:
      resources:
        cpu:
          min: "1"
          max: "1"
        memory:
          limit: "1Gi"
  driver:
    config:
      resources:
        cpu:
          min: "1"
          max: "1500m"
        memory:
          limit: "1Gi"
  executor:
    replicas: 3
    config:
      resources:
        cpu:
          min: "1"
          max: "4"
        memory:
          limit: "2Gi"

This makes it look like you can use one image for the spark-submit job and a different one for the driver and executors. So I gave a simple example a go:

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-pi-docker-hub-3.5.1
  namespace: spark-apps
spec:
  image: docker.io/spark:3.5.1
  sparkImage:
    productVersion: 3.5.1
  mode: cluster
  mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py
  driver:
    config:
      resources:
        cpu:
          min: "1"
          max: "2"
        memory:
          limit: "1Gi"
  executor:
    replicas: 1
    config:
      resources:
        cpu:
          min: "1"
          max: "2"
        memory:
          limit: "1Gi"
...

I was hoping it would just run the official Spark image's Pi app, but it fails on the driver.
Should something like this work, or do we have to use an image based on a Stackable image? We tend to include the Spark apps in the image rather than pulling them from S3. To use this operator, would we have to build off a Stackable base image if we still want to include our apps and deps in the images we run?
-
There are a couple of things to note here: first, the […]
-
The Stackable Spark image is built from this Dockerfile. The Spark operator assumes the structure given by this file.
This is not supported. The […]
Yes, this is actually the best practice.
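To illustrate that pattern, here is a minimal sketch (not from this thread) of a SparkApplication for an app baked into a custom image. The registry, image name, tag, and application path are hypothetical; the image is assumed to be built on top of the Stackable Spark image, with the app and its deps copied in at build time:

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: my-sparkapp-baked-in   # hypothetical name
  namespace: spark-apps
spec:
  # Hypothetical custom image, assumed to be built FROM the Stackable Spark
  # image with the application and its dependencies copied in at build time.
  image: my-registry.example.com/my-spark-app:1.0.0
  sparkImage:
    productVersion: 3.5.1
  mode: cluster
  # Path inside the custom image where the application file was copied,
  # mirroring the layout used by the ny-tlc-report example above.
  mainApplicationFile: local:///stackable/spark/jobs/my_app.py
  executor:
    replicas: 1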
-
So to summarise, the same image is used for job/driver/executor, but there are different ways of preparing this image:
We also have an issue for using HDFS in place of S3, but that has not been planned yet.
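For contrast with baking the app into the image, here is a minimal sketch of the S3 route mentioned above; the bucket and object key are hypothetical:

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: my-sparkapp-from-s3   # hypothetical name
spec:
  sparkImage:
    productVersion: 3.5.1
  mode: cluster
  # The application is fetched from object storage at runtime instead of
  # being baked into the image; bucket and key are hypothetical.
  mainApplicationFile: s3a://my-bucket/jobs/my_app.py
  sparkConf:
    # Anonymous access for illustration only; real buckets normally
    # need credentials.
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"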