-
Hi, I'm taking a look at your Spark Operator as a possible alternative to the old "GoogleCloud" one. We have existing images that include the Spark app binaries and deps in the images. What is the best way to get them working with this operator? I had a look/play and I found this example:

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-image
  namespace: default
spec:
  image: docker.stackable.tech/stackable/ny-tlc-report:0.1.0
  sparkImage:
    productVersion: 3.5.1
  mode: cluster
  mainApplicationFile: local:///stackable/spark/jobs/ny_tlc_report.py
  args:
    - "--input 's3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'"
  deps:
    requirements:
      - tabulate==0.8.9
  sparkConf:
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
  job:
    config:
      resources:
        cpu:
          min: "1"
          max: "1"
        memory:
          limit: "1Gi"
  driver:
    config:
      resources:
        cpu:
          min: "1"
          max: "1500m"
        memory:
          limit: "1Gi"
  executor:
    replicas: 3
    config:
      resources:
        cpu:
          min: "1"
          max: "4"
        memory:
          limit: "2Gi"

This makes it look like you can use one image for the spark-submit job and a different one for the driver and executors. So I gave a simple example a go:

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-pi-docker-hub-3.5.1
  namespace: spark-apps
spec:
  image: docker.io/spark:3.5.1
  sparkImage:
    productVersion: 3.5.1
  mode: cluster
  mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py
  driver:
    config:
      resources:
        cpu:
          min: "1"
          max: "2"
        memory:
          limit: "1Gi"
  executor:
    replicas: 1
    config:
      resources:
        cpu:
          min: "1"
          max: "2"
        memory:
          limit: "1Gi"
...

I was hoping it would just run the official Spark image's Pi app, but it fails on the driver.
Should something like this work, or do we have to use an image based on a Stackable image? We tend to include the Spark apps in the image rather than pulling them from S3. To use this operator, would we have to build off a Stackable base image if we still want to include our apps and deps in the images we run?
-
There are a couple of things to note here: first, the […]
-
The Stackable Spark image is built from this Dockerfile. The Spark operator assumes the structure given by this file.
This is not supported. The […]
Yes, this is actually the best practice.
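To illustrate that pattern, here is a minimal sketch (not from this thread) of a SparkApplication for an app baked into a custom image. The registry, image name, tag, and application path are hypothetical; the image is assumed to be built on top of the Stackable Spark image, with the app and its deps copied in at build time:

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: my-sparkapp-baked-in   # hypothetical name
  namespace: spark-apps
spec:
  # Hypothetical custom image, assumed to be built FROM the Stackable Spark
  # image with the application and its dependencies copied in at build time.
  image: my-registry.example.com/my-spark-app:1.0.0
  sparkImage:
    productVersion: 3.5.1
  mode: cluster
  # Path inside the custom image where the application file was copied,
  # mirroring the layout used by the ny-tlc-report example above.
  mainApplicationFile: local:///stackable/spark/jobs/my_app.py
  executor:
    replicas: 1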
-
So to summarise, the same image is used for job/driver/executor, but there are different ways of preparing this image:
We also have an issue for using HDFS in place of S3, but that has not been planned yet.
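For contrast with baking the app into the image, here is a minimal sketch of the S3 route mentioned above; the bucket and object key are hypothetical:

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: my-sparkapp-from-s3   # hypothetical name
spec:
  sparkImage:
    productVersion: 3.5.1
  mode: cluster
  # The application is fetched from object storage at runtime instead of
  # being baked into the image; bucket and key are hypothetical.
  mainApplicationFile: s3a://my-bucket/jobs/my_app.py
  sparkConf:
    # Anonymous access for illustration only; real buckets normally
    # need credentials.
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"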