[BUG] Spark Task tolerations not applied with PodTemplate #4378
Comments
FYI @andrewwdye. I'll test it later today.
I just reran the steps from the test plan at #4183 and can confirm that the pod template toleration is added to the pod.
What do you think is happening on my side, @andrewwdye @pingsutw? I'll add some code on top to help reproduce my issue.
@tionichm, how are you deploying Flyte? Are you using the official helm charts? Can you double-check which version you're using? Also, can you confirm which version of Spark you're using?
@eapolinario Sorry for only getting back to you now. Sure, I'll share what information I have. My setup:
Values file:

```yaml
configuration:
  database:
    password: *****
    host: *****
    dbname: flyte
    username: postgres
    port: 5432
  storage:
    metadataContainer: *****
    userDataContainer: *****
    provider: s3
    providerConfig:
      s3:
        region: af-south-1
        authType: iam
  auth:
    enabled: false
  inline:
    cluster_resources:
      customData:
        - production:
            - defaultIamRole:
                value: *****
        - staging:
            - defaultIamRole:
                value: *****
        - development:
            - defaultIamRole:
                value: *****
    flyteadmin:
      roleNameKey: "iam.amazonaws.com/role"
    plugins:
      k8s:
        inject-finalizer: true
        default-env-vars:
          - AWS_METADATA_SERVICE_TIMEOUT: 5
          - AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
      spark:
        spark-config-default:
          - spark.hadoop.fs.s3a.aws.credentials.provider: com.amazonaws.auth.DefaultAWSCredentialsProviderChain
          - spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2"
          - spark.hadoop.fs.s3a.acl.default: BucketOwnerFullControl
          - spark.hadoop.fs.s3n.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
          - spark.hadoop.fs.AbstractFileSystem.s3n.impl: org.apache.hadoop.fs.s3a.S3A
          - spark.hadoop.fs.s3.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
          - spark.hadoop.fs.AbstractFileSystem.s3.impl: org.apache.hadoop.fs.s3a.S3A
          - spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
          - spark.hadoop.fs.AbstractFileSystem.s3a.impl: org.apache.hadoop.fs.s3a.S3A
          - spark.hadoop.fs.s3a.multipart.threshold: "536870912"
          - spark.task.maxfailures: "8"
          - spark.kubernetes.allocation.batch.size: "50"
          - spark.kubernetes.node.selector.app: "flyte"
    storage:
      cache:
        max_size_mbs: 10
        target_gc_percent: 100
    tasks:
      task-plugins:
        enabled-plugins:
          - container
          - sidecar
          - K8S-ARRAY
          - spark
        default-for-task-types:
          - container: container
          - container_array: K8S-ARRAY
          - spark: spark
deployment:
  extraPodSpec:
    nodeSelector:
      app: flyte
    tolerations:
      - key: "app"
        operator: "Equal"
        value: "flyte"
        effect: "NoSchedule"
clusterResourceTemplates:
  inline:
    001_namespace.yaml: |
      apiVersion: v1
      kind: Namespace
      metadata:
        name: '{{ namespace }}'
    # 002_serviceaccount.yaml: |
    #   apiVersion: v1
    #   kind: ServiceAccount
    #   metadata:
    #     name: flyte-runner
    #     namespace: '{{ namespace }}'
    #     annotations:
    #       eks.amazonaws.com/role-arn: '{{ defaultIamRole }}'
    010_spark_role.yaml: |
      apiVersion: rbac.authorization.k8s.io/v1
      kind: Role
      metadata:
        name: flyte-spark-role
        namespace: '{{ namespace }}'
      rules:
        - apiGroups:
            - ""
          resources:
            - '*'
          verbs:
            - '*'
    011_spark_service_account.yaml: |
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: spark
        namespace: '{{ namespace }}'
        annotations:
          eks.amazonaws.com/role-arn: '{{ defaultIamRole }}'
    012_spark_role_binding.yaml: |
      apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        name: flyte-spark-role-binding
        namespace: '{{ namespace }}'
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: Role
        name: flyte-spark-role
      subjects:
        - kind: ServiceAccount
          name: spark
          namespace: '{{ namespace }}'
ingress:
  create: true
  commonAnnotations:
    kubernetes.io/ingress.class: nginx
  httpAnnotations:
    nginx.ingress.kubernetes.io/app-root: /console
  grpcAnnotations:
    nginx.ingress.kubernetes.io/backend-protocol: GRPC
  host: *****
  tls:
    - hosts:
        - *****
      secretName: flyte.deploy-tls-secret
rbac:
  extraRules:
    - apiGroups:
        - ""
      resources:
        - '*'
      verbs:
        - "*"
    - apiGroups:
        - ""
      resources:
        - serviceaccounts
      verbs:
        - '*'
    - apiGroups:
        - rbac.authorization.k8s.io
      resources:
        - '*'
      verbs:
        - '*'
    - apiGroups:
        - sparkoperator.k8s.io
      resources:
        - sparkapplications
      verbs:
        - "*"
serviceAccount:
  create: true
  name: flyte-sa
  annotations:
    eks.amazonaws.com/role-arn: *****
```
Found the problem. Our Spark operator deployment on Kubernetes was missing the mutating admission webhook, which is a hard requirement for this to work. We enabled it in the helm chart and set the webhook port to 443.
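For anyone hitting the same symptom, here is a minimal sketch of that kind of values override, assuming the kubeflow/spark-operator Helm chart (key names can differ between chart versions, so check the chart's own values.yaml):

```yaml
# Sketch of a spark-operator Helm values override (assumes the kubeflow/spark-operator chart;
# exact keys may vary by chart version).
webhook:
  # The mutating admission webhook is what patches driver/executor pods with
  # pod-level customizations such as tolerations; without it those fields are not applied.
  enable: true
  port: 443
```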
@tionichm this is great to hear! Thanks so much for updating this ticket, it really helps future users debug problems. Is it reasonable to close this now? |
Describe the bug
We're running flyte-binary on EKS. Everything is working at this point, including Spark tasks. Recently it became a requirement to move all Flyte-related workloads to a specific node group with a label and taint combination.
Tolerations for Spark tasks were added in PR #4183. I've spent some time trying to implement exactly what is shown in that PR, and I have had no success.
Note that the node selector portion of this is done using Spark configs rather than a PodTemplate, and it works as expected (see the snippet below).
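For reference, this is the entry from the `spark-config-default` block in the values file shared above that applies the node selector:

```yaml
plugins:
  spark:
    spark-config-default:
      # Node selector applied through Spark config rather than a PodTemplate
      - spark.kubernetes.node.selector.app: "flyte"
```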
I can confirm that the following task works as expected:

[task code omitted]

which yields [output omitted] for `hello_world_no_spark` and [output omitted] for `hello_world_spark`.

On creation, the manifests look like this:

`hello_world_no_spark`: [manifest omitted]
`hello_world_spark`: [manifest omitted]
Expected behavior

Spark tasks that are registered with tolerations via PodTemplate are expected to have the same tolerations applied to the driver and executor pods when the task is executed.
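As an illustration (not actual cluster output), this is a sketch of the scheduling-related fields the driver and executor pod specs would be expected to carry, using the values from the setup above:

```yaml
# Illustrative expectation only: scheduling fields the Spark driver/executor
# pods should end up with for this setup.
spec:
  nodeSelector:
    app: flyte               # from spark.kubernetes.node.selector.app
  tolerations:
    - key: "app"             # from the task-level PodTemplate
      operator: "Equal"
      value: "flyte"
      effect: "NoSchedule"
```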
Additional context to reproduce

- A node group is tainted with `app=flyte:NoSchedule` and labeled `app=flyte`.
- The node selector is set through the Spark config `spark.kubernetes.node.selector.app: "flyte"`.
- Register a Spark task using the following: [task code omitted]
- [remaining reproduction steps omitted; they reference the `app=flyte` label]

Screenshots
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?