
Add driver and executor pod template fields to SparkJob #4179

Closed · wants to merge 2 commits

Conversation

@andrewwdye (Contributor) commented Oct 6, 2023:

Tracking issue

#4105

Describe your changes

This change defines a common pattern for per-role pod configuration and adds RoleSpec fields for driver and executor pods, accordingly.

  • Adds a RoleSpec to common.proto with initial fields for pod and pod_template_name. This mirrors the interface we have for container tasks, spread across TaskTemplate and TaskMetadata. Fields are optional.
  • Adds driverSpec and executorSpec to SparkJob (sketched below)
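
A minimal proto sketch of these additions, assuming the field names described above. The RoleSpec layout and its field numbers are illustrative; only driverSpec = 10, executorSpec = 11, and the executor comment are taken from the diff quoted later in this thread.

// common.proto
message RoleSpec {
    // Optional pod spec override for this role's pods (illustrative field number).
    core.K8sPod pod = 1;
    // Optional name of an existing Kubernetes PodTemplate resource to apply to this role (illustrative field number).
    string pod_template_name = 2;
}

// spark.proto
message SparkJob {
    ...
    // The driver spec, used in place of the task's pod template
    RoleSpec driverSpec = 10;
    // The executor spec, used in place of the task's pod template
    RoleSpec executorSpec = 11;
}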

This PR was manually migrated from flyteorg/flyteidl#450. See initial comment discussions there.

How should a plugin handle TaskTemplate/TaskMetadata pod templates vs. these RoleSpecs?
For Spark, the task-level pod template is not consumed today and can be treated as an error if set (not supported). The plugin can inspect optional per-role specs for configuration.

Existing plugins that consume the task-level pod template (Dask, MPI, PyTorch, Ray) can continue doing so. If/when we are ready to introduce per-role pod configuration, we can add RoleSpec fields to the plugin's config, deprecate the task-level pod template, and rely on a runtime check to validate that only the task-level pod template OR per-role pod templates are specified.
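
To make that migration path concrete, here is a purely hypothetical sketch; SomeDistributedJob, leaderSpec, and workerSpec are placeholder names, not part of any existing plugin IDL.

// Hypothetical plugin config illustrating the migration to per-role specs.
message SomeDistributedJob {
    ...
    // New per-role overrides. The plugin would keep honoring the task-level
    // pod template (TaskTemplate/TaskMetadata) until it is deprecated, and a
    // runtime check would reject specs that set both the task-level pod
    // template and these per-role fields.
    RoleSpec leaderSpec = 20;
    RoleSpec workerSpec = 21;
}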

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

@wild-endeavor previously approved these changes Oct 6, 2023

@wild-endeavor (Contributor) commented:

@eapolinario I see that one +1 was enough to merge this PR. Should we continue to make idl a +2? I think we should.

Not sure if there was resolution to @hamersaw's comment.

@eapolinario (Contributor) replied:

> @eapolinario I see that one +1 was enough to merge this PR. Should we continue to make idl a +2? I think we should.
>
> Not sure if there was resolution to @hamersaw's comment.

As far as I can tell CODEOWNERS does not support that feature (as per https://github.com/orgs/community/discussions/22522). I opened #4182 to track this.

@pingsutw (Member) left a comment:

Should we follow the same pattern as tensorflow.proto? Something like:

message SparkJob {
    ...
    sparkReplicaSpec driver = 11;
    sparkReplicaSpec worker = 12;
}

message sparkReplicaSpec {
    oneof PodValue {
        core.K8sPod podTemplate = 1;
    }
    string PodTemplateName = 2;
    string image = 3;
     ...
}

Maybe we should put ReplicaSpec in common.proto so that all other plugins can import this proto in the future, like Ray.

# common.proto
message ReplicaSpec {
    oneof PodValue {
        core.K8sPod podTemplate = 1;
    }
    string PodTemplateName = 2;
    string image = 3;
     ...
}

cc @eapolinario @wild-endeavor @hamersaw wdyt

@andrewwdye (Contributor, Author) commented Oct 6, 2023:

Wanted to also cross-reference this comment thread from the original PR, which proposed using TaskTemplate.PodTemplate for all roles unless specifically overridden with a role-specific template/spec.

@task(
    task_config=Spark(
        spark_conf={
           "spark.driver.memory": "4G",
           "spark.executor.memory": "8G",
        },
        # driver only pod template overrides (takes precedence over top level pod_template)
        driver_pod_template=...
    ),
    # default pod template overrides for BOTH driver and executor
    pod_template=...
)

@andrewwdye (Contributor, Author) commented Oct 6, 2023:

Pulling together references from existing leader/worker patterns for various plugins:

  • Dask uses a common spec for scheduler/worker, with the scheduler treated as non-interruptible
  • MPI uses a common pod spec with some per role args
  • PyTorch uses a common pod spec with some per role args
  • Tensorflow uses separate replica specs for each of chief/worker/ps
  • Ray uses a common pod spec for headgroup and worker

It does seem like a good idea to establish a pattern (and shared definitions if appropriate).

The codecov bot commented Oct 8, 2023:

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (4ee73f5) 59.44% compared to head (8d12d1b) 78.48%.
Report is 4 commits behind head on master.

❗ Current head 8d12d1b differs from pull request most recent head 6fefc0b. Consider uploading reports for the commit 6fefc0b to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #4179       +/-   ##
===========================================
+ Coverage   59.44%   78.48%   +19.03%     
===========================================
  Files         637       18      -619     
  Lines       54262     1250    -53012     
===========================================
- Hits        32254      981    -31273     
+ Misses      19467      212    -19255     
+ Partials     2541       57     -2484     
Flag: unittests (coverage: ?)

Flags with carried forward coverage won't be shown.

see 637 files with indirect coverage changes


@andrewwdye (Contributor, Author) commented:

Based on comments from the initial PR thread and above, I've moved pod and pod template name behind a RoleSpec definition in common.proto. See the latest PR summary for the pattern this establishes and the expected handling of TaskTemplate.K8sPod.

@wild-endeavor (Contributor) left a comment:

+1 but is this the right repo @eapolinario?

Comment on lines +37 to +39:

RoleSpec driverSpec = 10;
// The executor spec, used in place of the task's pod template
RoleSpec executorSpec = 11;

@hamersaw (Contributor) commented Oct 9, 2023:

Is there ever a scenario where both of these differ from the values set in the task decorator?

@andrewwdye (Contributor, Author) replied:

Curious what you mean here. With this change we will fail

@task(
  pod_template=...
)

in favor of specifying one or both values in

@task(
  task_config=Spark(
    executor_pod_template=...
    driver_pod_template=...
  )
)

@hamersaw (Contributor) commented Oct 9, 2023:

> Maybe we should put ReplicaSpec in common.proto so that all other plugins can import this proto in the future, like Ray.

In theory I think this is a great idea. But in practice I would bet there will be corner-case plugins where not all options are supported.

@andrewwdye (Contributor, Author) left a comment:

> In theory I think this is a great idea. But in practice I would bet there will be corner-case plugins where not all options are supported.

Are you thinking of initial fields in RoleSpec or future ones? If we believe that K8sPod is the correct abstraction for overriding parts of the default pod template, then including it as part of the role spec seems appropriate. I could imagine, however, additional role-specific configuration desired over time that wouldn't fit into a common RoleSpec definition. Thoughts?


@andrewwdye (Contributor, Author) commented:

Given some of the outstanding questions for Spark, the pattern for multi-role plugins more broadly, and the deprecation/migration story, let me close this out until we get alignment. For now the Spark plugin can use the existing TaskTemplate K8sPod field to override tolerations for both driver and executor pods, which is complementary to the existing interface for node selectors: spark.kubernetes.node.selector... (same values for both roles). This is also consistent with how other plugins use pod templates for multiple roles today.
