Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core feature] Refactor distributed job using common ReplicaSpec #4408

Closed
2 tasks done
troychiu opened this issue Nov 12, 2023 · 2 comments · Fixed by #5355 · May be fixed by flyteorg/flytekit#2424
Closed
2 tasks done

[Core feature] Refactor distributed job using common ReplicaSpec #4408

troychiu opened this issue Nov 12, 2023 · 2 comments · Fixed by #5355 · May be fixed by flyteorg/flytekit#2424
Assignees
Labels
enhancement New feature or request waiting for reporter Used for when we need input from the bug reporter

Comments

@troychiu
Copy link
Member

Motivation: Why do you think this is important?

Right now, different types of distributed jobs such as tensorflow, PyTorch, ... all have their own ReplicaSpec. Based on this discussion thread #4179 (review), we can have a shared ReplicaSpec in common.proto so that all types of distributed jobs can leverage it.

Goal: What should the final outcome look like, ideally?

In common.proto, we have a ReplicaSpec like

# common.proto
message ReplicaSpec {
    // Number of replicas
    int32 replicas = 1;
    
    // Image used for the replica group
    string image = 2;
    
    // Resources required for the replica group
    core.Resources resources = 3;
    
    // Restart policy determines whether pods will be restarted when they exit
    RestartPolicy restart_policy = 4;
    ...
}

and all types of distributed jobs (tensorflow, PyTorch, ray, ...) share it.

Describe alternatives you've considered

Stay what we have now. That is, all types of distributed job have their own ReplicaSpec.

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@troychiu troychiu added enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers labels Nov 12, 2023
@troychiu troychiu self-assigned this Nov 12, 2023
@hamersaw
Copy link
Contributor

IMO there needs to be more discussion here. There are certainly many configuration parameters that will differ between ReplicaSpec for each of the distributed environments, and even within each environment. Is the RestartPolicy you have mocked applied to every plugin? If not, this is another parameter that may result in unintended behavior. Is the proposal to use this common.pb message as a field of each replica spec? For example:

// spark.pb
message SparkWorkerReplicaSpec {
    ReplicaSpec replica_spec = 1;
    
    ... spark sepcific configuration
}

// dask.pb
message DaskScheduler {
   RepliaSpec replica_spec = 1;
   
   ...
}

message DaskWorker {
    ReplicaSpec replica_spec = 1;
    
    ....
}

@eapolinario eapolinario added waiting for reporter Used for when we need input from the bug reporter and removed untriaged This issues has not yet been looked at by the Maintainers labels Dec 8, 2023
@MortalHappiness
Copy link
Member

#take

MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 11, 2024
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 11, 2024
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 11, 2024
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue May 16, 2024
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue May 17, 2024
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue May 17, 2024
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 18, 2024
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 18, 2024
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 18, 2024
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 18, 2024
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 18, 2024
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 18, 2024
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 21, 2024
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 21, 2024
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 21, 2024
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 25, 2024
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 25, 2024
MortalHappiness added a commit to MortalHappiness/flyte that referenced this issue May 25, 2024
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue May 25, 2024
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue May 26, 2024
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue May 26, 2024
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue Jun 12, 2024
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue Jun 14, 2024
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue Jun 14, 2024
Resolves: flyteorg/flyte#4408
Signed-off-by: Chi-Sheng Liu <[email protected]>
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue Jun 14, 2024
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue Jun 14, 2024
Resolves: flyteorg/flyte#4408
Signed-off-by: Chi-Sheng Liu <[email protected]>
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue Jun 14, 2024
Resolves: flyteorg/flyte#4408
Signed-off-by: Chi-Sheng Liu <[email protected]>
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue Jun 14, 2024
Resolves: flyteorg/flyte#4408
Signed-off-by: Chi-Sheng Liu <[email protected]>
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue Jun 16, 2024
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue Jun 16, 2024
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue Jun 16, 2024
Resolves: flyteorg/flyte#4408
Signed-off-by: Chi-Sheng Liu <[email protected]>
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue Jun 16, 2024
robert-ulbrich-mercedes-benz pushed a commit to robert-ulbrich-mercedes-benz/flyte that referenced this issue Jul 2, 2024
…#5355)

* feat(proto): Define CommonReplicaSpec in common.proto

Resolves: flyteorg#4408
Signed-off-by: Chi-Sheng Liu <[email protected]>

* chore(proto): Generate new clients corresponding to proto changes

Resolves: flyteorg#4408
Signed-off-by: Chi-Sheng Liu <[email protected]>

* feat(replica-spec): Update corresponding golang files based on protobuf changes

Resolves: flyteorg#4408
Signed-off-by: Chi-Sheng Liu <[email protected]>

---------

Signed-off-by: Chi-Sheng Liu <[email protected]>
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue Jul 8, 2024
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue Jul 8, 2024
Resolves: flyteorg/flyte#4408
Signed-off-by: Chi-Sheng Liu <[email protected]>
MortalHappiness added a commit to MortalHappiness/flytekit that referenced this issue Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request waiting for reporter Used for when we need input from the bug reporter
Projects
None yet
4 participants