detach the managed job controller from job submission #4458

Closed
wants to merge 3 commits into from

Conversation

cg505
Collaborator

@cg505 cg505 commented Dec 11, 2024

Previously, the Ray driver program and a Ray worker stayed in use for the entire runtime of a managed job. Now, the job controller detaches from the submitted job/Ray driver and keeps running in the background.

As a result, we have to manually manage logging and liveness of the controller process. This PR introduces two new directories for that purpose, plus the supporting plumbing.
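
For readers unfamiliar with the pattern, here is a minimal sketch of what "detach, then manage logging and liveness by hand" can look like on a POSIX system. It is not the code in this PR: the directory paths, `start_detached`, and `is_alive` are hypothetical names chosen for illustration, and the PR's two directories may be laid out differently.

```python
import os
import subprocess
from pathlib import Path
from typing import Sequence

# Hypothetical locations; the directories this PR introduces are not named here.
LOG_DIR = Path("~/.example/controller_logs").expanduser()
PID_DIR = Path("~/.example/controller_pids").expanduser()


def start_detached(job_id: int, argv: Sequence[str],
                   log_dir: Path = LOG_DIR, pid_dir: Path = PID_DIR) -> int:
    """Start argv in its own session, log to a file, and record its PID."""
    log_dir.mkdir(parents=True, exist_ok=True)
    pid_dir.mkdir(parents=True, exist_ok=True)
    with open(log_dir / f"{job_id}.log", "ab") as log:
        proc = subprocess.Popen(
            list(argv),
            stdin=subprocess.DEVNULL,
            stdout=log,                # the child keeps its own copy of the fd
            stderr=subprocess.STDOUT,
            start_new_session=True,    # POSIX: detach from the caller's session
        )
    (pid_dir / f"{job_id}.pid").write_text(str(proc.pid))
    return proc.pid


def is_alive(job_id: int, pid_dir: Path = PID_DIR) -> bool:
    """Best-effort liveness check based on the recorded PID."""
    try:
        pid = int((pid_dir / f"{job_id}.pid").read_text())
    except (FileNotFoundError, ValueError):
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

A PID check like this has known gaps: an exited child that its parent has not yet reaped still counts as existing, and PIDs can be reused, so a real implementation would likely need a stronger identity check.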

  • TODO: limit the parallelism of the controller processes (a high limit, something like 3x the GB of memory) and the parallelism of sky launch (a tighter limit, like 1x the GB of memory/4x the CPU count)
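
A rough sketch of how limits like these might be computed. The TODO leaves the exact rule open, so this makes two labeled assumptions: the launch limit is the smaller of the memory-based and CPU-based figures, and both limits are floored at 1. `parallelism_limits` is a hypothetical helper, not a function from this PR.

```python
import os
from typing import Optional, Tuple


def parallelism_limits(memory_gb: float,
                       cpu_count: Optional[int] = None) -> Tuple[int, int]:
    """Return (controller_limit, launch_limit) for a controller machine."""
    cpus = cpu_count or os.cpu_count() or 1
    # High limit for controller processes: about 3 per GB of memory.
    controller_limit = max(1, int(3 * memory_gb))
    # Tighter limit for launches. Taking the minimum of the two bounds is an
    # assumption; the TODO only lists "1x GB of memory/4x CPU count".
    launch_limit = max(1, min(int(1 * memory_gb), 4 * cpus))
    return controller_limit, launch_limit
```

For example, under these assumptions a 16 GB, 4-CPU controller would allow 48 controller processes and 16 concurrent launches.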

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@cg505 cg505 marked this pull request as draft December 11, 2024 01:36
@cg505 cg505 requested a review from Michaelvll December 11, 2024 01:36
@cg505 cg505 force-pushed the detach-managed-job branch from 32a5e36 to 50a3dce Compare December 11, 2024 02:04
@cg505
Collaborator Author

cg505 commented Dec 19, 2024

This has major issues after implementing parallelism control, most notably 1) jobs are not started in FIFO order and 2) each pending job keeps a process alive that uses 70MB of memory.
Superseded by #4485.

@cg505 cg505 closed this Dec 19, 2024