[k8s] Show nicer errors if ssh jump pod fails #3261

Closed
romilbhardwaj opened this issue Mar 1, 2024 · 3 comments
Labels
k8s Kubernetes related items Stale

Comments

@romilbhardwaj
Collaborator

If the underlying ssh jump pod fails (e.g., its node fails or something else goes wrong), sky exec fails with an unhandled traceback:

Traceback (most recent call last):
  File "/Users/romilb/tools/anaconda3/bin/sky", line 8, in <module>
    sys.exit(cli())
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 350, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/cli.py", line 1197, in invoke
    return super().invoke(ctx)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 371, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/cli.py", line 1611, in exec
    sky.exec(task, backend=backend, cluster_name=cluster, detach_run=detach_run)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 371, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/execution.py", line 595, in exec
    return _execute(
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/execution.py", line 350, in _execute
    job_id = backend.execute(handle,
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 371, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 350, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/backends/backend.py", line 99, in execute
    return self._execute(handle, task, detach_run, dryrun)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/backends/cloud_vm_ray_backend.py", line 3287, in _execute
    job_id = self._add_job(handle, task_copy.name, resources_str)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/backends/cloud_vm_ray_backend.py", line 3238, in _add_job
    subprocess_utils.handle_returncode(returncode, code,
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/subprocess_utils.py", line 91, in handle_returncode
    raise exceptions.CommandError(returncode, command, format_err_msg,
sky.exceptions.CommandError: Command python3 -u -c 'import os;import getpass;from sky.skylet import job_lib, log_lib, constants;job_owner_kwargs = {} if getattr(constants, "SKYLET_LIB_VERSION", 0) >= 1 else {"job_owner": getpass.getuser()};job_id = job_lib.add_job('"'"'sky-cmd'"'"', '"'"'romilb'"'"', '"'"'sky-2024-03-01-15-25-57-679007'"'"', '"'"'1x [CPU:0.5]'"'"');print("Job ID: " + str(job_id), flush=True)' failed with return code 255.

Running sky launch again fixes it, because that recovers the ssh jump pod.

However, we should print cleaner error messages in the event of such failures, possibly suggesting running sky launch again to reinitialize the ssh jump pod if it is not detected.
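
As a rough illustration of the kind of message suggested above, here is a minimal sketch. It is not SkyPilot code: `friendly_failure_message`, `SSH_ERROR_RETURNCODE`, and the `cluster_name` parameter are hypothetical names chosen for this example. It relies on ssh exiting with status 255 when ssh itself hits an error, which matches the return code in the traceback. A remote command can also exit with 255, so the hint is worded as a possibility rather than a diagnosis.

```python
# Hypothetical helper, not part of SkyPilot: turns a failed remote command
# into a short, actionable message instead of a raw traceback.

# ssh exits with 255 when ssh itself hits an error (e.g., connection failure).
SSH_ERROR_RETURNCODE = 255


def friendly_failure_message(returncode: int, command: str,
                             cluster_name: str) -> str:
    """Return a user-facing message for a failed remote command."""
    # Truncate long generated commands so the message stays readable.
    short_cmd = command if len(command) <= 60 else command[:57] + '...'
    if returncode == SSH_ERROR_RETURNCODE:
        return (f'Could not reach cluster {cluster_name!r} over ssh '
                f'(return code {returncode}). The ssh jump pod may have '
                'failed. Try running `sky launch` again to reinitialize it.')
    return f'Command {short_cmd!r} failed with return code {returncode}.'


if __name__ == '__main__':
    print(friendly_failure_message(255, "python3 -u -c '...'", 'my-cluster'))
```

A caller could catch the command error at the CLI boundary and print this message in place of the full traceback, keeping the traceback available behind a debug flag.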

@romilbhardwaj added the k8s (Kubernetes related items) label on Mar 1, 2024
Contributor

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions bot added the Stale label on Jun 30, 2024
@romilbhardwaj
Collaborator Author

The SSH jump pod is removed in #3657, so this issue can be closed with that PR.

@romilbhardwaj
Collaborator Author

Closed with #3657.
