If the underlying ssh jump pod fails (e.g., the node it runs on goes down), sky exec fails with an unhandled traceback:
Traceback (most recent call last):
  File "/Users/romilb/tools/anaconda3/bin/sky", line 8, in <module>
    sys.exit(cli())
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 350, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/cli.py", line 1197, in invoke
    return super().invoke(ctx)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 371, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/cli.py", line 1611, in exec
    sky.exec(task, backend=backend, cluster_name=cluster, detach_run=detach_run)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 371, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/execution.py", line 595, in exec
    return _execute(
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/execution.py", line 350, in _execute
    job_id = backend.execute(handle,
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 371, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 350, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/backends/backend.py", line 99, in execute
    return self._execute(handle, task, detach_run, dryrun)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/backends/cloud_vm_ray_backend.py", line 3287, in _execute
    job_id = self._add_job(handle, task_copy.name, resources_str)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/backends/cloud_vm_ray_backend.py", line 3238, in _add_job
    subprocess_utils.handle_returncode(returncode, code,
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/subprocess_utils.py", line 91, in handle_returncode
    raise exceptions.CommandError(returncode, command, format_err_msg,
sky.exceptions.CommandError: Command python3 -u -c 'import os;import getpass;from sky.skylet import job_lib, log_lib, constants;job_owner_kwargs = {} if getattr(constants, "SKYLET_LIB_VERSION", 0) >= 1 else {"job_owner": getpass.getuser()};job_id = job_lib.add_job('"'"'sky-cmd'"'"', '"'"'romilb'"'"', '"'"'sky-2024-03-01-15-25-57-679007'"'"', '"'"'1x [CPU:0.5]'"'"');print("Job ID: " + str(job_id), flush=True)' failed with return code 255.
Running sky launch again fixes it, since that recovers the ssh jump pod.
However, we should print a cleaner error message when this kind of failure happens, possibly suggesting that the user run sky launch again to reinitialize the ssh jump pod if the pod is not detected.
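One possible shape for the cleaner message, as a rough sketch and not the actual fix: map the failure to a short hint instead of letting the traceback reach the user. Everything named here is hypothetical. `CommandError` is a local stand-in for `sky.exceptions.CommandError`, and `friendly_message` and the `jump_pod_exists` callback are made up for illustration (how to really detect the jump pod is left open). The only outside fact relied on is that ssh exits with 255 when the connection itself fails, which matches the return code in the traceback above.

```python
from typing import Callable

# ssh exits with 255 when the connection itself fails (as opposed to the
# remote command returning its own non-zero exit status).
SSH_CONNECTION_ERROR = 255


class CommandError(Exception):
    """Hypothetical stand-in for sky.exceptions.CommandError."""

    def __init__(self, returncode: int, command: str, error_msg: str) -> None:
        super().__init__(error_msg)
        self.returncode = returncode
        self.command = command
        self.error_msg = error_msg


def friendly_message(err: CommandError, cluster: str,
                     jump_pod_exists: Callable[[], bool]) -> str:
    """Turn a failed remote command into a short, actionable message.

    `jump_pod_exists` is an assumed callback that reports whether the
    ssh jump pod can currently be found; it is only called when the
    return code already points at an ssh connection failure.
    """
    if err.returncode == SSH_CONNECTION_ERROR and not jump_pod_exists():
        return (f"Could not reach cluster {cluster!r}: the ssh jump pod was "
                "not detected. Run sky launch again for this cluster to "
                "reinitialize it.")
    # Any other failure: keep the original details, minus the traceback.
    return (f"Command failed with return code {err.returncode}: "
            f"{err.error_msg}")


if __name__ == "__main__":
    err = CommandError(255, "python3 -u -c '...'", "ssh connection failed")
    print(friendly_message(err, "my-cluster", jump_pod_exists=lambda: False))
```

The check is deliberately two-part: return code 255 alone can also mean other ssh problems (bad key, network partition), so the sky launch hint is only shown when the pod lookup also comes back empty.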