[AZURE] Job and cluster terminated due to a Runtime error after 2 days running #4589

Open
rafox2005 opened this issue Jan 18, 2025 · 0 comments


Versions:
skypilot==0.7.0
skypilot-nightly==1.0.0.dev20250107

Description:
The job and its cluster were terminated by SkyPilot, without any retry, because of the runtime error below. The controller's free disk space, memory, and CPU resources are all fine.

(noleak_yolov5mblob_150102025, pid=1267637) I 01-18 00:41:12 utils.py:95] === Checking the job status... ===
(noleak_yolov5mblob_150102025, pid=1267637) I 01-18 00:41:12 utils.py:101] Job status: JobStatus.RUNNING
(noleak_yolov5mblob_150102025, pid=1267637) I 01-18 00:41:12 utils.py:104] ==================================
(noleak_yolov5mblob_150102025, pid=1267637) W 01-18 00:41:43 common_utils.py:404] Caught Failed to parse status from Azure response: None.. Retrying.
(noleak_yolov5mblob_150102025, pid=1267637) W 01-18 00:42:21 common_utils.py:404] Caught Failed to parse status from Azure response: None.. Retrying.
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394] Traceback (most recent call last):
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/backends/backend_utils.py", line 1791, in _query_cluster_status_via_cloud_api
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     node_status_dict = provision_lib.query_instances(
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     return f(*args, **kwargs)
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/provision/__init__.py", line 52, in _wrapper
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     return impl(*args, **kwargs)
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 400, in method_with_retries
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     return method(*args, **kwargs)
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/provision/azure/instance.py", line 984, in query_instances
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     p.starmap(_fetch_and_map_status,
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/miniconda3/envs/skypilot-runtime/lib/python3.10/multiprocessing/pool.py", line 375, in starmap
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     return self._map_async(func, iterable, starmapstar, chunksize).get()
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/miniconda3/envs/skypilot-runtime/lib/python3.10/multiprocessing/pool.py", line 774, in get
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     raise self._value
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/miniconda3/envs/skypilot-runtime/lib/python3.10/multiprocessing/pool.py", line 125, in worker
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     result = (True, func(*args, **kwds))
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/miniconda3/envs/skypilot-runtime/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     return list(itertools.starmap(args[0], args[1]))
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/provision/azure/instance.py", line 976, in _fetch_and_map_status
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     status = _get_instance_status(compute_client, node, resource_group)
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/provision/azure/instance.py", line 740, in _get_instance_status
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     return AzureInstanceStatus.from_raw_states(provisioning_state, None)
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/provision/azure/instance.py", line 128, in from_raw_states
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     raise exceptions.ClusterStatusFetchingError(
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394] sky.exceptions.ClusterStatusFetchingError: Failed to parse status from Azure response: None.
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394] During handling of the above exception, another exception occurred:
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394] Traceback (most recent call last):
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/controller.py", line 369, in run
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     succeeded = self._run_one_task(task_id, task)
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/controller.py", line 273, in _run_one_task
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     handle) = backend_utils.refresh_cluster_status_handle(
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     return f(*args, **kwargs)
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/backends/backend_utils.py", line 2328, in refresh_cluster_status_handle
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     record = refresh_cluster_record(
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/backends/backend_utils.py", line 2290, in refresh_cluster_record
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     return _update_cluster_status_no_lock(cluster_name)
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/backends/backend_utils.py", line 1959, in _update_cluster_status_no_lock
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     node_statuses = _query_cluster_status_via_cloud_api(handle)
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/backends/backend_utils.py", line 1799, in _query_cluster_status_via_cloud_api
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]     raise exceptions.ClusterStatusFetchingError(
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394] sky.exceptions.ClusterStatusFetchingError: Failed to query Azure cluster 'noleak-yolov5mblob-150-6l-73' status: [sky.exceptions.ClusterStatusFetchingError] Failed to parse status from Azure response: None.
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:394]
(noleak_yolov5mblob_150102025, pid=1267637) E 01-18 00:42:54 controller.py:397] Unexpected error occurred: [sky.exceptions.ClusterStatusFetchingError] Failed to query Azure cluster 'noleak-yolov5mblob-150-6l-73' status: [sky.exceptions.ClusterStatusFetchingError] Failed to parse status from Azure response: None.
(noleak_yolov5mblob_150102025, pid=1267637) I 01-18 00:42:54 state.py:480] Unexpected error occurred: [sky.exceptions.ClusterStatusFetchingError] Failed to query Azure cluster 'noleak-yolov5mblob-150-6l-73' status: [sky.exceptions.ClusterStatusFetchingError] Failed to parse status from Azure response: None.
(noleak_yolov5mblob_150102025, pid=1267637) I 01-18 00:42:56 controller.py:523] Killing controller process 1267707.
(noleak_yolov5mblob_150102025, pid=1267637) I 01-18 00:42:56 controller.py:531] Controller process 1267707 killed.
(noleak_yolov5mblob_150102025, pid=1267637) I 01-18 00:42:56 controller.py:533] Cleaning up any cluster for job 73.
(noleak_yolov5mblob_150102025, pid=1267637) I 01-18 00:43:01 storage.py:645] Verifying bucket for storage test-bucket
(noleak_yolov5mblob_150102025, pid=1267637) I 01-18 00:43:01 storage.py:997] Storage type StoreType.AZURE already exists under storage account 'sky63566309a1c8c949'.
(noleak_yolov5mblob_150102025, pid=1267637) W 01-18 00:43:06 task.py:153] Docker login configs SKYPILOT_DOCKER_PASSWORD, SKYPILOT_DOCKER_SERVER, SKYPILOT_DOCKER_USERNAME are provided, but no docker image is specified in image_id. The login configs will be ignored.
(noleak_yolov5mblob_150102025, pid=1267637) W 01-18 00:43:06 task.py:153] Docker login configs SKYPILOT_DOCKER_PASSWORD, SKYPILOT_DOCKER_SERVER, SKYPILOT_DOCKER_USERNAME are provided, but no docker image is specified in image_id. The login configs will be ignored.
(noleak_yolov5mblob_150102025, pid=1267637) W 01-18 00:43:06 task.py:153] Docker login configs SKYPILOT_DOCKER_PASSWORD, SKYPILOT_DOCKER_SERVER, SKYPILOT_DOCKER_USERNAME are provided, but no docker image is specified in image_id. The login configs will be ignored.
(noleak_yolov5mblob_150102025, pid=1267637) W 01-18 00:43:06 task.py:153] Docker login configs SKYPILOT_DOCKER_PASSWORD, SKYPILOT_DOCKER_SERVER, SKYPILOT_DOCKER_USERNAME are provided, but no docker image is specified in image_id. The login configs will be ignored.
(noleak_yolov5mblob_150102025, pid=1267637) W 01-18 00:43:06 task.py:153] Docker login configs SKYPILOT_DOCKER_PASSWORD, SKYPILOT_DOCKER_SERVER, SKYPILOT_DOCKER_USERNAME are provided, but no docker image is specified in image_id. The login configs will be ignored.
(noleak_yolov5mblob_150102025, pid=1267637) W 01-18 00:43:06 task.py:153] Docker login configs SKYPILOT_DOCKER_PASSWORD, SKYPILOT_DOCKER_SERVER, SKYPILOT_DOCKER_USERNAME are provided, but no docker image is specified in image_id. The login configs will be ignored.
(noleak_yolov5mblob_150102025, pid=1267637) W 01-18 00:43:06 task.py:153] Docker login configs SKYPILOT_DOCKER_PASSWORD, SKYPILOT_DOCKER_SERVER, SKYPILOT_DOCKER_USERNAME are provided, but no docker image is specified in image_id. The login configs will be ignored.
(noleak_yolov5mblob_150102025, pid=1267637) W 01-18 00:43:06 task.py:153] Docker login configs SKYPILOT_DOCKER_PASSWORD, SKYPILOT_DOCKER_SERVER, SKYPILOT_DOCKER_USERNAME are provided, but no docker image is specified in image_id. The login configs will be ignored.
(noleak_yolov5mblob_150102025, pid=1267637) W 01-18 00:43:06 task.py:153] Docker login configs SKYPILOT_DOCKER_PASSWORD, SKYPILOT_DOCKER_SERVER, SKYPILOT_DOCKER_USERNAME are provided, but no docker image is specified in image_id. The login configs will be ignored.
(noleak_yolov5mblob_150102025, pid=1267637) W 01-18 00:43:06 task.py:153] Docker login configs SKYPILOT_DOCKER_PASSWORD, SKYPILOT_DOCKER_SERVER, SKYPILOT_DOCKER_USERNAME are provided, but no docker image is specified in image_id. The login configs will be ignored.
(noleak_yolov5mblob_150102025, pid=1267637) W 01-18 00:43:06 task.py:153] Docker login configs SKYPILOT_DOCKER_PASSWORD, SKYPILOT_DOCKER_SERVER, SKYPILOT_DOCKER_USERNAME are provided, but no docker image is specified in image_id. The login configs will be ignored.
(noleak_yolov5mblob_150102025, pid=1267637) W 01-18 00:43:06 task.py:153] Docker login configs SKYPILOT_DOCKER_PASSWORD, SKYPILOT_DOCKER_SERVER, SKYPILOT_DOCKER_USERNAME are provided, but no docker image is specified in image_id. The login configs will be ignored.
(noleak_yolov5mblob_150102025, pid=1267637) W 01-18 00:43:06 task.py:153] Docker login configs SKYPILOT_DOCKER_PASSWORD, SKYPILOT_DOCKER_SERVER, SKYPILOT_DOCKER_USERNAME are provided, but no docker image is specified in image_id. The login configs will be ignored.
(noleak_yolov5mblob_150102025, pid=1267637) W 01-18 00:43:06 task.py:153] Docker login configs SKYPILOT_DOCKER_PASSWORD, SKYPILOT_DOCKER_SERVER, SKYPILOT_DOCKER_USERNAME are provided, but no docker image is specified in image_id. The login configs will be ignored.
(noleak_yolov5mblob_150102025, pid=1267637) I 01-18 00:43:19 controller.py:542] Cluster of managed job 73 has been cleaned up.
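
The log shows two "Retrying" warnings about 35 seconds apart before the status query gives up and the job is cleaned up. As a rough illustration of the behavior I would expect instead, here is a minimal sketch that treats a `None` status as transient and retries with exponential backoff. Everything in it (`StatusFetchError`, `query_with_backoff`, the parameter names and defaults) is hypothetical and made up for this example; it is not SkyPilot's code or API.

```python
import time


class StatusFetchError(Exception):
    """Hypothetical stand-in for a transient cloud status-query failure."""


def query_with_backoff(query, max_attempts=5, initial_delay=1.0, factor=2.0,
                       sleep=time.sleep):
    """Call query() until it returns a non-None status or attempts run out.

    A None result is treated as a transient failure (an assumption made for
    this sketch). The delay between attempts grows by `factor` each time.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            status = query()
            if status is None:
                raise StatusFetchError('status response was None')
            return status
        except StatusFetchError:
            if attempt == max_attempts:
                # Out of attempts: surface the error to the caller.
                raise
            sleep(delay)
            delay *= factor


# Simulated cloud API that returns None twice before reporting a real status.
responses = iter([None, None, 'running'])
delays = []  # Record requested delays instead of actually sleeping.
result = query_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

With more attempts and longer delays, a short window of empty responses like the one above would be ridden out rather than ending a job that has been running for two days.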
