Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Storage] Continue storage deletion in sky storage delete when some fail #4454

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

andylizf
Copy link
Contributor

Fix #4050.

Make sky storage delete continue deleting remaining storages when some fail, instead of aborting the entire operation. Failed deletions are reported individually.

We Implemented this feature via adding a continue_on_error parameter in run_in_parallel

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@andylizf
Copy link
Contributor Author

@cblmemo PTAL, thanks!

@andylizf
Copy link
Contributor Author

andylizf commented Dec 10, 2024

I inserted five new records into the 'global_user_state' database: three records containing the previous storage handle value, and two records with null handle values.

andyl@DESKTOP-7FP6SMO ~/skypilot (fix-storage-recover) [1]> sky storage dele
te -a
Deleting 6 storages: skypilot-workdir-andyl-74d418df, test-storage-1, test-storage-2, test-storage-3, test-storage-4, test-storage-5. Proceed? [Y/n]: 
Error deleting storage test-storage-3: In global_user_state, storage name 'test-storage-3' does not match handle.storage_name 'skypilot-workdir-andyl-577894a6'
Error deleting storage test-storage-2: In global_user_state, storage name 'test-storage-2' does not match handle.storage_name 'skypilot-workdir-andyl-577894a6'
Error deleting storage test-storage-4: Storage name 'test-storage-4' not found.
Error deleting storage test-storage-1: In global_user_state, storage name 'test-storage-1' does not match handle.storage_name 'skypilot-workdir-andyl-577894a6'
Error deleting storage test-storage-5: Storage name 'test-storage-5' not found.
Deleted GCS bucket skypilot-workdir-andyl-74d418df.

Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the prompt fix @andylizf ! left some discussions.

sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Show resolved Hide resolved
sky/global_user_state.py Outdated Show resolved Hide resolved
"""
# Reference: https://stackoverflow.com/questions/25790279/python-multiprocessing-early-termination # pylint: disable=line-too-long
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not removing this reference

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean why remove the reference? I think it's trivial now, and the reference doesn't add much value to the code.

def run_in_parallel(func: Callable,
args: Iterable[Any],
num_threads: Optional[int] = None) -> List[Any]:
num_threads: Optional[int] = None,
continue_on_error: bool = False) -> List[Any]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding this seems change the return value type from Value to Union[Value, Exception], which looks a little bit strange to me. Why do we need this anyway? aren't we already add the try-except in the func?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I thought about not adding a function to catch and log these exceptions, but instead handle all exceptions in run_in_parallel. Anyway, it made sense to have this function have an argument like continue_on_error.
But then I realized it violated the abstraction concept and would make it harder to print custom logs. So I ended up wrapping the function and logging it.

Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andylizf ! Mostly looks good to me. After those nits it should be ready to go!

sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated
Comment on lines 3548 to 3551
except (exceptions.StorageBucketDeleteError, PermissionError) as e:
click.secho(f'Error deleting storage {name}: {e}', fg='red')
except ValueError as e:
click.secho(f'Error deleting storage {name}: {e}', fg='red')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
except (exceptions.StorageBucketDeleteError, PermissionError) as e:
click.secho(f'Error deleting storage {name}: {e}', fg='red')
except ValueError as e:
click.secho(f'Error deleting storage {name}: {e}', fg='red')
except (exceptions.StorageBucketDeleteError, PermissionError, ValueError) as e:
click.secho(f'Error deleting storage {name}: {e}', fg='red')

can we merge them?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, seems like here is one exception that is not being caught. After a second thought, for simplicity, we can probably just do broad except here..

(skypilot) (base) ➜  skypilot git:(remove_token) ✗ sky storage delete -a                                                   
Deleting 15 storages: skypilot-workdir-windsey-afa32d67, skypilot-workdir-windsey-81d2115f, skypilot-workdir-windsey-b8fd0382, skypilot-workdir-windsey-499178bb, skypilot-workdir-windsey-dd64cc00, skypilot-workdir-windsey-d04e17a8, skypilot-workdir-windsey-43925bf5, skypilot-workdir-windsey-3375424a, skypilot-workdir-windsey-5e824f17, skypilot-workdir-windsey-db755382, skypilot-workdir-windsey-25270ec6, skypilot-workdir-windsey-1cb4689a, skypilot-workdir-windsey-f40df433, skypilot-workdir-windsey-2ef8a0bf, skypilot-workdir-windsey-58878e24. Proceed? [Y/n]: 
No backing stores found. Deleting storage.
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadBucket operation: Forbidden

The above exception was the direct cause of the following exception:

sky.exceptions.StorageBucketGetError: Failed to access existing bucket 'skypilot-workdir-windsey-afa32d67'. This is likely because it is a private bucket you do not have access to.
To fix: 
  1. If you are trying to create a new bucket: use a different name.
  2. If you are trying to connect to an existing bucket: make sure your cloud credentials have access to it. To debug, consider running `aws s3 ls skypilot-workdir-windsey-afa32d67`.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... I agree

sky/cli.py Outdated
Comment on lines 3552 to 3555
except Exception as e: # pylint: disable=broad-except
with ux_utils.print_exception_no_traceback():
raise e

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
except Exception as e: # pylint: disable=broad-except
with ux_utils.print_exception_no_traceback():
raise e

It seems like we dont need it? We can simply not except it. And any exceptions raised by sky.storage_delete should already have the surpress traceback context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why they should have the traceback context supressed?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cuz this is a user-facing API and we want to keep the error message simple. The content in the error message should be self-explanatory.

@cblmemo cblmemo changed the title Continue storage deletion when some fail [Storage] Continue storage deletion in sky storage delete when some fail Dec 19, 2024
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andylizf ! LGTM. Please lmk when you pass the test and it should be ready to go!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Storage] sky storage delete -a aborted when deletion of one storage failed
2 participants