-
Notifications
You must be signed in to change notification settings - Fork 559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Serve] Fix adding initial version and cleanup service versions #3804
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix @cblmemo! Left two quesetions.
# If the `services` and `version_specs` table are not aligned, it might | ||
# result in a None service status. In this case, the controller process |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we write down in what case the two tables will be misaligned if we have always handled the two database at the same time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, the system added an initial version in the __init__
function of replica managers and added the service record here
Lines 146 to 151 in 9e27a81
success = serve_state.add_service( | |
service_name, | |
controller_job_id=job_id, | |
policy=service_spec.autoscaling_policy_str(), | |
requested_resources_str=backend_utils.get_task_resources_str(task), | |
status=serve_state.ServiceStatus.CONTROLLER_INIT) |
If the code is aborted before adding the version and after adding the service, it will cause a misalignment. In this PR we move the adding of initial version closer to the service table, to reduce the possibility of causing such misalignment, though arguably it is still 'possible' to happed, so I added a safe guard here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An example of abortion would be shutil
lib call to fail, e.g. permission issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was the previous issue mainly because the clean up not deleting the final version from the version_spec
table, which is fixed by the delete_all_versions
now?
If this safe guard is only for rare case, it would be great to mention that this is rare in the comment here to make sure we still know whether this is a common case in the future. : )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry for not making it clear. The previous issue is the misalignment of service
and version_spec
, which causes the _get_service_status
to return a None
(due to the joining of the two tables), and causes the assertion to abort the deletion of the service.
skypilot/sky/serve/serve_utils.py
Lines 415 to 417 in 9e27a81
service_status = _get_service_status(service_name, | |
with_replica_info=False) | |
assert service_status is not None, service_name |
I found that we did not clean up the remaining version when I debugged the original issue, and fix that in the same PR along the way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this safe guard is only for rare case, it would be great to mention that this is rare in the comment here to make sure we still know whether this is a common case in the future. : )
Good point! Added a comment for this. Thanks!
# Add initial version information to the service state. | ||
serve_state.add_or_update_version(service_name, constants.INITIAL_VERSION, | ||
service_spec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, is this movement related to the bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please see the comment above 🫡
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix @cblmemo! LGTM.
Originally, in our serve state,
version_specs
in the__init__
function ofReplicaManager
, which is not attached to insertion to theservices
table. This could sometimes lead to a misalign of the two tables;This PR fixed the two problems.
TODO: test.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh