[Bug]: dstack marks instance as terminated without terminating it #1551
Comments
This issue is stale because it has been open for 30 days with no activity.
@jvstme is this issue still relevant?
@peterschmidt85, yes, I just reproduced it; same behavior.
The same seems to be relevant for TensorDock: an instance was marked as terminated in dstack, yet it is still running in TensorDock. I don't have server logs for this case, but the TensorDock API is known to be experiencing issues at the moment. UPD: same on GCP.
I think this should be easy to reproduce by simulating a 500 error, no? It could also be a test. Then we ensure that the instance remains.
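
A minimal sketch of such a test (pytest style). `BackendAPIError` and `terminate_instance_record` below are stand-ins invented for illustration, not actual dstack code; in a real test, the routine driving termination would be imported from the server and exercised against a mocked backend compute:

```python
from dataclasses import dataclass
from unittest.mock import MagicMock


class BackendAPIError(Exception):
    """Stands in for a backend API failure, e.g. an HTTP 500 response."""


@dataclass
class InstanceRecord:
    name: str
    status: str  # "terminating" / "terminated"


def terminate_instance_record(instance: InstanceRecord, compute) -> None:
    """Stand-in for the server-side termination step with the proposed fix:
    the status is only flipped when the backend call succeeds."""
    try:
        compute.terminate_instance(instance.name)
    except BackendAPIError:
        return  # leave the instance as "terminating" so termination is retried
    instance.status = "terminated"


def test_instance_stays_terminating_when_backend_returns_500():
    instance = InstanceRecord(name="test-instance", status="terminating")
    compute = MagicMock()
    compute.terminate_instance.side_effect = BackendAPIError("500 Internal Server Error")

    terminate_instance_record(instance, compute)

    # The key assertion: a failed termination call must not mark the instance
    # as terminated, otherwise it keeps running (and billing) in the backend.
    assert instance.status == "terminating"
```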
The current behavior is to notify the admin with logger.error if there is an unexpected error when terminating the instance and to mark the instance as terminated. An unexpected error may also happen when the instance has actually been terminated (odd backend behavior), so not marking the instance as terminated would result in the instance being shown as running/terminating in dstack while it is already terminated in the backend. dstack Sky would charge users for that. Possibly we can default to not terminating, but there should be an easy way to manually manage such situations, such as marking the instance as terminated via the UI.
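
For context, the flow described above boils down to something like the sketch below; the names are hypothetical stand-ins, not the actual dstack server code:

```python
import logging

logger = logging.getLogger(__name__)


def terminate_and_mark(instance, backend_compute) -> None:
    """Sketch of the current behavior: log unexpected errors, mark terminated anyway."""
    try:
        backend_compute.terminate_instance(instance.instance_id, instance.region)
    except Exception:
        # The admin is notified through the error log...
        logger.error("Failed to terminate instance %s", instance.instance_id, exc_info=True)
    # ...but the status is flipped either way, which is what can leave an
    # orphaned instance running (and billing) in the backend.
    instance.status = "terminated"
```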
Happy to discuss. I'm a bit skeptical that notifying the admin is useful.
I think it is useful, but not sufficient in simpler open-source setups where server logs can easily be lost or go unnoticed. Maybe dstack could perform an additional API request to the backend to verify that the instance was terminated. If it wasn't, or if this request fails too, retry termination. Similar verification requests could also be useful to avoid marking instances as terminated before they are actually fully terminated in the backend (currently dstack marks instances as terminated immediately after requesting termination, not after termination finishes). This could prevent issues like #1744 and improve observability.
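
A rough sketch of that verification step, assuming hypothetical helpers `terminate_in_backend` and `list_backend_instance_ids`; the real per-backend Compute interface will differ:

```python
def terminate_and_verify(instance, terminate_in_backend, list_backend_instance_ids) -> bool:
    """Request termination, then confirm the instance is really gone.

    Returns True only when the backend no longer reports the instance, so the
    caller can safely mark it terminated; otherwise the caller keeps it in the
    terminating state and retries later.
    """
    try:
        terminate_in_backend(instance)
    except Exception:
        # The call may still have partially succeeded, so fall through to
        # verification instead of assuming anything.
        pass
    try:
        return instance.backend_id not in list_backend_instance_ids()
    except Exception:
        # Verification failed too; do not assume the instance is gone.
        return False
```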
This issue is stale because it has been open for 30 days with no activity.
This issue is stale because it has been open for 30 days with no activity.
@jvstme is this relevant/major?
Yes, I just reproduced it with AWS (unintentionally; luckily, I looked at the logs). This issue looks major to me, as it is likely to lead to unnecessary charges when dstack-server is run without Sentry or other alerting tools. Considering the opinions above, I can suggest the following:
Actually, the simplest solution we could start with is to retry termination for 5-10 minutes and only mark the instance as terminated if no attempts were successful. While not ideal, this solution will work for many cases, such as short-term network or backend outages. It is also necessary for Vultr, since Vultr's termination API consistently fails during some intermediate instance states; see this comment.
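
As an illustration of that retry policy (a sketch only, not dstack's actual implementation; `terminate_in_backend` is an assumed callable for the backend API call):

```python
import logging
import time
from datetime import datetime, timedelta, timezone

RETRY_WINDOW = timedelta(minutes=10)
RETRY_INTERVAL_SECONDS = 30

logger = logging.getLogger(__name__)


def terminate_with_retries(instance, terminate_in_backend) -> str:
    """Retry backend termination for a bounded window before giving up.

    Returns "terminated" once an attempt succeeds, or after the window expires
    with every attempt having failed (the last-resort fallback discussed above).
    """
    deadline = datetime.now(timezone.utc) + RETRY_WINDOW
    while True:
        try:
            terminate_in_backend(instance)
            return "terminated"  # the backend accepted the termination request
        except Exception as e:
            if datetime.now(timezone.utc) >= deadline:
                # All attempts failed; give up and surface the problem.
                logger.error("Giving up on terminating %s: %s", instance, e)
                return "terminated"
            time.sleep(RETRY_INTERVAL_SECONDS)
```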
#2190 adds termination retries, which is enough to handle network or backend outages that don't last longer than ~15 minutes.
Steps to reproduce

Then wait until the run is `running` and switch off the network on `dstack-server`'s host.

Actual behaviour

The run is marked `failed` and the instance is marked `terminated`. However, the instance actually still exists in RunPod and the user is billed for it.

Expected behaviour

The instance is not marked `terminated` until it is actually deleted in RunPod.

dstack version

master

Server logs

Additional information

I reproduced this issue on RunPod and Vast.ai but not on OCI. Maybe the behavior is different for container-based and VM-based backends. On OCI, `dstack` makes many attempts at deleting the instance and only marks it `terminated` after succeeding, which is the expected behavior.

Ideally, the job also should not be marked `failed` if the connectivity issues are on `dstack-server`'s side, not on the instance's side. But this condition is difficult to detect, so it is out of scope for this issue.