
[Bug]: dstack marks instance as terminated without terminating it #1551

Closed
jvstme opened this issue Aug 14, 2024 · 15 comments

Labels: bug (Something isn't working), major

@jvstme (Collaborator) commented Aug 14, 2024

Steps to reproduce

> cat .dstack.yml 
type: dev-environment
ide: vscode

> dstack apply --spot-auto -b runpod -y

Then wait until the run is running and switch off the network on the dstack-server host.

Actual behaviour

The run is marked as failed and the instance is marked as terminated. However, the instance actually still exists in RunPod, and the user is billed for it.

Expected behaviour

The instance is not marked as terminated until it is actually deleted in RunPod.

dstack version

master

Server logs

[09:21:18] DEBUG    dstack._internal.core.services.ssh.tunnel:73 SSH tunnel failed: b'ssh: connect to host 194.68.245.18 port 22056: Network is                
                    unreachable\r\n'                                                                                                                           
I0000 00:00:1723620079.263387 1605744 work_stealing_thread_pool.cc:320] WorkStealingThreadPoolImpl::PrepareFork
[09:21:19] DEBUG    dstack._internal.core.services.ssh.tunnel:73 SSH tunnel failed: b'ssh: connect to host 194.68.245.18 port 22056: Network is                
                    unreachable\r\n'                                                                                                                           
           WARNING  dstack._internal.server.background.tasks.process_running_jobs:259 job(e3ec13)polite-starfish-1-0-0: failed because runner is not available 
                    or return an error,  age=0:03:00.121137                                                                                                    
           INFO     dstack._internal.server.background.tasks.process_runs:338 run(5dd434)polite-starfish-1: run status has changed RUNNING -> TERMINATING      
[09:21:21] DEBUG    dstack._internal.server.services.jobs:238 job(e3ec13)polite-starfish-1-0-0: stopping container                                             
           INFO     dstack._internal.server.services.jobs:269 job(e3ec13)polite-starfish-1-0-0: instance 'polite-starfish-1-0' has been released, new status is
                    TERMINATING                                                                                                                                
           INFO     dstack._internal.server.services.jobs:286 job(e3ec13)polite-starfish-1-0-0: job status is FAILED, reason: INTERRUPTED_BY_NO_CAPACITY       
[09:21:22] INFO     dstack._internal.server.services.runs:932 run(5dd434)polite-starfish-1: run status has changed TERMINATING -> FAILED, reason: JOB_FAILED   
[09:21:23] ERROR    dstack._internal.server.background.tasks.process_instances:763 Got exception when terminating instance polite-starfish-1-0                 
                    Traceback (most recent call last):                                                                                                         

[... long stack trace ...]
                                                                                            
                    requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.runpod.io', port=443): Max retries exceeded with url:                   
                    /graphql?api_key=***** (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at     
                    0x7f5fb5a98a90>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))                                  
           INFO     dstack._internal.server.background.tasks.process_instances:773 Instance polite-starfish-1-0 terminated

Additional information

I reproduced this issue on RunPod and Vast.ai but not on OCI. Maybe the behavior differs between container-based and VM-based backends. On OCI, dstack makes many attempts to delete the instance and only marks it as terminated after succeeding, which is the expected behavior.

Ideally, the job should also not be marked as failed if the connectivity issues are on dstack-server's side rather than on the instance's side. However, this condition is difficult to detect, so it is out of scope for this issue.

jvstme added the bug label Aug 14, 2024
@github-actions (bot) commented:

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Sep 20, 2024
@peterschmidt85 (Contributor) commented:

@jvstme is this issue still relevant?

@jvstme (Collaborator, Author) commented Sep 20, 2024

@peterschmidt85, yes, I just reproduced it - same behavior

jvstme removed the stale label Sep 20, 2024
@jvstme (Collaborator, Author) commented Oct 17, 2024

The same seems to be relevant for TensorDock — an instance was marked as terminated in dstack, yet it is still running in TensorDock. I don't have server logs for this case, but it is known that the TensorDock API is experiencing issues at the moment.

Update: the same happens on GCP.

@peterschmidt85 (Contributor) commented:

I think this should be easy to reproduce by simulating a 500 error, no? It could also be covered by a test. Then we ensure that the instance remains in terminating and that dstack eventually ensures the instance is terminated (e.g. it has either finished on the cloud side or no longer exists).
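
A rough sketch of what such a test could look like (the `FakeBackend` and `process_terminating_instance` names are illustrative stand-ins, not dstack's actual internals):

```python
import requests


class FakeBackend:
    """Stands in for a backend compute client whose terminate call fails."""

    def __init__(self, fail: bool = True):
        self.fail = fail
        self.terminate_calls = 0

    def terminate_instance(self, instance_id: str) -> None:
        self.terminate_calls += 1
        if self.fail:
            # Simulate a backend outage (e.g. a 500 or a dropped connection).
            raise requests.exceptions.ConnectionError("backend unavailable")


def process_terminating_instance(instance: dict, backend: FakeBackend) -> None:
    """Desired behavior: only mark terminated if the backend call succeeds."""
    try:
        backend.terminate_instance(instance["id"])
    except requests.exceptions.RequestException:
        # Keep the instance in "terminating" so a later run can retry.
        return
    instance["status"] = "terminated"


def test_instance_stays_terminating_when_backend_errors():
    backend = FakeBackend(fail=True)
    instance = {"id": "polite-starfish-1-0", "status": "terminating"}

    process_terminating_instance(instance, backend)

    assert backend.terminate_calls == 1
    assert instance["status"] == "terminating"  # not prematurely "terminated"
```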

@r4victor (Collaborator) commented:

The current behavior is to notify the admin via logger.error if there is an unexpected error when terminating an instance, and then mark the instance as terminated anyway. An unexpected error may also happen when the instance has actually been terminated (a weird backend behavior), so not marking the instance as terminated would result in the instance being shown as running/terminating in dstack while it is already terminated in the backend. dstack Sky would charge users for that. Possibly we can default to not marking it terminated, but then there should be an easy way to manually handle such situations, such as marking the instance as terminated via the UI.
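
For illustration, the flow described above roughly amounts to the following (simplified pseudocode of the current behavior, not dstack's actual code):

```python
import logging

logger = logging.getLogger(__name__)


def terminate(instance, backend):
    try:
        backend.terminate_instance(instance.backend_id)
    except Exception:
        # Current behavior: notify the admin via an error log...
        logger.error(
            "Got exception when terminating instance %s", instance.name, exc_info=True
        )
    # ...and mark the instance as terminated regardless of whether the
    # backend call actually succeeded.
    instance.status = "terminated"
    logger.info("Instance %s terminated", instance.name)
```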

@peterschmidt85 (Contributor) commented:

Happy to discuss. I'm a bit skeptical that notifying the admin is useful.

dstackai deleted a comment from peterschmidt85 Oct 17, 2024
@jvstme (Collaborator, Author) commented Oct 17, 2024

I think it is useful but not sufficient in simpler open-source setups, where server logs can easily be lost or go unnoticed.

Maybe dstack could perform an additional API request to the backend to verify that the instance was terminated. If it wasn't, or if this request fails too, retry termination.

Similar verification requests could also be useful to avoid marking instances as terminated before they are actually fully terminated in the backend (currently dstack marks instances as terminated immediately after requesting termination, not after termination finishes). This could prevent issues like #1744 and improve observability.
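
A sketch of what such a verification step might look like; `backend.get_instance` and the status values are assumptions for illustration, not dstack's actual API:

```python
import requests


def finish_termination(instance, backend) -> bool:
    """Return True only once the backend confirms the instance is gone."""
    try:
        remote = backend.get_instance(instance.backend_id)  # hypothetical call
    except requests.exceptions.RequestException:
        # Cannot verify: keep the instance in "terminating" and retry later.
        return False
    if remote is None or remote.status in ("terminated", "deleted"):
        instance.status = "terminated"
        return True
    # The instance still exists in the backend: request termination again.
    backend.terminate_instance(instance.backend_id)
    return False
```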

@github-actions (bot) commented:

This issue is stale because it has been open for 30 days with no activity.

@github-actions (bot) commented:

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Dec 30, 2024
@peterschmidt85 (Contributor) commented:

@jvstme is this relevant/major?

@jvstme (Collaborator, Author) commented Dec 30, 2024

Yes, I just reproduced it with AWS (unintentionally, luckily I looked at the logs).

This issue looks major to me, as it is likely to lead to unnecessary charges when dstack-server is run without Sentry or other alerting tools.

Considering the opinions above, I can suggest the following:

  1. In case of instance termination errors, keep the instance in terminating and retry termination indefinitely.
  2. To prevent instances from getting stuck in terminating while they are actually terminated, do one or both of the following:
    • allow users to manually mark instances as terminated in UI or CLI;
    • make additional requests to the backend to determine if the instance still exists.

jvstme added the major label and removed the stale label Dec 30, 2024
@jvstme (Collaborator, Author) commented Jan 10, 2025

Actually, the simplest solution we could start with is to retry termination for 5-10 minutes and only give up and mark the instance as terminated if none of the attempts succeed. While not ideal, this would cover many cases, such as short-term network or backend outages.
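
A minimal sketch of that idea, assuming the instance stores a hypothetical `termination_started_at` timestamp and a periodic background task calls this function (timings and names are illustrative):

```python
from datetime import datetime, timedelta, timezone

TERMINATION_RETRY_WINDOW = timedelta(minutes=10)


def process_terminating_instance(instance, backend):
    now = datetime.now(timezone.utc)
    if instance.termination_started_at is None:
        instance.termination_started_at = now
    try:
        backend.terminate_instance(instance.backend_id)
    except Exception:
        if now - instance.termination_started_at < TERMINATION_RETRY_WINDOW:
            # Stay in "terminating"; the background task will retry this call.
            return
        # Give up after the retry window to avoid instances stuck in
        # "terminating" forever, even though termination may not have succeeded.
    instance.status = "terminated"
```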

It is also necessary for Vultr, since Vultr's termination API consistently fails during some intermediate instance states; see this comment.

jvstme self-assigned this Jan 13, 2025
peterschmidt85 changed the title from "[Bug]: dstack marks instance as terminated without terminating it" to "[Bug]: Show instances as terminating until they are fully terminated" Jan 13, 2025
@r4victor (Collaborator) commented:

@jvstme Shouldn't it be closed by #2190?

jvstme changed the title from "[Bug]: Show instances as terminating until they are fully terminated" back to "[Bug]: dstack marks instance as terminated without terminating it" Jan 21, 2025
@jvstme (Collaborator, Author) commented Jan 21, 2025

#2190 adds termination retries, which is enough to handle network or backend outages that don't last longer than ~15 minutes.

jvstme closed this as completed Jan 21, 2025