Add Vultr Support #2132
Conversation
src/dstack/_internal/server/migrations/versions/27d3e55759fa_add_pools.py
@jvstme Comments about the Vultr VPC (Virtual Private Cloud) implementation.
Important:
In the above functions, I am checking for specific strings in the error message, such as "Invalid instance-id" or "plan does not support private networking". This could break if Vultr ever changes the error messages. I am implementing these checks so that only the specific errors are handled within these functions and all others are raised. Please suggest, thank you!
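For context, a minimal sketch of the kind of check described above. The client method, exception type, and function name here are hypothetical, not the actual names used in this PR:

```python
class VultrApiError(Exception):
    """Hypothetical error type raised by a Vultr API client (an assumption)."""


def attach_instance_to_vpc(api_client, instance_id: str, vpc_id: str) -> None:
    try:
        api_client.attach_vpc(instance_id, vpc_id)  # hypothetical client method
    except VultrApiError as e:
        message = str(e)
        if "Invalid instance-id" in message:
            # The instance no longer exists; nothing to attach.
            return
        if "plan does not support private networking" in message:
            # The plan cannot join a VPC; skip instead of failing.
            return
        # Anything else is unexpected and must propagate.
        raise
```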
src/dstack/_internal/server/background/tasks/process_instances.py
src/dstack/_internal/server/services/backends/configurators/vultr.py
@Bihan, I've just published gpuhunt 0.0.18 with Vultr support. It'd be great if you could update the gpuhunt version in
Tested. It is working.
Thanks. Then please push the updated
@Bihan, I've tested the PR locally and everything works really well! I've only encountered a couple of small issues with some edge cases, please see my comments.
We are almost ready to merge the PR, so it'd also be great if you could resolve the merge conflict in setup.py by rebasing onto master or merging master into the PR branch.
if backend_type == BackendType.VULTR and instance_type_name.startswith("vbm"):
    return timedelta(seconds=1800)
30 minutes is a huge timeout already, but apparently it's not enough - in my testing, some bare metal instances took almost 40 minutes to start. And I haven't tested H100 and MI300X yet, so maybe some instances will take even longer. Additionally, the timeout (at least the one in process_running_jobs) should account for pulling the Docker image, which can also take a few minutes for large images.
Not sure if we can do anything about it other than increase the timeout even more. It's important to give the instance enough time to start; otherwise dstack will terminate it, but the user will still be charged for the wasted provisioning time.
I'd suggest 55 minutes, not 60, to avoid being charged for the second hour, since Vultr charges hourly.
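If that suggestion is adopted, the change to the snippet above could look roughly like this. The wrapper function, the default value, and the import path for BackendType are illustrative assumptions, not the PR's actual code:

```python
from datetime import timedelta

from dstack._internal.core.models.backends.base import BackendType  # assumed import path


def get_provisioning_timeout(backend_type: BackendType, instance_type_name: str) -> timedelta:
    # Illustrative wrapper around the check shown in the diff above.
    if backend_type == BackendType.VULTR and instance_type_name.startswith("vbm"):
        # Vultr bare metal can take 40+ minutes to start; stay just under one
        # hour because Vultr bills hourly.
        return timedelta(minutes=55)
    # Hypothetical default for all other cases.
    return timedelta(minutes=20)
```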
@jvstme Just thinking out loud. Should we provide a warning to users when provisioning bare metal? Maybe we can warn the client in the logs that "Vultr might take more than 30 mins for setup".
Yes, I think a message in dstack apply would be informative, although this seems a bit tricky to implement at the moment. In general, we don't know which instance will be assigned to the job before provisioning starts; it may end up not being bare metal or Vultr. And even after the instance is assigned, the CLI can't tell if it is actually being provisioned or just reused from an existing fleet.
src/dstack/_internal/server/background/tasks/process_running_jobs.py
Add vultr in backend_types test
This reverts commit 16c3098. The changes are being moved to a separate branch and PR.
Update supported_instance with regex match
@Bihan, thank you for the update, the PR looks good to me.
I found one more issue though - quite often instance termination fails if you cancel the provisioning with Ctrl+C shortly after it has started:
[22:32:37] ERROR dstack._internal.server.background.tasks.process_instances:801 Failed to terminate instance cloud-0: BackendError('{"error":"Unable to destroy server: This subscription is not currently active, you cannot destroy it.","status":500}')
... or shortly after it has finished:
[22:28:07] ERROR dstack._internal.server.background.tasks.process_instances:801 Failed to terminate instance cloud-0: BackendError('{"error":"Unable to destroy server: Unable to remove VM: Server is currently locked","status":500}')
This is pretty critical, since dstack then forgets about the instance while it keeps running in Vultr, possibly causing unexpected charges for the user.
However, I think this should be fixed by adding termination retries in the dstack server code, not in the backend implementation. So I'd suggest addressing this issue separately as part of #1551. We'll discuss this with the rest of the team on Monday.
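For illustration, a rough sketch of what such retries might look like. This is not the actual #1551 implementation; in the real server the retry would more likely be re-scheduled across background task runs rather than blocking in a loop:

```python
import logging
import time

logger = logging.getLogger(__name__)


def terminate_with_retries(terminate, attempts: int = 5, delay_seconds: float = 30.0) -> None:
    """Retry a termination callable a few times before giving up.

    Transient Vultr errors such as "subscription is not currently active" or
    "Server is currently locked" usually resolve after a short wait.
    """
    for attempt in range(1, attempts + 1):
        try:
            terminate()
            return
        except Exception as e:  # in real code, catch the backend-specific error type
            if attempt == attempts:
                raise
            logger.warning("Termination attempt %d failed: %s; retrying", attempt, e)
            time.sleep(delay_seconds)
```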
Will add termination retries in #1551
No description provided.