Add Vultr Support #2132

Merged: 7 commits merged into dstackai:master on Jan 13, 2025
Conversation

Bihan (Collaborator) commented Dec 23, 2024

No description provided.

Bihan requested review from jvstme and r4victor and removed the request for jvstme on December 23, 2024 13:02
Bihan (Collaborator, Author) commented Jan 6, 2025

@jvstme, here are some comments about the Vultr VPC (Virtual Private Cloud) implementation:

  1. VPC Lifecycle During Instance Creation
    Vultr allows creating and attaching a Virtual Private Cloud (VPC) during instance creation. However, if instance creation fails due to lack of capacity, the created VPC is not automatically terminated. To ensure proper cleanup and resource management, the implementation first creates the instance and, only on success, creates and attaches the VPC.

  2. Retry Logic for Attaching VPC
    Vultr's API for attaching a VPC to an instance does not always succeed on the first attempt. To handle this limitation, a retry mechanism is implemented: a while loop repeatedly attempts the VPC attachment until it succeeds.

  3. Unsupported VPC for Some Bare Metal Instances
    Certain Bare Metal plans on Vultr do not support VPC attachment. In such cases, the attachment fails with an error. As a fallback, the instance is created without a VPC, the already-created VPC is deleted, and the user is notified via a warning log: "Plan does not support private networking."

  4. Workaround for Unreliable VPC Detachment API
    Vultr's API for detaching VPCs from instances is unreliable; detachment sometimes fails even after 15 minutes of retries. To mitigate this, the implementation checks that the instance has terminated successfully before deleting the associated VPC, since Vultr does not allow VPC deletion without either detaching it or terminating the instance (see the sketch after this list).
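
To illustrate points 1 and 4, here is a minimal sketch of the intended ordering. The helper names create_vultr_instance, create_vpc, terminate_instance, and instance_exists are placeholders for illustration, not the actual method names used in this PR:

import time
from typing import Optional

def provision(self, plan_type: str) -> str:
    # Point 1: create the instance first; if creation fails (e.g. no capacity),
    # no VPC has been created yet, so nothing is leaked.
    instance_id = self.api_client.create_vultr_instance(plan_type)  # hypothetical helper
    # Only after the instance exists, create the VPC and attach it
    # (attachment is retried, see create_instance() further below).
    vpc_id = self.api_client.create_vpc()  # hypothetical helper
    ...
    return instance_id

def terminate(self, instance_id: str, vpc_id: Optional[str], plan_type: str):
    # Point 4: delete the instance, then wait until Vultr no longer reports it,
    # and only then delete the VPC, because detaching the VPC directly is unreliable.
    self.api_client.terminate_instance(instance_id, plan_type)  # hypothetical helper
    while self.api_client.instance_exists(instance_id, plan_type):  # hypothetical helper
        time.sleep(1)
    if vpc_id is not None:
        self.api_client.delete_vpc(vpc_id)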

Important:
To handle the above limitations and constraints, I have implemented hardcoded error-message handling, as shown below:

def get_vpc_id(self, instance_id: str, plan_type: str) -> Optional[str]:
    try:
        if plan_type == "bare-metal":
            response = self._make_request("GET", f"/bare-metals/{instance_id}/vpcs")
        else:
            response = self._make_request("GET", f"/instances/{instance_id}/vpcs")

        vpcs = response.json().get("vpcs", [])
        if vpcs:
            return vpcs[0].get("id")
        return None
    except BackendError as e:
        # These endpoints return 404 when the instance is not available.
        # Propagating the error would block execution, so it is handled here.
        if "Invalid instance-id" in str(e):
            logger.info(f"Instance {instance_id} not found.")
            return None
        else:
            raise

def attach_vpc(self, vpc_id: str, instance_id: str, plan_type: str):
    data = {"vpc_id": vpc_id}
    try:
        if plan_type == "bare-metal":
            self._make_request("POST", f"/bare-metals/{instance_id}/vpcs/attach", data=data)
        else:
            self._make_request("POST", f"/instances/{instance_id}/vpcs/attach", data=data)
    except BackendError as e:
        # Sometimes attaching the VPC fails with a 500 error on the first attempts.
        if "Unable to attach VPC" in str(e):
            logger.info("Unable to attach VPC")
        else:
            raise

def delete_vpc(self, vpc_id: str):
    try:
        self._make_request("DELETE", f"/vpcs/{vpc_id}")
    except BackendError as e:
        if "The following servers are attached to this VPC network" in str(e):
            logger.info(f"Instance is connected to VPC {vpc_id}. Unable to delete VPC.")
            return None
        else:
            raise

def create_instance(self):
    ...
    ...
    while not self.api_client.get_vpc_id(instance_id, plan_type) and vpc_id is not None:
        time.sleep(1)
        # Vultr limitation: the VPC usually cannot be attached on the first attempt,
        # so keep retrying until get_vpc_id() reports the attachment.
        logger.info("Attempting to attach instance to VPC")
        try:
            self.api_client.attach_vpc(vpc_id, instance_id, plan_type)
        except BackendError as e:
            if "plan does not support private networking" in str(e):
                logger.warning("Plan does not support private networking.")
                # Fall back to no VPC: delete the created VPC.
                self.api_client.delete_vpc(vpc_id=vpc_id)
                vpc_id = None
            else:
                raise

In the above functions, I am checking for specific strings in the error message, such as "Invalid instance-id" or "plan does not support private networking". This is brittle and may break if Vultr ever changes its error messages.

I implemented these checks so that only the specific, expected errors are handled inside these functions and everything else is re-raised.

Please suggest a better approach if you see one (one possible direction is sketched below). Thank you!
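
For reference, a minimal sketch of one way to keep the matched error fragments in a single place, so a message change only needs to be updated once. This is just an idea, not part of the PR; is_error is a hypothetical helper:

VULTR_INVALID_INSTANCE_ERR = "Invalid instance-id"
VULTR_ATTACH_VPC_ERR = "Unable to attach VPC"
VULTR_VPC_IN_USE_ERR = "The following servers are attached to this VPC network"
VULTR_NO_PRIVATE_NETWORKING_ERR = "plan does not support private networking"

def is_error(e: BaseException, fragment: str) -> bool:
    # Centralized matching of Vultr error messages against known fragments.
    return fragment in str(e)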

Bihan requested a review from jvstme on January 6, 2025 10:17
jvstme (Collaborator) commented Jan 8, 2025

@Bihan, I've just published gpuhunt 0.0.18 with Vultr support. It'd be great if you could update the gpuhunt version in setup.py and test this PR with the new version.
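
A minimal sketch of the kind of change requested, assuming gpuhunt is listed in install_requires in setup.py (the exact dependency specifier in the repository may differ):

# setup.py (excerpt) -- the specifier below is an assumption for illustration
install_requires = [
    ...,
    "gpuhunt>=0.0.18",  # bump to pick up Vultr support
]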

Bihan (Collaborator, Author) commented Jan 8, 2025

> @Bihan, I've just published gpuhunt 0.0.18 with Vultr support. It'd be great if you could update the gpuhunt version in setup.py and test this PR with the new version.

Tested. It is working.

jvstme (Collaborator) commented Jan 8, 2025

Thanks. Then please push the updated setup.py to this PR later, together with the rest of the changes.

jvstme (Collaborator) left a comment
@Bihan, I've tested the PR locally and everything works really well! I've only encountered a couple of small issues with some edge cases; please see my comments.

We are almost ready to merge the PR, so it'd also be great if you could resolve the merge conflict in setup.py by rebasing onto master or merging master into the PR branch.

Review thread on src/dstack/_internal/core/backends/vultr/compute.py (outdated, resolved)
Comment on lines 914 to 915
if backend_type == BackendType.VULTR and instance_type_name.startswith("vbm"):
return timedelta(seconds=1800)
jvstme (Collaborator):
30 minutes is already a huge timeout, but apparently it's not enough: in my testing, some bare metal instances took almost 40 minutes to start. And I haven't tested H100 and MI300X yet, so some instances may take even longer. Additionally, the timeout (at least the one in process_running_jobs) should account for pulling the Docker image, which can also take a few minutes for large images.

I'm not sure we can do anything about it other than increase the timeout even more. It's important to give the instance enough time to start; otherwise dstack will terminate it, but the user will still be charged for the wasted provisioning time.

I can suggest 55 minutes. Not 60, to avoid being charged for the second hour, since Vultr charges hourly.
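
A minimal sketch of the suggested change to the snippet quoted above (only this fragment is shown in the thread, so the surrounding function is assumed):

if backend_type == BackendType.VULTR and instance_type_name.startswith("vbm"):
    # 55 minutes: enough headroom for slow bare metal provisioning and image
    # pulling, while staying under the hourly billing boundary.
    return timedelta(minutes=55)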

Bihan (Collaborator, Author):
@jvstme Just thinking out loud: should we provide a warning to users when provisioning bare metal instances? Maybe we can warn the client in the logs that "Vultr might take more than 30 mins for setup".

jvstme (Collaborator):

Yes, I think a message in dstack apply would be informative, although this seems a bit tricky to implement at the moment. In general, we don't know which instance will be assigned to the job before provisioning starts; it may end up not being bare metal or Vultr. And even after the instance is assigned, the CLI can't tell whether it is actually being provisioned or just reused from an existing fleet.

Review thread on src/dstack/_internal/core/backends/vultr/api_client.py (outdated, resolved)
jvstme (Collaborator) left a comment

@Bihan, thank you for the update; the PR looks good to me.

I found one more issue though - quite often instance termination fails if you cancel the provisioning with Ctrl+C shortly after it has started:

[22:32:37] ERROR    dstack._internal.server.background.tasks.process_instances:801 Failed to terminate instance cloud-0: BackendError('{"error":"Unable to destroy server: This subscription is not currently active, you cannot destroy it.","status":500}')

... or shortly after it has finished:

[22:28:07] ERROR    dstack._internal.server.background.tasks.process_instances:801 Failed to terminate instance cloud-0: BackendError('{"error":"Unable to destroy server: Unable to remove VM: Server is currently locked","status":500}')

This is pretty critical, since dstack then forgets about the instance while it keeps running in Vultr, possibly causing unexpected charges for the user.

However, I think this should be fixed by adding termination retries in the dstack server code, not in the backend implementation, so I'd suggest addressing this issue separately as part of #1551. We'll discuss this with the rest of the team on Monday.
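
For context, a minimal sketch of what termination retries in the server code could look like. This is only an illustration under assumptions: the function name, attempt count, and interval are hypothetical, and the actual fix is planned separately in #1551. BackendError is the same exception type as in the snippets above:

import time

TERMINATION_ATTEMPTS = 10           # hypothetical value
TERMINATION_RETRY_INTERVAL = 30     # seconds, hypothetical value

def terminate_with_retries(terminate_instance, instance_id: str):
    # Retry termination, since Vultr may refuse to destroy an instance that is
    # still initializing ("subscription is not currently active") or locked.
    last_error = None
    for _ in range(TERMINATION_ATTEMPTS):
        try:
            terminate_instance(instance_id)
            return
        except BackendError as e:
            last_error = e
            time.sleep(TERMINATION_RETRY_INTERVAL)
    raise last_error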

jvstme (Collaborator) commented Jan 13, 2025

Will add termination retries in #1551

jvstme merged commit e41a29a into dstackai:master on Jan 13, 2025
24 checks passed