Add Vultr Support #2132

Merged: 7 commits merged into dstackai:master on Jan 13, 2025
Conversation

Bihan (Collaborator) commented Dec 23, 2024

No description provided.

Bihan requested review from jvstme and r4victor and removed the request for jvstme on December 23, 2024 13:02
Bihan (Collaborator, Author) commented Jan 6, 2025

@jvstme, here are some comments about the Vultr VPC (Virtual Private Cloud) implementation:

  1. VPC Lifecycle During Instance Creation
    Vultr allows creating and attaching a Virtual Private Cloud (VPC) during instance creation. However, if instance creation fails due to lack of capacity, the created VPC is not automatically terminated. To ensure proper cleanup and resource management, the implementation first creates the instance and, only on success, creates and attaches the VPC.

  2. Retry Logic for Attaching VPC
    Vultr's API for attaching a VPC to an instance does not always succeed on the first attempt. To handle this limitation, a retry mechanism is implemented: a while loop repeatedly attempts the VPC attachment until it succeeds.

  3. Unsupported VPC for Some Bare Metal Instances
    Certain Bare Metal plans on Vultr do not support VPC attachment. In such cases, the attachment fails with an error. As a fallback, the instance is created without a VPC, the already-created VPC is deleted, and the user is notified via a warning log: "Plan does not support private networking."

  4. Workaround for Unreliable VPC Detachment API
    Vultr's API for detaching VPCs from instances is unreliable; detachment sometimes fails even after 15 minutes of retries. To mitigate this, the implementation checks that the instance has terminated successfully before deleting the associated VPC, since Vultr does not allow VPC deletion without either detaching it or terminating the instance (see the sketch after this list).
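
To illustrate points 1 and 4, here is a minimal sketch of the intended ordering. The helper names create_vultr_instance, create_vpc, terminate_instance, and instance_exists are placeholders for illustration, not the actual method names used in this PR:

import time
from typing import Optional

def provision(self, plan_type: str) -> str:
    # Point 1: create the instance first; if creation fails (e.g. no capacity),
    # no VPC has been created yet, so nothing is leaked.
    instance_id = self.api_client.create_vultr_instance(plan_type)  # hypothetical helper
    # Only after the instance exists, create the VPC and attach it
    # (attachment is retried, see create_instance() further below).
    vpc_id = self.api_client.create_vpc()  # hypothetical helper
    ...
    return instance_id

def terminate(self, instance_id: str, vpc_id: Optional[str], plan_type: str):
    # Point 4: delete the instance, then wait until Vultr no longer reports it,
    # and only then delete the VPC, because detaching the VPC directly is unreliable.
    self.api_client.terminate_instance(instance_id, plan_type)  # hypothetical helper
    while self.api_client.instance_exists(instance_id, plan_type):  # hypothetical helper
        time.sleep(1)
    if vpc_id is not None:
        self.api_client.delete_vpc(vpc_id)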

Important:
To handle the above limitations and constraints, I have implemented hardcoded error-message handling, as shown below:

def get_vpc_id(self, instance_id: str, plan_type: str) -> Optional[str]:
    try:
        if plan_type == "bare-metal":
            response = self._make_request("GET", f"/bare-metals/{instance_id}/vpcs")
        else:
            response = self._make_request("GET", f"/instances/{instance_id}/vpcs")

        vpcs = response.json().get("vpcs", [])
        if vpcs:
            return vpcs[0].get("id")
        return None
    except BackendError as e:
        # These endpoints return 404 when the instance is not available.
        # Propagating the error would block execution, so it is handled here.
        if "Invalid instance-id" in str(e):
            logger.info(f"Instance {instance_id} not found.")
            return None
        else:
            raise

def attach_vpc(self, vpc_id: str, instance_id: str, plan_type: str):
    data = {"vpc_id": vpc_id}
    try:
        if plan_type == "bare-metal":
            self._make_request("POST", f"/bare-metals/{instance_id}/vpcs/attach", data=data)
        else:
            self._make_request("POST", f"/instances/{instance_id}/vpcs/attach", data=data)
    except BackendError as e:
        # Sometimes attaching the VPC fails with a 500 error on the first attempts.
        if "Unable to attach VPC" in str(e):
            logger.info("Unable to attach VPC")
        else:
            raise

def delete_vpc(self, vpc_id: str):
    try:
        self._make_request("DELETE", f"/vpcs/{vpc_id}")
    except BackendError as e:
        if "The following servers are attached to this VPC network" in str(e):
            logger.info(f"Instance is connected to VPC {vpc_id}. Unable to delete VPC.")
            return None
        else:
            raise

def create_instance(self):
    ...
    ...
    while not self.api_client.get_vpc_id(instance_id, plan_type) and vpc_id is not None:
        time.sleep(1)
        # Vultr limitation: the VPC usually cannot be attached on the first attempt,
        # so keep retrying until get_vpc_id() reports the attachment.
        logger.info("Attempting to attach instance to VPC")
        try:
            self.api_client.attach_vpc(vpc_id, instance_id, plan_type)
        except BackendError as e:
            if "plan does not support private networking" in str(e):
                logger.warning("Plan does not support private networking.")
                # Fall back to no VPC: delete the created VPC.
                self.api_client.delete_vpc(vpc_id=vpc_id)
                vpc_id = None
            else:
                raise

In the above functions, I am checking for specific strings in the error message, such as "Invalid instance-id" or "plan does not support private networking". This is brittle and may break if Vultr ever changes its error messages.

I implemented these checks so that only the specific, expected errors are handled inside these functions and everything else is re-raised.

Please suggest a better approach if you see one (one possible direction is sketched below). Thank you!
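
For reference, a minimal sketch of one way to keep the matched error fragments in a single place, so a message change only needs to be updated once. This is just an idea, not part of the PR; is_error is a hypothetical helper:

VULTR_INVALID_INSTANCE_ERR = "Invalid instance-id"
VULTR_ATTACH_VPC_ERR = "Unable to attach VPC"
VULTR_VPC_IN_USE_ERR = "The following servers are attached to this VPC network"
VULTR_NO_PRIVATE_NETWORKING_ERR = "plan does not support private networking"

def is_error(e: BaseException, fragment: str) -> bool:
    # Centralized matching of Vultr error messages against known fragments.
    return fragment in str(e)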

Bihan requested a review from jvstme on January 6, 2025 10:17
jvstme (Collaborator) commented Jan 8, 2025

@Bihan, I've just published gpuhunt 0.0.18 with Vultr support. It'd be great if you could update the gpuhunt version in setup.py and test this PR with the new version.
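
A minimal sketch of the kind of change requested, assuming gpuhunt is listed in install_requires in setup.py (the exact dependency specifier in the repository may differ):

# setup.py (excerpt) -- the specifier below is an assumption for illustration
install_requires = [
    ...,
    "gpuhunt>=0.0.18",  # bump to pick up Vultr support
]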

Bihan (Collaborator, Author) commented Jan 8, 2025

> @Bihan, I've just published gpuhunt 0.0.18 with Vultr support. It'd be great if you could update the gpuhunt version in setup.py and test this PR with the new version.

Tested. It is working.

jvstme (Collaborator) commented Jan 8, 2025

Thanks. Then please push the updated setup.py to this PR later, together with the rest of the changes.

jvstme (Collaborator) left a comment
@Bihan, I've tested the PR locally and everything works really well! I've only encountered a couple of small issues with some edge cases; please see my comments.

We are almost ready to merge the PR, so it'd also be great if you could resolve the merge conflict in setup.py by rebasing onto master or merging master into the PR branch.

Review thread on src/dstack/_internal/core/backends/vultr/compute.py (outdated, resolved)
Comment on lines 914 to 915
if backend_type == BackendType.VULTR and instance_type_name.startswith("vbm"):
return timedelta(seconds=1800)
jvstme (Collaborator):
30 minutes is already a huge timeout, but apparently it's not enough: in my testing, some bare metal instances took almost 40 minutes to start. And I haven't tested H100 and MI300X yet, so some instances may take even longer. Additionally, the timeout (at least the one in process_running_jobs) should account for pulling the Docker image, which can also take a few minutes for large images.

I'm not sure we can do anything about it other than increase the timeout even more. It's important to give the instance enough time to start; otherwise dstack will terminate it, but the user will still be charged for the wasted provisioning time.

I can suggest 55 minutes. Not 60, to avoid being charged for the second hour, since Vultr charges hourly.
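
A minimal sketch of the suggested change to the snippet quoted above (only this fragment is shown in the thread, so the surrounding function is assumed):

if backend_type == BackendType.VULTR and instance_type_name.startswith("vbm"):
    # 55 minutes: enough headroom for slow bare metal provisioning and image
    # pulling, while staying under the hourly billing boundary.
    return timedelta(minutes=55)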

Bihan (Collaborator, Author):
@jvstme Just thinking out loud: should we provide a warning to users when provisioning bare metal instances? Maybe we can warn the client in the logs that "Vultr might take more than 30 mins for setup".

jvstme (Collaborator):

Yes, I think a message in dstack apply would be informative, although this seems a bit tricky to implement at the moment. In general, we don't know which instance will be assigned to the job before provisioning starts; it may end up not being bare metal or Vultr. And even after the instance is assigned, the CLI can't tell whether it is actually being provisioned or just reused from an existing fleet.

Review thread on src/dstack/_internal/core/backends/vultr/api_client.py (outdated, resolved)
jvstme (Collaborator) left a comment

@Bihan, thank you for the update; the PR looks good to me.

I found one more issue though - quite often instance termination fails if you cancel the provisioning with Ctrl+C shortly after it has started:

[22:32:37] ERROR    dstack._internal.server.background.tasks.process_instances:801 Failed to terminate instance cloud-0: BackendError('{"error":"Unable to destroy server: This subscription is not currently active, you cannot destroy it.","status":500}')

... or shortly after it has finished:

[22:28:07] ERROR    dstack._internal.server.background.tasks.process_instances:801 Failed to terminate instance cloud-0: BackendError('{"error":"Unable to destroy server: Unable to remove VM: Server is currently locked","status":500}')

This is pretty critical, since dstack then forgets about the instance while it keeps running in Vultr, possibly causing unexpected charges for the user.

However, I think this should be fixed by adding termination retries in the dstack server code, not in the backend implementation, so I'd suggest addressing this issue separately as part of #1551. We'll discuss this with the rest of the team on Monday.
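
For context, a minimal sketch of what termination retries in the server code could look like. This is only an illustration under assumptions: the function name, attempt count, and interval are hypothetical, and the actual fix is planned separately in #1551. BackendError is the same exception type as in the snippets above:

import time

TERMINATION_ATTEMPTS = 10           # hypothetical value
TERMINATION_RETRY_INTERVAL = 30     # seconds, hypothetical value

def terminate_with_retries(terminate_instance, instance_id: str):
    # Retry termination, since Vultr may refuse to destroy an instance that is
    # still initializing ("subscription is not currently active") or locked.
    last_error = None
    for _ in range(TERMINATION_ATTEMPTS):
        try:
            terminate_instance(instance_id)
            return
        except BackendError as e:
            last_error = e
            time.sleep(TERMINATION_RETRY_INTERVAL)
    raise last_error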

jvstme (Collaborator) commented Jan 13, 2025

Will add termination retries in #1551

jvstme merged commit e41a29a into dstackai:master on Jan 13, 2025
24 checks passed