[BUG] - AWS instance type not properly respected when gpu are enabled #2782

Open
viniciusdc opened this issue Oct 21, 2024 · 0 comments · May be fixed by #2787

Describe the bug

Since the latest release, when the #2604 changes were integrated, a bug was introduced due to a mismatch between how we currently load our schema and perform validation and how the stage files are rendered during deploy. Basically, in that PR we changed how the AMI types (AL2_x86_64_GPU, AL2_x86_64 and CUSTOM) are forwarded to their respective Terraform variables under the node_groups.

Right now, when using the following config block, for example:

amazon_web_services:
  ...
  node_groups:
    ...
    gpu-1x-t4:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
      gpu: true
profiles:
  jupyterlab:
    - display_name: G4 GPU Instance 1x
      description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.9.1
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                sizeLimit: "2Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        node_selector:
          "dedicated": "gpu-1x-t4"

The expected behavior would be for an instance with a GPU to be spawned and assigned to the user's pod. Right now, though, the instance is correctly scaled up but its AMI type is wrongly defaulted to `AL2_x86_64`, which results in the incorrect AMI being assigned to the instance, and the NVIDIA drivers expected to be installed by the daemonset never get triggered.

The problem arises from this part of our code:

class AWSNodeGroupInputVars(schema.Base):
    name: str
    instance_type: str
    gpu: bool = False
    min_size: int
    desired_size: int
    max_size: int
    single_subnet: bool
    permissions_boundary: Optional[str] = None
    ami_type: Optional[AWSAmiTypes] = None
    launch_template: Optional[AWSNodeLaunchTemplate] = None

    @field_validator("ami_type", mode="before")
    @classmethod
    def _infer_and_validate_ami_type(cls, value, values) -> str:
        gpu_enabled = values.get("gpu", False)

        # Auto-set ami_type if not provided
        if not value:
            if values.get("launch_template") and values["launch_template"].ami_id:
                return "CUSTOM"
            if gpu_enabled:
                return "AL2_x86_64_GPU"
            return "AL2_x86_64"

        # Explicit validation
        if value == "AL2_x86_64" and gpu_enabled:
            raise ValueError(
                "ami_type 'AL2_x86_64' cannot be used with GPU enabled (gpu=True)."
            )
        return value
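
For context, here is a minimal sketch (not the Nebari code) of why cross-field inference inside a Pydantic v2 `before` validator is fragile: the validator only sees previously declared fields through `info.data`, and unless the field is declared with `validate_default=True` it does not run at all when the field is omitted, which is exactly the "auto-set" path above:

from pydantic import BaseModel, Field, ValidationInfo, field_validator

class Demo(BaseModel):
    gpu: bool = False  # declared before ami_type, so visible via info.data
    # Without validate_default=True the validator below is skipped
    # entirely whenever ami_type is omitted from the input.
    ami_type: str = Field("", validate_default=True)

    @field_validator("ami_type", mode="before")
    @classmethod
    def infer(cls, value: str, info: ValidationInfo) -> str:
        # info.data is a plain dict of the fields validated so far;
        # the ValidationInfo object itself has no dict-style .get().
        return value or ("AL2_x86_64_GPU" if info.data.get("gpu") else "AL2_x86_64")

print(Demo(gpu=True).ami_type)  # -> AL2_x86_64_GPU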

I suggest we remove the "dynamic" handling of the AMI type from the Pydantic validator and instead use a custom function to handle the logic at runtime, for example:

from typing import Dict, Optional


def construct_aws_ami_type(
    gpu_enabled: bool, launch_template: Dict, ami_type: Optional[str] = None
) -> str:
    """Construct the AWS AMI type based on the provided parameters."""
    if ami_type:
        return ami_type

    if launch_template and launch_template.get("ami_id"):
        return "CUSTOM"

    if gpu_enabled:
        return "AL2_x86_64_GPU"

    return "AL2_x86_64"

There is also a need to change the current Enum object, as it is not properly serializable right now:

class AWSAmiTypes(str, enum.Enum):
    AL2_x86_64 = "AL2_x86_64"
    AL2_x86_64_GPU = "AL2_x86_64_GPU"
    CUSTOM = "CUSTOM"

Expected behavior

GPU instances should scale properly, and their NVIDIA drivers should be installed correctly as well.

OS and architecture in which you are running Nebari

Linux

How to Reproduce the problem?

Run an AWS deployment that requires a GPU profile. The bug was introduced in the latest release (2024.9.1).
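
Assuming the standard CLI flow, deploying with a config that contains a gpu: true node group (like the block above) is enough to trigger it:

nebari deploy -c nebari-config.yaml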

Command output

No response

Versions and dependencies used.

No response

Compute environment

AWS

Integrations

No response

Anything else?

No response

viniciusdc added the type: bug 🐛, good first issue, provider: AWS, needs: PR 📬, impact: medium 🟨 and area: schema labels (and removed needs: triage 🚦) on Oct 21, 2024
viniciusdc linked pull request #2787 on Oct 22, 2024 (may close this issue)
viniciusdc added this to the 2024.9.2 milestone on Oct 24, 2024