
[DO NOT MERGE] AIoEKS Blueprint Consolidation #751

Draft: wants to merge 20 commits into main
Conversation

@omrishiv (Collaborator) commented Feb 12, 2025

What does this PR do?

This PR lays out the infrastructure foundation for AIoEKS (AI on EKS). It aims to create a single infrastructure deployment that can be customized for different use cases, allowing advanced use of the AI environment while also highlighting purpose-built blueprints.

Motivation

The current approach to blueprints produces very isolated environments that each showcase a single task: deploy model X on EKS, deploy MLflow, deploy JupyterHub, etc. This is nice for isolation, but it creates maintainability issues, as each blueprint needs to be updated whenever addons or the underlying infrastructure change.

This PR aims to consolidate the core infrastructure and addons of all of the DoEKS blueprints and set the foundation for a configurable AI/ML environment based on needs and best practices. This will increase maintainability, allow for better customization, and make it easier to add functionality.

Contributing

We need help retesting the existing blueprints and deployments to make sure they work in the current environment.

  • BioNeMo
  • EMR Spark RAPIDS
  • Ray
  • Ray HA using ElastiCache
  • Trainium
  • JupyterHub
  • JARK stack

If you are interested in helping, please reach out before you start so we can make sure no one else is working on it.

To contribute, please branch off this branch in your fork and open PRs against it. We will merge work into this branch as it is validated, and then merge the branch in its entirety back into DoEKS.

More

  • Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
  • Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
  • Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
  • Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

  • E2E tests successfully completed before merge?

Additional Notes

Changelog

Combined all AI/ML blueprints into one infrastructure:

  • JARK
  • Inferentia/Trainium
  • FSx driver + FSx volume
  • MLflow
  • All addons are now toggleable via variables (see the sketch after this changelog)

Fixed:

  • GPU-only pods now schedule only on GPU nodes
  • Accelerator nodes are now labeled with their accelerator (neuron/nvidia)
  • Removed the LoadBalancer from Argo Workflows
  • EFS now uses the efs-csi-driver instead of the NFS tool (which is broken on Bottlerocket)

Added:

  • neuron-monitor
  • DCGM
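
As a sketch of the addon toggle pattern (illustrative only; the exact variable names and wiring in this PR may differ, though enable_kuberay_operator does appear in a blueprint snippet quoted later in this review), each addon gets a boolean input that gates its deployment:

# Illustrative toggle variable; name follows the enable_* convention.
variable "enable_kuberay_operator" {
  description = "Deploy the KubeRay operator addon"
  type        = bool
  default     = true
}

# Illustrative gating of the addon; not the PR's actual wiring.
resource "helm_release" "kuberay_operator" {
  count = var.enable_kuberay_operator ? 1 : 0

  name       = "kuberay-operator"
  repository = "https://ray-project.github.io/kuberay-helm/"
  chart      = "kuberay-operator"
}

A blueprint then only needs to set the enable_* flags it cares about in its variables or tfvars.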

@omrishiv (Collaborator, Author) commented Feb 12, 2025

Addresses #720, #729, #727.

@omrishiv mentioned this pull request on Feb 14, 2025.
@@ -7,24 +7,25 @@
#---------------------------------------------------------------
# NOTE: FSx for Lustre file system creation can take up to 10 mins
resource "aws_fsx_lustre_file_system" "this" {
count = var.deploy_fsx_volume ? 1 : 0


I recommend using for_each rather than count; it's more flexible.

omrishiv (Collaborator, Author):

We tend to use count here as a way to decide whether or not to deploy a resource. Is there more benefit to using for_each for this? I'm open to it, just following the convention used here.

@namejsjeongkr commented Feb 19, 2025

Both for_each and count can be used to conditionally create resources, but for_each has the following advantages over count (see the sketch after this list):

  • Resource Identification: for_each identifies each resource instance with a unique key. This makes the lifecycle management of resources more predictable and stable.

  • Minimizing Impact During Changes: When input variables change, for_each only modifies the specific resources affected, preventing unnecessary recreation of resources.

  • Multiple Resource Management: If you need to manage multiple storage_classes in the future, using for_each allows for easy scalability without code changes.
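
A minimal sketch of the difference, using the deploy_fsx_volume toggle from the diff above (either form would be used on its own; the "fsx" key is just an illustrative name, not something from this PR):

# With count, the single instance is addressed by index:
#   aws_fsx_lustre_file_system.this[0]
resource "aws_fsx_lustre_file_system" "this" {
  count = var.deploy_fsx_volume ? 1 : 0
  # ... existing arguments unchanged ...
}

# With for_each, the instance is addressed by a stable key:
#   aws_fsx_lustre_file_system.this["fsx"]
resource "aws_fsx_lustre_file_system" "this" {
  for_each = var.deploy_fsx_volume ? toset(["fsx"]) : toset([])
  # ... existing arguments unchanged ...
}

Because the address is keyed rather than positional, adding more file systems later only means adding keys; existing instances keep their addresses and are not recreated.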

omrishiv (Collaborator, Author):

Thank you for highlighting all of these. I was following convention, but I do see some prior art here for using for_each; I'll switch over to that.


Thank you for accepting my recommendation!

Comment on lines 252 to 254
# userPods:
# nodeAffinity:
# matchNodePurpose: require # This will force single-user pods to use an specific karpenter provisioner


Is this going to be unused as well?

omrishiv (Collaborator, Author):

We will use resource requests as the mechanism for binding workloads to specific node types (GPU/Neuron/CPU). If we need finer-grained control, we can change that.

@askulkarni2 (Collaborator) left a comment

@omrishiv great stuff! Thank you for pushing this through. Some comments based on a first pass.

@@ -7,24 +7,25 @@
#---------------------------------------------------------------
Collaborator:

Recommend we use the Terraform module here; see the linked example.

@@ -0,0 +1,66 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
Collaborator:

Can we use the Helm chart instead?

@@ -0,0 +1,42 @@
apiVersion: apps/v1

omrishiv (Collaborator, Author):

I don't believe we can do this. We use taints on the nodes to make sure non-Neuron workloads don't run on them, and the DaemonSet from the Neuron SDK does not have the toleration. That said, neither does ours; I will add it.

Comment on lines +116 to +121
variable "huggingface_token" {
description = "Hugging Face Secret Token"
type = string
default = "DUMMY_TOKEN_REPLACE_ME"
sensitive = true
}
Collaborator:

Suggested change:
  variable "huggingface_token" {
    description = "Hugging Face Secret Token"
    type        = string
    default     = "DUMMY_TOKEN_REPLACE_ME"
-   sensitive   = true
+   ephemeral   = true
  }
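
For context on this suggestion: sensitive only redacts the value from CLI output (it can still end up in the plan and state files), whereas ephemeral input variables, introduced in Terraform 1.10, are not persisted to the plan or state.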

@@ -0,0 +1,6 @@
name = "jark-stack"
Collaborator:

Do we need a separate JupyterHub?

@omrishiv (Collaborator, Author) commented Feb 20, 2025

The reason to potentially keep JupyterHub is to demonstrate the different auth mechanisms, if that's relevant. However, I think we can remove trainium-inferentia, nvidia-triton-server (we need to move the NIM deployment into the infrastructure), and ray. The examples from those blueprints can then be moved under the current gen-ai folder, which should be renamed to ai-ml, and we can highlight them through the website.

@@ -0,0 +1,6 @@
name = "jark-stack"
Collaborator:

Suggested change:
- name = "jark-stack"
+ name = "jupyterhub"

Comment on lines 5 to 6
enable_volcano = true
enable_kuberay_operator = true
Collaborator:

Are these needed?
