
[DO NOT MERGE] AIoEKS Blueprint Consolidation #751

Draft: wants to merge 20 commits into main
Conversation

@omrishiv (Collaborator) commented Feb 12, 2025

What does this PR do?

This PR lays out the infrastructure foundation for AIoEKS (AI on EKS). It aims to create a single infrastructure deployment that can be customized for different use cases, allowing advanced use of the AI environment while also highlighting purpose-built blueprints.

Motivation

The current approach to blueprints produces very isolated environments that each showcase a single task: deploy model X on EKS, deploy MLflow, deploy JupyterHub, etc. This is nice for isolation, but it creates maintainability issues, as each blueprint needs to be updated whenever addons or the underlying infrastructure change.

This PR aims to consolidate the core infrastructure and addons of all of the DoEKS blueprints and set the foundation for a configurable AI/ML environment based on needs and best practices. This will increase maintainability, allow for better customization, and make it easier to add functionality.

Contributing

We need help retesting the existing blueprints and deployments to make sure they work in the current environment.

  • BioNeMo
  • EMR Spark RAPIDS
  • Ray
  • Ray HA using ElastiCache
  • Trainium
  • JupyterHub
  • JARK stack

If you are interested in helping, please reach out before you start so we can make sure no one else is working on it.

To contribute, please branch off this branch in your fork and open PRs against it. We will merge work into this branch as it is validated, and then merge the branch in its entirety back into DoEKS.

More

  • Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
  • Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
  • Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
  • Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

  • E2E tests successfully completed before merge?

Additional Notes

Changelog

Combined all AI/ML blueprints into one infrastructure:

  • JARK
  • Inferentia/Trainium
  • FSx driver + FSx volume
  • MLflow
  • All addons are now toggleable via variables (see the sketch after this changelog)

Fixed:

  • GPU-only pods now schedule only on GPU nodes
  • Accelerator nodes are now labeled with their accelerator (neuron/nvidia)
  • Removed the LoadBalancer from Argo Workflows
  • EFS now uses the efs-csi-driver instead of the NFS tool (which is broken on Bottlerocket)

Added:

  • neuron-monitor
  • DCGM
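
As a sketch of the addon toggle pattern (illustrative only; the exact variable names and wiring in this PR may differ, though enable_kuberay_operator does appear in a blueprint snippet quoted later in this review), each addon gets a boolean input that gates its deployment:

# Illustrative toggle variable; name follows the enable_* convention.
variable "enable_kuberay_operator" {
  description = "Deploy the KubeRay operator addon"
  type        = bool
  default     = true
}

# Illustrative gating of the addon; not the PR's actual wiring.
resource "helm_release" "kuberay_operator" {
  count = var.enable_kuberay_operator ? 1 : 0

  name       = "kuberay-operator"
  repository = "https://ray-project.github.io/kuberay-helm/"
  chart      = "kuberay-operator"
}

A blueprint then only needs to set the enable_* flags it cares about in its variables or tfvars.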

@omrishiv (Collaborator, Author) commented Feb 12, 2025

Addresses #720, #729, #727.

@omrishiv mentioned this pull request on Feb 14, 2025.
@@ -7,24 +7,25 @@
#---------------------------------------------------------------
# NOTE: FSx for Lustre file system creation can take up to 10 mins
resource "aws_fsx_lustre_file_system" "this" {
count = var.deploy_fsx_volume ? 1 : 0


I recommend using for_each rather than count; it's more flexible.

omrishiv (Collaborator, Author):

We tend to use count here as a way to decide whether or not to deploy a resource. Is there more benefit to using for_each for this? I'm open to it, just following the convention used here.

@namejsjeongkr commented Feb 19, 2025

Both for_each and count can be used to conditionally create resources, but for_each has the following advantages over count (see the sketch after this list):

  • Resource Identification: for_each identifies each resource instance with a unique key. This makes the lifecycle management of resources more predictable and stable.

  • Minimizing Impact During Changes: When input variables change, for_each only modifies the specific resources affected, preventing unnecessary recreation of resources.

  • Multiple Resource Management: If you need to manage multiple storage_classes in the future, using for_each allows for easy scalability without code changes.
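
A minimal sketch of the difference, using the deploy_fsx_volume toggle from the diff above (either form would be used on its own; the "fsx" key is just an illustrative name, not something from this PR):

# With count, the single instance is addressed by index:
#   aws_fsx_lustre_file_system.this[0]
resource "aws_fsx_lustre_file_system" "this" {
  count = var.deploy_fsx_volume ? 1 : 0
  # ... existing arguments unchanged ...
}

# With for_each, the instance is addressed by a stable key:
#   aws_fsx_lustre_file_system.this["fsx"]
resource "aws_fsx_lustre_file_system" "this" {
  for_each = var.deploy_fsx_volume ? toset(["fsx"]) : toset([])
  # ... existing arguments unchanged ...
}

Because the address is keyed rather than positional, adding more file systems later only means adding keys; existing instances keep their addresses and are not recreated.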

omrishiv (Collaborator, Author):

Thank you for highlighting all of these. I was following convention, but I do see some prior art here for using for_each; I'll switch over to that.


Thank you for accepting my recommendation!

Comment on lines 252 to 254
# userPods:
# nodeAffinity:
# matchNodePurpose: require # This will force single-user pods to use an specific karpenter provisioner


Is this going to be unused as well?

omrishiv (Collaborator, Author):

We will use resource requests as the mechanism for binding workloads to specific node types (GPU/Neuron/CPU). If we need finer-grained control, we can change that.

@askulkarni2 (Collaborator) left a comment

@omrishiv great stuff! Thank you for pushing this through. Some comments based on a first pass.

@@ -7,24 +7,25 @@
#---------------------------------------------------------------
Collaborator:

Recommend we use the Terraform module here; see the linked example.

@@ -0,0 +1,66 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
Collaborator:

Can we use the Helm chart instead?

@@ -0,0 +1,42 @@
apiVersion: apps/v1

omrishiv (Collaborator, Author):

I don't believe we can do this. We use taints on the nodes to make sure non-Neuron workloads don't run on them, and the DaemonSet from the Neuron SDK does not have the toleration. That said, neither does ours; I will add it.

Comment on lines +116 to +121
variable "huggingface_token" {
description = "Hugging Face Secret Token"
type = string
default = "DUMMY_TOKEN_REPLACE_ME"
sensitive = true
}
Collaborator:

Suggested change:
  variable "huggingface_token" {
    description = "Hugging Face Secret Token"
    type        = string
    default     = "DUMMY_TOKEN_REPLACE_ME"
-   sensitive   = true
+   ephemeral   = true
  }
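
For context on this suggestion: sensitive only redacts the value from CLI output (it can still end up in the plan and state files), whereas ephemeral input variables, introduced in Terraform 1.10, are not persisted to the plan or state.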

@@ -0,0 +1,6 @@
name = "jark-stack"
Collaborator:

Do we need a separate JupyterHub?

@omrishiv (Collaborator, Author) commented Feb 20, 2025

The reason to potentially keep JupyterHub is to demonstrate the different auth mechanisms, if that's relevant. However, I think we can remove trainium-inferentia, nvidia-triton-server (we need to move the NIM deployment into the infrastructure), and ray. The examples from those blueprints can then be moved under the current gen-ai folder, which should be renamed to ai-ml, and we can highlight them through the website.

@@ -0,0 +1,6 @@
name = "jark-stack"
Collaborator:

Suggested change:
- name = "jark-stack"
+ name = "jupyterhub"

Comment on lines 5 to 6
enable_volcano = true
enable_kuberay_operator = true
Collaborator:

Are these needed?
