[DO NOT MERGE] AIoEKS Blueprint Consolidation #751
base: main
Conversation
Signed-off-by: omrishiv <[email protected]>
@@ -7,24 +7,25 @@
#---------------------------------------------------------------
# NOTE: FSx for Lustre file system creation can take up to 10 mins
resource "aws_fsx_lustre_file_system" "this" {
  count = var.deploy_fsx_volume ? 1 : 0
I recommend using for_each rather than count; it's more flexible.
We tend to use it here as a way to decide whether or not to deploy a resource. Is there more benefit to using a for_each for this? I'm open to it, just following the convention used here.
It is true that both for_each and count can be used to conditionally create resources, but for_each has the following advantages over count (see the sketch after this list):
- Resource identification: for_each identifies each resource instance with a unique key, which makes resource lifecycle management more predictable and stable.
- Minimizing impact during changes: when input variables change, for_each only modifies the specific resources affected, preventing unnecessary recreation of resources.
- Multiple resource management: if you need to manage multiple storage_classes in the future, for_each allows for easy scalability without code changes.
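A minimal sketch of the difference, assuming the same deploy_fsx_volume toggle this module uses; the other arguments and variable names here are illustrative, not the blueprint's actual values:

variable "deploy_fsx_volume" {
  type    = bool
  default = true
}

variable "private_subnet_id" {
  type = string
}

# count: the single instance is addressed as aws_fsx_lustre_file_system.with_count[0]
resource "aws_fsx_lustre_file_system" "with_count" {
  count            = var.deploy_fsx_volume ? 1 : 0
  storage_capacity = 1200
  subnet_ids       = [var.private_subnet_id]
}

# for_each: the instance is addressed by a stable key,
# aws_fsx_lustre_file_system.with_for_each["fsx"], so adding more keys later
# never renumbers (and therefore never recreates) existing instances.
resource "aws_fsx_lustre_file_system" "with_for_each" {
  for_each         = toset(var.deploy_fsx_volume ? ["fsx"] : [])
  storage_capacity = 1200
  subnet_ids       = [var.private_subnet_id]
}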
Thank you for highlighting all of these; I was following convention, but I do see some prior art here for using for_each; I'll switch over to that.
Thank you for accepting my recommendation!
# userPods:
#   nodeAffinity:
#     matchNodePurpose: require # This will force single-user pods to use a specific karpenter provisioner
Is this going to be unused as well?
We will use resource requests as the mechanism for binding workloads to specific nodes (GPU/Neuron/CPU). If we need finer-grained control, we can change that.
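Roughly, that binding looks like the sketch below, expressed with the Terraform kubernetes provider; it assumes the Neuron device plugin exposes aws.amazon.com/neuron (nvidia.com/gpu and plain cpu/memory requests behave the same way for the other node types), and the pod name and image are placeholders:

resource "kubernetes_pod_v1" "neuron_request_example" {
  metadata {
    name = "neuron-request-example" # placeholder name
  }

  spec {
    container {
      name  = "app"
      image = "my-registry/my-neuron-image:latest" # placeholder image

      resources {
        limits = {
          # Requesting a Neuron device is what steers the pod onto a
          # Neuron-capable node (via the scheduler and Karpenter).
          "aws.amazon.com/neuron" = "1"
        }
      }
    }
  }
}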
@omrishiv great stuff! Thank you for pushing this through. Some comments based on a first pass.
@@ -7,24 +7,25 @@
#---------------------------------------------------------------
@@ -0,0 +1,66 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
Can we use the helm chart instead?
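If this component has a published chart, pulling it in through the existing Terraform/Helm path would look roughly like the sketch below; the repository URL, chart name, and version are placeholders rather than a confirmed chart for this manifest:

resource "helm_release" "nvidia_component" {
  name             = "nvidia-component"                # placeholder release name
  repository       = "https://example.com/helm-charts" # placeholder repository URL
  chart            = "nvidia-component"                # placeholder chart name
  version          = "1.0.0"                           # pin a chart version
  namespace        = "nvidia"
  create_namespace = true

  # Any settings the vendored manifest hard-codes would move into values.
  values = [yamlencode({
    tolerations = [{
      key      = "nvidia.com/gpu"
      operator = "Exists"
      effect   = "NoSchedule"
    }]
  })]
}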
@@ -0,0 +1,42 @@
apiVersion: apps/v1
Let's pull the daemonset dynamically. Similar to this...
I don't believe we can do this. We use taints on the Neuron nodes to make sure non-Neuron workloads don't run on them, and the daemonset from the Neuron SDK does not have the matching toleration. That said, neither does ours; I will add it.
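For reference, a minimal sketch of the toleration the daemonset's pod template would need, written with the Terraform kubernetes provider; the taint key/effect and the image are assumptions, not the blueprint's actual values:

resource "kubernetes_daemon_set_v1" "neuron_device_plugin" {
  metadata {
    name      = "neuron-device-plugin" # placeholder name
    namespace = "kube-system"
  }

  spec {
    selector {
      match_labels = {
        name = "neuron-device-plugin"
      }
    }

    template {
      metadata {
        labels = {
          name = "neuron-device-plugin"
        }
      }

      spec {
        # Without a matching toleration, the taint on the Neuron nodes would
        # keep these pods (and therefore the device plugin) off those nodes.
        toleration {
          key      = "aws.amazon.com/neuron" # assumed taint key
          operator = "Exists"
          effect   = "NoSchedule"
        }

        container {
          name  = "device-plugin"
          image = "my-registry/neuron-device-plugin:latest" # placeholder image
        }
      }
    }
  }
}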
variable "huggingface_token" { | ||
description = "Hugging Face Secret Token" | ||
type = string | ||
default = "DUMMY_TOKEN_REPLACE_ME" | ||
sensitive = true | ||
} |
variable "huggingface_token" { | |
description = "Hugging Face Secret Token" | |
type = string | |
default = "DUMMY_TOKEN_REPLACE_ME" | |
sensitive = true | |
} | |
variable "huggingface_token" { | |
description = "Hugging Face Secret Token" | |
type = string | |
default = "DUMMY_TOKEN_REPLACE_ME" | |
ephemeral = true | |
} |
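A short sketch of how the ephemeral variable could then be consumed, assuming Terraform 1.10+ for ephemeral input variables and Terraform 1.11+ plus a recent AWS provider for write-only arguments; the Secrets Manager resource here is illustrative, not part of this PR:

variable "huggingface_token" {
  description = "Hugging Face Secret Token"
  type        = string
  default     = "DUMMY_TOKEN_REPLACE_ME"
  ephemeral   = true # never persisted to the plan or state files
}

resource "aws_secretsmanager_secret" "hf_token" {
  name_prefix = "huggingface-token-" # placeholder secret name
}

# Ephemeral values can only flow into ephemeral contexts such as
# write-only arguments; using them in ordinary arguments is an error.
resource "aws_secretsmanager_secret_version" "hf_token" {
  secret_id                = aws_secretsmanager_secret.hf_token.id
  secret_string_wo         = var.huggingface_token
  secret_string_wo_version = 1 # bump to push a new value
}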
@@ -0,0 +1,6 @@
name = "jark-stack"
Do we need a separate JupyterHub?
The reason to potentially keep JupyterHub is to demonstrate the different auth mechanisms, if that's relevant. However, I think we can remove trainium-inferentia, nvidia-triton-server (we need to move the NIM deployment into the infrastructure), and ray. Then the examples from these blueprints can be moved under the current gen-ai folder, which should be renamed to ai-ml, and we can highlight them through the website.
@@ -0,0 +1,6 @@
name = "jark-stack"
name = "jark-stack" | |
name = "jupyterhub" |
enable_volcano          = true
enable_kuberay_operator = true
Are these needed?
What does this PR do?
This PR lays out the infrastructure foundation for AIoEKS (AI on EKS). It aims to create one infrastructure deployment that can be customized for different use cases, allowing advanced usage of the AI environment as well as highlighting purpose-built blueprints.
Motivation
The current approach to blueprints creates very isolated environments that each showcase a single task: deploy model X into EKS, deploy MLflow, deploy JupyterHub, etc. This is nice when it comes to isolation, but it creates maintainability issues, since each blueprint needs to be updated whenever addons or the underlying infrastructure change.
This PR aims to consolidate the core infrastructure and addons of all of the DoEKS blueprints and set the foundation for a configurable AI/ML environment based on needs and best practices. This will increase maintainability, allow for better customization, and enable adding more functionality.
Contributing
We need help retesting the existing blueprints and deployments to make sure they work in the current environment.
If you are interested in helping, please reach out before you start so we can make sure no one else is working on it.
To contribute, please branch off of this branch in your fork and open PRs against this branch. We will merge into this branch as work is validated and then merge this branch in its entirety back into DoEKS.
More
- Updated the website/docs or website/blog section for this feature
- Ran pre-commit run -a with this PR. Link for installing pre-commit locally
For Moderators
Additional Notes
Changelog
combined all ai/ml blueprints into one infrastructure
fixed:
added: