In EKS > Set up > Quickstart tutorial, ebs-deployment.yaml refers to a pvc that doesn't exist #811

Open · wants to merge 4 commits into base: mainline
27 changes: 27 additions & 0 deletions .github/workflows/vale.yml
@@ -0,0 +1,27 @@
name: Style check

on:
  pull_request:
  workflow_dispatch:

jobs:
  style-job:
    runs-on: ubuntu-latest
    steps:
      - name: Check out
        uses: actions/checkout@v3

      # For AsciiDoc users:
      - name: Install Asciidoctor
        run: sudo apt-get install -y asciidoctor

      - name: Run Vale
        uses: errata-ai/vale-action@reviewdog
        env:
          GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}
        with:
          fail_on_error: true
          reporter: github-pr-check
          filter_mode: added
          files: latest/ug
        continue-on-error: false
3 changes: 2 additions & 1 deletion .gitignore
@@ -77,4 +77,5 @@ build
*.xlsx
*.xpr
*.zip
.vale/*
vale/styles/AsciiDoc/
vale/styles/RedHat/
12 changes: 8 additions & 4 deletions .vale.ini
@@ -1,12 +1,16 @@
StylesPath = .vale/styles
StylesPath = vale/styles

MinAlertLevel = suggestion

Packages = RedHat, AsciiDoc

Vocab = EksDocsVocab

# Ignore files in dirs starting with `.` to avoid raising errors for `.vale/fixtures/*/testinvalid.adoc` files
[[!.]*.adoc]
BasedOnStyles = RedHat, AsciiDoc
RedHat.CaseSensitiveTerms = warning
BasedOnStyles = RedHat, AsciiDoc, EksDocs
RedHat.GitLinks = OFF
AsciiDoc.UnsetAttributes = OFF
RedHat.CaseSensitiveTerms = suggestion
RedHat.TermsErrors = warning
AsciiDoc.UnsetAttributes = NO
RedHat.Spacing = warning
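
With this configuration in place, you can reproduce the CI style check locally before opening a pull request. The following is a minimal sketch, assuming the Vale CLI is installed; `vale sync` downloads the `RedHat` and `AsciiDoc` packages declared in `.vale.ini` into the `StylesPath` directory.

[source,bash]
----
# Download the style packages declared in .vale.ini
vale sync

# Lint the user guide sources, matching the path the workflow checks
vale latest/ug
----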
2 changes: 2 additions & 0 deletions latest/ug/book.adoc
@@ -74,6 +74,8 @@ include::connector/eks-connector.adoc[leveloffset=+1]

include::outposts/eks-outposts.adoc[leveloffset=+1]

include::ml/machine-learning-on-eks.adoc[leveloffset=+1]

include::related-projects.adoc[leveloffset=+1]

include::roadmap.adoc[leveloffset=+1]
12 changes: 0 additions & 12 deletions latest/ug/integrations/deep-learning-containers.adoc

This file was deleted.

3 changes: 0 additions & 3 deletions latest/ug/integrations/eks-integrations.adoc
@@ -22,9 +22,6 @@ In addition to the services covered in other sections, Amazon EKS works with mor
include::creating-resources-with-cloudformation.adoc[leveloffset=+1]


include::deep-learning-containers.adoc[leveloffset=+1]


include::integration-detective.adoc[leveloffset=+1]


@@ -3,7 +3,7 @@ include::../attributes.txt[]
[.topic]
[[capacity-blocks,capacity-blocks.title]]
= Create self-managed nodes with Capacity Blocks for ML
:info_titleabbrev: Capacity Blocks for ML
:info_titleabbrev: Reserve GPUs

[abstract]
--
@@ -46,7 +46,6 @@ Make sure the `LaunchTemplateData` includes the following:

+
The following is an excerpt of a CloudFormation template that creates a launch template targeting a Capacity Block.
+
[source,yaml,subs="verbatim,attributes,quotes"]
----
NodeLaunchTemplate:
@@ -67,7 +66,6 @@
- sg-05b1d815d1EXAMPLE
UserData: user-data
----
+
You must pass the subnet in the Availability Zone in which the reservation is made because Capacity Blocks are zonal.
. Use the launch template to create a self-managed node group. If you're doing this prior to the capacity reservation becoming active, then set the desired capacity to `0`. When creating the node group, make sure that you are only specifying the respective subnet for the Availability Zone in which the capacity is reserved.
+
@@ -1,15 +1,14 @@
//!!NODE_ROOT <section>
include::../attributes.txt[]

[.topic]
[[inferentia-support,inferentia-support.title]]
= Deploy [.noloc]`ML` inference workloads with {aws}[.noloc]`Inferentia` on Amazon EKS
= Use {aws} [.noloc]`Inferentia` workloads with Amazon EKS for Machine Learning
:info_doctype: section
:info_title: Deploy ML inference workloads with AWSInferentia on Amazon EKS
:info_titleabbrev: Machine learning inference
:info_title: Use {aws} Inferentia workloads with your EKS cluster for Machine Learning
:info_titleabbrev: Create {aws} Inferentia cluster
:info_abstract: Learn how to create an Amazon EKS cluster with nodes running Amazon EC2 Inf1 instances for machine learning inference using {aws} Inferentia chips and deploy a TensorFlow Serving application.

include::../attributes.txt[]

[abstract]
--
Learn how to create an Amazon EKS cluster with nodes running Amazon EC2 Inf1 instances for machine learning inference using {aws} Inferentia chips and deploy a TensorFlow Serving application.
68 changes: 68 additions & 0 deletions latest/ug/ml/machine-learning-on-eks.adoc
@@ -0,0 +1,68 @@
//!!NODE_ROOT <chapter>
include::../attributes.txt[]
[.topic]
[[machine-learning-on-eks,machine-learning-on-eks.title]]
= Overview of Machine Learning on Amazon EKS
:doctype: book
:sectnums:
:toc: left
:icons: font
:experimental:
:idprefix:
:idseparator: -
:sourcedir: .
:info_doctype: chapter
:info_title: Machine Learning on Amazon EKS Overview
:info_titleabbrev: Machine Learning on EKS
:keywords: Machine Learning, Amazon EKS, Artificial Intelligence
:info_abstract: Learn to run Machine Learning applications on Amazon EKS

[abstract]
--
Complete guide for running Machine Learning applications on Amazon EKS. This includes everything from provisioning infrastructure to choosing and deploying Machine Learning workloads on Amazon EKS.
--

[[ml-features,ml-features.title]]

Machine Learning (ML) is an area of Artificial Intelligence (AI) where machines process large amounts of data to look for patterns and make connections in the data. This can expose new relationships and help predict outcomes that might not have been apparent otherwise.

For large-scale ML projects, data centers must be able to store large amounts of data, process data quickly, and integrate data from many sources. The platforms running ML applications must be reliable and secure, but also offer resiliency to recover from data center outages and application failures. Amazon Elastic Kubernetes Service (EKS), running in the {aws} cloud, is particularly suited to ML workloads.

The primary goal of this section of the EKS User Guide is to help you put together the hardware and software components needed to build platforms for running Machine Learning workloads in an EKS cluster.
We start by explaining the features and services available to you in EKS and the {aws} cloud, then provide you with tutorials to help you work with ML platforms, frameworks, and models.

=== Advantages of Machine Learning on EKS and the {aws} cloud

Amazon Elastic Kubernetes Service (EKS) is a powerful, managed Kubernetes platform that has become a cornerstone for deploying and managing AI/ML workloads in the cloud. With its ability to handle complex, resource-intensive tasks, Amazon EKS provides a scalable and flexible foundation for running AI/ML models, making it an ideal choice for organizations aiming to harness the full potential of machine learning.

Key Advantages of AI/ML Platforms on Amazon EKS include:

* *Scalability and Flexibility*
Amazon EKS enables organizations to scale AI/ML workloads seamlessly. Whether you're training large language models that require vast amounts of compute power or deploying inference pipelines that need to handle unpredictable traffic patterns, EKS scales up and down efficiently, optimizing resource use and cost.

* *High Performance with GPUs and Neuron Instances*
Amazon EKS supports a wide range of compute options, including GPUs and {aws} Neuron instances, which are essential for accelerating AI/ML workloads. This support allows for high-performance training and low-latency inference, ensuring that models run efficiently in production environments (see the example resource request after this list).

* *Integration with AI/ML Tools*
Amazon EKS integrates seamlessly with popular AI/ML tools and frameworks like TensorFlow, PyTorch, and Ray, providing a familiar and robust ecosystem for data scientists and engineers. These integrations enable users to leverage existing tools while benefiting from the scalability and management capabilities of Kubernetes.

* *Automation and Management*
Kubernetes on Amazon EKS automates many of the operational tasks associated with managing AI/ML workloads. Features like automatic scaling, rolling updates, and self-healing ensure that your applications remain highly available and resilient, reducing the overhead of manual intervention.

* *Security and Compliance*
Running AI/ML workloads on Amazon EKS provides robust security features, including fine-grained IAM roles, encryption, and network policies, ensuring that sensitive data and models are protected. EKS also adheres to various compliance standards, making it suitable for enterprises with strict regulatory requirements.
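
As a concrete illustration of the GPU and Neuron support described above, the following is a minimal sketch of a [.noloc]`Pod` spec that requests accelerator capacity through Kubernetes device-plugin resource names. The `nvidia.com/gpu` resource name is exposed by the NVIDIA device plugin; the commented-out `aws.amazon.com/neuron` line assumes the {aws} Neuron device plugin is installed instead. The image name is a placeholder for your own workload.

[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: accelerator-demo
spec:
  restartPolicy: OnFailure
  containers:
  - name: demo
    image: my-ml-image:latest  # placeholder for your workload image
    resources:
      limits:
        nvidia.com/gpu: 1           # schedule onto a node with a free NVIDIA GPU
        # aws.amazon.com/neuron: 1  # alternative: request an AWS Neuron device
----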

=== Why Choose Amazon EKS for AI/ML?
Amazon EKS offers a comprehensive, managed environment that simplifies the deployment of AI/ML models while providing the performance, scalability, and security needed for production workloads. With its ability to integrate with a variety of AI/ML tools and its support for advanced compute resources, EKS empowers organizations to accelerate their AI/ML initiatives and deliver innovative solutions at scale.

By choosing Amazon EKS, you gain access to a robust infrastructure that can handle the complexities of modern AI/ML workloads, allowing you to focus on innovation and value creation rather than managing underlying systems. Whether you are deploying simple models or complex AI systems, Amazon EKS provides the tools and capabilities needed to succeed in a competitive and rapidly evolving field.

=== Start using Machine Learning on EKS

To begin planning for and using Machine Learning platforms and workloads on EKS on the {aws} cloud, proceed to the <<ml-get-started>> section.

include::ml-get-started.adoc[leveloffset=+1]

include::ml-prepare-for-cluster.adoc[leveloffset=+1]

include::ml-tutorials.adoc[leveloffset=+1]
87 changes: 87 additions & 0 deletions latest/ug/ml/ml-eks-optimized-ami.adoc
@@ -0,0 +1,87 @@
//!!NODE_ROOT <section>
[.topic]
[[ml-eks-optimized-ami,ml-eks-optimized-ami.title]]
= Create nodes with EKS optimized accelerated Amazon Linux AMIs
:info_titleabbrev: Run GPU AMIs

include::../attributes.txt[]

The Amazon EKS optimized accelerated Amazon Linux AMI is built on top of the standard Amazon EKS optimized Amazon Linux AMI. For details on these AMIs, see <<gpu-ami>>.
The following text describes how to enable {aws} Neuron-based workloads.

.To enable {aws} Neuron (ML accelerator) based workloads
For details on training and inference workloads using [.noloc]`Neuron` in Amazon EKS, see the following references:

* https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html[Containers - Kubernetes - Getting Started] in the _{aws} [.noloc]`Neuron` Documentation_
* https://github.com/aws-neuron/aws-neuron-eks-samples/blob/master/README.md#training[Training] in {aws} [.noloc]`Neuron` EKS Samples on GitHub
* <<inferentia-support,Deploy ML inference workloads with {aws} Inferentia on Amazon EKS>>

The following procedure describes how to run a workload on a GPU-based instance with the Amazon EKS optimized accelerated AMI.

. After your GPU nodes join your cluster, you must apply the https://github.com/NVIDIA/k8s-device-plugin[NVIDIA device plugin for Kubernetes] as a [.noloc]`DaemonSet` on your cluster. Replace [.replaceable]`vX.X.X` with your desired https://github.com/NVIDIA/k8s-device-plugin/releases[NVIDIA/k8s-device-plugin] version before running the following command.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/vX.X.X/deployments/static/nvidia-device-plugin.yml
----
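+
You can confirm that the [.noloc]`DaemonSet` is running with the following command. The resource name here assumes the static manifest linked above; it may differ between plugin releases.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system
----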
. You can verify that your nodes have allocatable GPUs with the following command.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
----
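+
An example output is as follows. The node names and GPU counts are illustrative and will vary by cluster.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
NAME                                          GPU
ip-192-168-24-1.us-west-2.compute.internal    4
ip-192-168-50-12.us-west-2.compute.internal   4
----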
. Create a file named `nvidia-smi.yaml` with the following contents. Replace [.replaceable]`tag` with your desired tag for https://hub.docker.com/r/nvidia/cuda/tags[nvidia/cuda]. This manifest launches an https://developer.nvidia.com/cuda-zone[NVIDIA CUDA] container that runs `nvidia-smi` on a node.
+
[source,yaml,subs="verbatim,attributes,quotes"]
----
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:tag
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 1
----
. Apply the manifest with the following command.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
kubectl apply -f nvidia-smi.yaml
----
. After the [.noloc]`Pod` has finished running, view its logs with the following command.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
kubectl logs nvidia-smi
----
+
An example output is as follows.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
Mon Aug 6 20:23:31 20XX
+-----------------------------------------------------------------------------+
| NVIDIA-SMI XXX.XX Driver Version: XXX.XX |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 46C P0 47W / 300W | 0MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
----
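. (Optional) Remove the test [.noloc]`Pod` when you're finished.
+
[source,bash,subs="verbatim,attributes,quotes"]
----
kubectl delete -f nvidia-smi.yaml
----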


51 changes: 51 additions & 0 deletions latest/ug/ml/ml-get-started.adoc
@@ -0,0 +1,51 @@
//!!NODE_ROOT <section>

[.topic]
[[ml-get-started,ml-get-started.title]]
= Get started with ML
:info_doctype: section
:info_title: Get started deploying Machine Learning tools on EKS
:info_titleabbrev: Get started with ML
:info_abstract: Choose the Machine Learning on EKS tools and platforms that best suit your needs, then use quick start procedures to deploy them to the {aws} cloud.

include::../attributes.txt[]


[abstract]
--
Choose the Machine Learning on EKS tools and platforms that best suit your needs, then use quick start procedures to deploy ML workloads and EKS clusters to the {aws} cloud.
--

To jump into Machine Learning on EKS, start by choosing from these prescriptive patterns to quickly get an EKS cluster and ML software and hardware ready to begin running ML workloads. Most of these patterns are based on Terraform blueprints that are available from the https://awslabs.github.io/data-on-eks/docs/introduction/intro[Data on Amazon EKS] site. Before you begin, here are a few things to keep in mind:

* GPUs or Neuron instances are required to run these procedures. Lack of availability of these resources can cause these procedures to fail during cluster creation or node autoscaling (the sketch after this list shows one way to check availability in advance).
* Instances based on {aws} Trainium and Inferentia (which use the Neuron SDK) can cost less and are often more readily available than NVIDIA GPUs. So, when your workloads permit it, we recommend that you consider using Neuron for your Machine Learning workloads (see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/[Welcome to {aws} Neuron]).
* Some of the getting started experiences here require that you get data via your own https://huggingface.co/[Hugging Face] account.
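
The following sketch shows one way to check accelerated instance availability in advance, using the standard {aws} CLI. The instance type and Region here are examples; substitute your own.

[source,bash]
----
# List the Availability Zones in us-west-2 that offer inf2.xlarge instances
aws ec2 describe-instance-type-offerings \
    --location-type availability-zone \
    --filters Name=instance-type,Values=inf2.xlarge \
    --region us-west-2 \
    --query "InstanceTypeOfferings[].Location" \
    --output text
----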

To get started, choose from the following selection of patterns that are designed to get you started setting up infrastructure to run your Machine Learning workloads:

* *https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/jupyterhub[JupyterHub on EKS]*: Explore the https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/jupyterhub[JupyterHub blueprint], which showcases Time Slicing and MIG features, as well as multi-tenant configurations with profiles. This is ideal for deploying large-scale JupyterHub platforms on EKS.
* *https://aws.amazon.com/ai/machine-learning/neuron/[Large Language Models on {aws} Neuron and RayServe]*: Use https://aws.amazon.com/ai/machine-learning/neuron/[{aws} Neuron] to run large language models (LLMs) on Amazon EKS and {aws} Trainium and {aws} Inferentia accelerators. See https://awslabs.github.io/data-on-eks/docs/gen-ai/inference/Neuron/vllm-ray-inf2[Serving LLMs with RayServe and vLLM on {aws} Neuron] for instructions on setting up a platform for making inference requests, with components that include:
+
** {aws} Neuron SDK toolkit for deep learning
** {aws} Inferentia and Trainium accelerators
** vLLM, a high-throughput inference and serving engine for LLMs (see the https://docs.vllm.ai/en/latest/[vLLM] documentation site)
** RayServe scalable model serving library (see the https://docs.ray.io/en/latest/serve/index.html[Ray Serve: Scalable and Programmable Serving] site)
** Llama-3 language model, using your own https://huggingface.co/[Hugging Face] account.
** Observability with {aws} CloudWatch and Neuron Monitor
** Open WebUI
* *https://awslabs.github.io/data-on-eks/docs/gen-ai/inference/GPUs/vLLM-NVIDIATritonServer[Large Language Models on NVIDIA and Triton]*: Deploy multiple large language models (LLMs) on Amazon EKS and NVIDIA GPUs. See https://awslabs.github.io/data-on-eks/docs/gen-ai/inference/GPUs/vLLM-NVIDIATritonServer[Deploying Multiple Large Language Models with NVIDIA Triton Server and vLLM] for instructions on setting up a platform for making inference requests, with components that include:
+
** NVIDIA Triton Inference Server (see the https://github.com/triton-inference-server/server[Triton Inference Server] GitHub site)
** vLLM, a high-throughput inference and serving engine for LLMs (see the https://docs.vllm.ai/en/latest/[vLLM] documentation site)
** Two language models: mistralai/Mistral-7B-Instruct-v0.2 and meta-llama/Llama-2-7b-chat-hf, using your own https://huggingface.co/[Hugging Face] account.

=== Continuing with ML on EKS

Along with choosing from the blueprints described on this page, there are other ways you can proceed through the ML on EKS documentation if you prefer. For example, you can:

* *Try tutorials for ML on EKS* – Run other end-to-end tutorials for building and running your own Machine Learning models on EKS. See <<ml-tutorials>>.

To improve your work with ML on EKS, refer to the following:

* *Prepare for ML* – Learn how to prepare for ML on EKS with features like custom AMIs and GPU reservations. See <<ml-prepare-for-cluster>>.