chore: Update documentation for installing EFA device plugin to use official EKS chart #850

Merged
containers/faq.rst: 18 changes (10 additions & 8 deletions)
@@ -66,22 +66,24 @@ The EFA devices are exposed to the container using the --device option

     --device /dev/infiniband/uverbs0

-In the kubernetes environment the EFA device plugin is used to detect and advertise
-EFA interfaces.
+In a Kubernetes environment, the EFA device plugin is used to detect and advertise
+the available EFA interfaces. The EFA device plugin can be installed using the `Helm chart provided by Amazon EKS <https://github.com/aws/eks-charts/tree/master/stable/aws-efa-k8s-device-plugin>`_

 ::

-    kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-efa-eks/main/manifest/efa-k8s-device-plugin.yml
+    helm repo add eks https://aws.github.io/eks-charts
+    helm install aws-efa-k8s-device-plugin --namespace kube-system eks/aws-efa-k8s-device-plugin

-Application can use the resource type vpc.amazonaws.com/efa in a pod request spec
+Once the plugin is deployed, applications can use the resource type vpc.amazonaws.com/efa in a pod request spec

 ::

-    vpc.amazonaws.com/efa: 4
-
+    resources:
+      limits:
+        vpc.amazonaws.com/efa: 4


 Can distributed training jobs be run without EFA devices in container
 ---------------------------------------------------------------------
-No. For distributed training jobs in Trainium all the EFA inrerfaces in trn1.32xlarge needs to be
-exposed to the containers
+No. For distributed training jobs on Trainium, all EFA interfaces provided by trn1.32xlarge need to be
+attached to the container
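
For readers who want to see where the `resources.limits` stanza from the diff sits in a full pod spec, the following is a minimal sketch rather than anything taken from the PR or the EKS chart. The function name `efa_pod_manifest`, its three arguments, and the container name `app` are all hypothetical; only the resource name `vpc.amazonaws.com/efa` comes from the documentation change itself.

```shell
#!/usr/bin/env bash
# Sketch only: efa_pod_manifest, its arguments, and the container name "app"
# are hypothetical and are not defined by the PR or by the EKS Helm chart.

# Print a Pod manifest that requests EFA devices through the device plugin.
#   $1  pod name
#   $2  container image
#   $3  number of EFA interfaces to request
efa_pod_manifest() {
  local name="$1" image="$2" efa_count="$3"

  # Reject empty input, non-digit input, and a literal 0.
  case "$efa_count" in
    ''|*[!0-9]*|0)
      echo "efa_count must be a positive integer" >&2
      return 1
      ;;
  esac

  # The resources block mirrors the stanza added to containers/faq.rst.
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  containers:
    - name: app
      image: ${image}
      resources:
        limits:
          vpc.amazonaws.com/efa: ${efa_count}
EOF
}
```

Assuming the device plugin is already installed, the output could be piped straight to the cluster, for example `efa_pod_manifest trainer <image> 4 | kubectl apply -f -`. The count of 4 matches the value used in the FAQ snippet; the number actually required depends on the instance type and the workload.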