This sample demonstrates how to start training jobs using your own training script, packaged in a SageMaker-compatible container, using the AWS Controllers for Kubernetes (ACK) service controller for Amazon SageMaker.
This sample assumes that you have completed the common prerequisites.
You will need training data uploaded to an S3 bucket. Make sure you have AWS credentials configured and that the bucket is in the same region where you plan to create SageMaker resources. Run the following Python script to upload sample data to your S3 bucket:
python3 s3_sample_data.py $S3_BUCKET_NAME
All SageMaker training jobs run inside a container that has the necessary dependencies and modules pre-installed and whose training scripts reference the expected input and output directories. Sample container images are available.
A container image URL and tag has the following structure:
<account number>.dkr.ecr.<region>.amazonaws.com/<image name>:<tag>
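For instance, a hypothetical image in the us-west-2 Region could look like the following (the account number, image name, and tag here are illustrative placeholders, not a real image):
123456789012.dkr.ecr.us-west-2.amazonaws.com/my-training-image:latest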
In the my-training-job.yaml file, replace the placeholder values with those associated with your account and training job.
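The snippet below is a minimal sketch of what such a specification might look like. It assumes the field names exposed by the ACK SageMaker TrainingJob custom resource, which mirror the SageMaker CreateTrainingJob API; every angle-bracketed value, as well as the instance type and runtime shown, is a placeholder to replace with your own values, and the CRD installed in your cluster is the authoritative schema.

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: TrainingJob
metadata:
  name: my-training-job
spec:
  # Name of the training job as it will appear in SageMaker
  trainingJobName: my-training-job
  # IAM role that SageMaker assumes to access your S3 data and ECR image
  roleARN: arn:aws:iam::<account number>:role/<sagemaker execution role>
  algorithmSpecification:
    # Your SageMaker-compatible training image in ECR
    trainingImage: <account number>.dkr.ecr.<region>.amazonaws.com/<image name>:<tag>
    trainingInputMode: File
  inputDataConfig:
    - channelName: training
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          # Location of the sample data uploaded earlier
          s3URI: s3://<bucket name>/<prefix>
          s3DataDistributionType: FullyReplicated
  outputDataConfig:
    # Where SageMaker writes model artifacts
    s3OutputPath: s3://<bucket name>/<output prefix>
  resourceConfig:
    instanceCount: 1
    instanceType: ml.m5.large
    volumeSizeInGB: 5
  stoppingCondition:
    maxRuntimeInSeconds: 3600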
To submit your prepared training job specification, apply it to your Kubernetes cluster as follows:
$ kubectl apply -f my-training-job.yaml
trainingjob.sagemaker.services.k8s.aws/my-training-job created
To list all training jobs created using the ACK controller, use the following command:
$ kubectl get trainingjob
To get more details about the training job once it has been submitted, such as its status, errors, or parameters, use the following command:
$ kubectl describe trainingjob my-training-job
To delete the training job, use the following command:
$ kubectl delete trainingjob my-training-job