Also compare with the similar flow diagram for IDE integration with SageMaker Studio.
The following diagram describes the flow of events for the training job use case:
- The data scientist (DS) starts a training job with the SSH Helper library as a dependency.
- The SageMaker Python SDK starts the job, sending the train.py script and the SSH Helper library as code dependencies.
- The Amazon SageMaker control plane starts the host and the container.
- The user script (train.py) starts running and launches SSH Helper, which fetches the AWS SSM Agent and other packages from the internet and installs them.
- SSH Helper starts the SSM Agent.
- Through SSM, SSH Helper registers the container as an SSM managed instance and tags it with the DS's AWS user/role name.
- SSH Helper prints the managed instance ID. The log is streamed to CloudWatch Logs.
- The DS manually or automatically tails the training job logs for the managed instance ID.
- Steps 9-12 (optional): The DS starts a process to copy their SSH public key to the container, which is needed to set up port forwarding via SSH (e.g., for remote debugging).
- The DS uses the AWS SSM CLI to start a shell, providing the managed instance ID as a parameter. Optionally, the user starts SSM with SSH port forwarding using the helper command `sm-local-ssh-training connect <<training_job_name>>`.
- AWS SSM IAM rules verify that the user is allowed to take this action and that the instance is tagged with the DS's AWS user/role name. Once verified, a session is created with the SSM Agent running in the container.
- The SSM Agent creates a shell by spawning a new bash process. Optionally, SSH port forwarding starts over the SSM connection to let the user connect to remote processes over TCP in both directions.
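
The log-tailing and shell-start steps above can be sketched in Python. This is a minimal illustration, not SSH Helper's own implementation: the function names and the sample log lines are hypothetical, and the real log format may differ. It only assumes that SSM managed instance IDs look like `mi-` followed by 17 hexadecimal characters and that a shell is opened with `aws ssm start-session --target <instance-id>`.

```python
import re
from typing import Iterable, List, Optional

# SSM managed instance IDs have the form "mi-" followed by 17 hex characters.
MANAGED_INSTANCE_ID = re.compile(r"\bmi-[0-9a-f]{17}\b")


def find_managed_instance_id(log_lines: Iterable[str]) -> Optional[str]:
    """Return the first managed instance ID found in the log lines, if any."""
    for line in log_lines:
        match = MANAGED_INSTANCE_ID.search(line)
        if match:
            return match.group(0)
    return None


def start_session_command(instance_id: str) -> List[str]:
    """Build the AWS CLI argument list that opens an SSM shell on the instance."""
    return ["aws", "ssm", "start-session", "--target", instance_id]


# Hypothetical log lines; the actual SSH Helper output may be worded differently.
sample_logs = [
    "Starting SSM agent",
    "Registered with AWS SSM, managed instance-id: mi-0123456789abcdef0",
]

instance_id = find_managed_instance_id(sample_logs)
if instance_id is not None:
    # Print the command rather than running it, so the sketch needs no AWS access.
    print(" ".join(start_session_command(instance_id)))
```

To actually open the shell, the argument list could be passed to `subprocess.run`. That requires the AWS CLI with the Session Manager plugin installed, plus credentials that pass the IAM checks described above.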