Name		Name	Last commit message	Last commit date
parent directory ..
chart		chart
examples		examples
README.md		README.md
release-instructions.md		release-instructions.md

README.md

PyTorchJob Generator

The Helm chart defined in this folder facilitates the configuration of PyTorch jobs for submission to an OpenShift cluster implementing MLBatch.

Invocations of this chart generate a PyTorchJob wrapped into an AppWrapper for better traceability and fault-tolerance.

Obtaining the Chart

To start with, add the mlbatch Helm chart repository.

helm repo add mlbatch https://project-codeflare.github.io/mlbatch
helm repo update

To verify the chart was installed correctly, search for AppWrapper.

helm search repo AppWrapper

You should see output similar to the following:

NAME                        	CHART VERSION	APP VERSION	DESCRIPTION
mlbatch/pytorchjob-generator	1.1.3        	v1beta2    	An AppWrapper generator for PyTorchJobs

Configuring the Job

Create a settings.yaml file with the settings for the PyTorch job, for example:

namespace: my-namespace       # namespace to deploy to (required)
jobName: my-job               # name of the generated AppWrapper and PyTorchJob objects (required)
queueName: default-queue      # local queue to submit to (default: default-queue)

numPods: 4                    # total pod count including master and worker pods (default: 1)
numCpusPerPod: 500m           # requested number of cpus per pod (default: 1)
numGpusPerPod: 8              # requested number of gpus per pod (default: 0)
totalMemoryPerPod: 1Gi        # requested amount of memory per pod (default: 1Gi)

priority: default-priority    # default-priority (default), low-priority, or high-priority

# container image for the pods (required)
containerImage: ghcr.io/foundation-model-stack/base:pytorch-latest-nightly-20230126

# setup commands to run in each pod (optional)
setupCommands:                
- git clone https://github.com/dbarnett/python-helloworld
- cd python-helloworld

# main program to invoke via torchrun (optional)
mainProgram: helloworld.py

To learn more about the available settings see chart/README.md.

Submitting the Job

To submit the Pytorch job to the cluster using the settings.yaml file, run:

helm template -f settings.yaml mlbatch/pytorchjob-generator | oc create -f-

To optionally capture the generated AppWrapper specification as a generated.yaml file, run instead:

helm template -f settings.yaml mlbatch/pytorchjob-generator | tee generated.yaml | oc create -f-

To remove the PyTorch job from the cluster, delete the generated AppWrapper object:

oc delete appwrapper -n my-namespace my-job

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pytorchjob-generator

pytorchjob-generator

README.md

PyTorchJob Generator

Obtaining the Chart

Configuring the Job

Submitting the Job

Files

pytorchjob-generator

Directory actions

More options

Directory actions

More options

Latest commit

History

pytorchjob-generator

Folders and files

parent directory

README.md

PyTorchJob Generator

Obtaining the Chart

Configuring the Job

Submitting the Job