Update KFTO MNIST multi-node/multi-gpu test to utilise multiple GPUs c… #301
base: main
Conversation
Force-pushed from bd6930b to 24405d8 (compare)
}
-func runKFTOPyTorchMnistJob(t *testing.T, numGpus int, workerReplicas int, gpuLabel string, image string, requirementsFile string) {
+func runKFTOPyTorchMnistJob(t *testing.T, totalNumGpus int, workerReplicas int, numCPUsOrGPUsCountPerNode int, gpuLabel string, image string, requirementsFile string) {
IMHO it would make more sense to rename numCPUsOrGPUsCountPerNode to numProcPerNode and keep the CPU count hardcoded. numCPUsOrGPUsCountPerNode looks confusing to me, as it is not clear what it represents.
Actually, by this variable I meant the number of devices (GPUs/CPUs) to be utilised per cluster node, but I agree the wording was quite confusing 😅
With this approach I wanted to add test coverage for multi-node use cases:
- single CPU/GPU per node
- multiple CPUs/GPUs per node
This is similar to the torchrun command's --nproc_per_node argument, which lets you specify the number of devices to be utilised per node, whether that is a number of CPUs or GPUs.
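A hedged sketch of how --nproc_per_node surfaces inside the training script: torchrun launches that many processes per node and sets standard environment variables for each one (the snippet is illustrative, not taken from this PR):

import os

# torchrun --nnodes=<nodes> --nproc_per_node=<N> mnist.py launches
# N processes per node and sets these variables for each process.
local_rank = int(os.environ["LOCAL_RANK"])   # device index on this node: 0..N-1
global_rank = int(os.environ["RANK"])        # unique rank across all nodes
world_size = int(os.environ["WORLD_SIZE"])   # total processes = nnodes * N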
Force-pushed from 24405d8 to b19447b (compare)
/lgtm Great work!
Some leftover comments :)
tests/kfto/resources/mnist.py (outdated)
backend="gloo" | ||
dist.init_process_group(backend=backend) | ||
# Wait for all nodes to join | ||
dist.barrier() |
Is it really necessary? I don't see it used consistently in other examples.
Yeah, right. Initially I saw that the master pod completed its execution even while the worker pods were still running, which might have been caused by a badly set up distributed environment. Then I read about this barrier function.
[rank1]: RuntimeError: Rank 1 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.
The barrier is needed to keep all nodes in sync: if the master pod is ready but a worker pod is still installing requirements, the master pod waits until all worker nodes have reached that point (in this case, until every node has initialised the distributed backend). Each process blocks at the barrier until all processes have reached it.
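A minimal sketch of the pattern described above; the explicit timeout is an assumption (not taken from the PR) so that ranks still running pip install do not time out the rendezvous:

import datetime
import torch.distributed as dist

# All ranks join the process group; the generous timeout is an assumption
# to tolerate workers that are still installing requirements.
dist.init_process_group(backend="gloo",
                        timeout=datetime.timedelta(minutes=10))

# Explicit sync point: every rank blocks here until all ranks have
# initialised the backend, so the master cannot finish ahead of slow workers.
dist.barrier()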
Thanks. As you say, this may be needed because of the pip install command we run, which delays when a node is able to join. Otherwise it should not be needed.
Excerpted from the init_process_group code, which gives some hints:
# barrier at the end to ensure that once we return from this method, all
# process groups including global variables (if any) are updated
# correctly on all ranks.
# Update 04/2023: for large-scale runs, this barrier (esp. store-based
# barrier) may be costly and/or unscalable. Also, in a lot of cases,
# these barriers may be unnecessary, as proven by a green CI after
# removal. An environment variable `TORCH_DIST_INIT_BARRIER` has been
# added which enables this barrier only when set to 1.
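For completeness, a hedged sketch of the opt-in mentioned in the excerpt (assuming a PyTorch version that honours the variable): setting TORCH_DIST_INIT_BARRIER=1 restores the init-time barrier without an explicit dist.barrier() call.

import os
import torch.distributed as dist

# Re-enable the store-based barrier performed at the end of
# init_process_group; newer PyTorch versions skip it unless this is "1".
os.environ["TORCH_DIST_INIT_BARRIER"] = "1"
dist.init_process_group(backend="gloo")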
Force-pushed from b19447b to e7edc7b (compare)
/lgtm
good job
@abhijeet-dhumal I ran the test and hit an error. I guess the issue is caused by concurrent downloading of the dataset?
I couldn't reliably reproduce this error, but I have seen it before. I think you're right: each process is trying to download the dataset concurrently. This can be avoided by ensuring that only one process downloads the dataset, for example the rank 0 process, while all other processes wait until the download is complete. 🤔
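A minimal sketch of that idea, assuming the processes share the dataset directory (e.g. a common volume); the path and helper name are illustrative, not from the PR:

import torch.distributed as dist
from torchvision import datasets, transforms

def load_mnist(data_dir="/tmp/data"):
    # Only rank 0 downloads; every other rank waits at the barrier.
    if dist.get_rank() == 0:
        datasets.MNIST(data_dir, train=True, download=True)
    dist.barrier()
    # By now the files exist on disk, so no rank needs to download again.
    return datasets.MNIST(data_dir, train=True, download=False,
                          transform=transforms.ToTensor())

In a multi-node job without shared storage, the same guard would instead be applied per node, keyed on the local rank.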
…enario using DDP example
…de concurrently by using pre-downloaded dataset
Force-pushed from e7edc7b to c3f99cf (compare)
New changes are detected. LGTM label has been removed.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@@ -36,6 +36,7 @@ type Gpu struct {
var (
	NVIDIA = Gpu{ResourceLabel: "nvidia.com/gpu", PrometheusGpuUtilizationLabel: "DCGM_FI_DEV_GPU_UTIL"}
	AMD = Gpu{ResourceLabel: "amd.com/gpu"}
+	CPU = Gpu{ResourceLabel: "cpu"}
AFAIK you don't have to provide ResourceLabel. For CPU it is not used, right?
Thinking about whether Gpu should be replaced by Accelerator to describe its purpose better, now that CPU is included.
Which could have an isGpu function used to detect whether it is a GPU test.
So you could then replace gpu.ResourceLabel != "cpu" with accelerator.isGpu().
@@ -59,7 +63,7 @@ func runKFTOPyTorchMnistJob(t *testing.T, numGpus int, workerReplicas int, gpuLa
	mnist := ReadFile(test, "resources/mnist.py")
	requirementsFileName := ReadFile(test, requirementsFile)

-	if numGpus > 0 {
+	if workerReplicas*numProcPerNode > 0 && gpu.ResourceLabel != "cpu" {
Is it needed to check workerReplicas*numProcPerNode?
pip install --no-cache-dir -r /mnt/files/requirements.txt --target=/tmp/lib && \
python /mnt/files/mnist.py --epochs 3 --save-model --output-path /mnt/output --backend %s`, backend),
echo "Downloading MNIST dataset..." && \
Nitpick - have you considered using an init container for downloading the dataset?
Downloading the dataset and running the tests can be seen as two separate phases.
Description
How Has This Been Tested?
The following tests were executed on a cluster with NVIDIA GPUs:
TestPyTorchJobMnistMultiNodeSingleCpu - 3m 33s (time taken to execute)
TestPyTorchJobMnistMultiNodeMultiCpu - 2m 36s
TestPyTorchJobMnistMultiNodeSingleGpuWithCuda - 2m 35s
TestPyTorchJobMnistMultiNodeMultiGpuWithCuda - 2m 16s
The following tests were executed on a cluster with AMD GPUs:
TestPyTorchJobMnistMultiNodeSingleCpu - 3m 45s
TestPyTorchJobMnistMultiNodeMultiCpu - 2m 42s
TestPyTorchJobMnistMultiNodeSingleGpuWithROCm - 4m 18s
TestPyTorchJobMnistMultiNodeMultiGpuWithROCm - 3m 8s
Merge criteria: