DeepSpeed still gives CUDA-out-of-memory issue #2302
I have updated the optimization pass as follows; however, I am getting a different error related to
@buttercutter, can you please try
@tjruwase
@buttercutter, got it. This out-of-memory with
@tjruwase From the code link you provided just above,
[phung@archlinux gdas]$ git status --short
M ds_config.json
M gdas.py
[phung@archlinux gdas]$ git diff gdas.py
diff --git a/gdas.py b/gdas.py
index 10c095c..e6a8b5c 100644
--- a/gdas.py
+++ b/gdas.py
@@ -13,6 +13,9 @@ import tensorflow as tf
# import numpy as np
+VISUALIZER = 0
+DEBUG = 0
+
# deepspeed zero offload https://www.deepspeed.ai/getting-started/
# https://github.com/microsoft/DeepSpeed/issues/2029
USE_DEEPSPEED = 1
@@ -21,8 +24,8 @@ if USE_DEEPSPEED:
import argparse
import deepspeed
-VISUALIZER = 0
-DEBUG = 0
+ from deepspeed.runtime.utils import see_memory_usage
+
logdir = 'runs/gdas_experiment_1'
if VISUALIZER:
@@ -944,7 +947,10 @@ if __name__ == "__main__":
while not_converged:
print("run_num = ", run_num)
+ see_memory_usage(f'memory usage before train_NN()', force=True)
ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
+ see_memory_usage(f'memory usage after train_NN()', force=True)
+
print("Finished train_NN()")
if VISUALIZER or DEBUG:
[phung@archlinux gdas]$
[phung@archlinux gdas]$ git diff ds_config.json
diff --git a/ds_config.json b/ds_config.json
index 91943c3..4afc33d 100644
--- a/ds_config.json
+++ b/ds_config.json
@@ -1,10 +1,27 @@
{
- "train_micro_batch_size_per_gpu": 8,
-"steps_per_print": 1,
- "optimizer": {
+ "train_micro_batch_size_per_gpu": 8,
+ "steps_per_print": 1,
+ "optimizer": {
"type": "AdamW",
"params": {
"lr": 0.05
}
- }
+ },
+
+ "zero_optimization": {
+ "stage": 1,
+ "contiguous_gradients": true,
+ "stage3_max_live_parameters": 1e9,
+ "stage3_max_reuse_distance": 1e9,
+ "stage3_prefetch_bucket_size": 1e7,
+ "stage3_param_persistence_threshold": 1e5,
+ "reduce_bucket_size": 1e7,
+ "sub_group_size": 1e9,
+ "offload_optimizer": {
+ "device": "cpu"
+ },
+ "offload_param": {
+ "device": "cpu"
+ }
+ }
}
[phung@archlinux gdas]$

@tjruwase With the above modifications to ds_config.json and gdas.py, here is the resulting log:

Loading extension module utils...
Time to load utils op: 0.338411808013916 seconds
Rank: 0 partition count [1] and sizes[(48330, False)]
[2022-09-10 02:29:09,959] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2022-09-10 02:29:09,959] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.04 GB CA 0.04 GB Max_CA 0 GB
[2022-09-10 02:29:09,960] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,114] [INFO] [utils.py:827:see_memory_usage] After initializing optimizer states
[2022-09-10 02:29:10,114] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.04 GB CA 0.04 GB Max_CA 0 GB
[2022-09-10 02:29:10,115] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,115] [INFO] [stage_1_and_2.py:516:__init__] optimizer state initialized
[2022-09-10 02:29:10,261] [INFO] [utils.py:827:see_memory_usage] After initializing ZeRO optimizer
[2022-09-10 02:29:10,262] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.04 GB CA 0.04 GB Max_CA 0 GB
[2022-09-10 02:29:10,262] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,267] [INFO] [config.py:987:print] DeepSpeedEngine configuration:
[2022-09-10 02:29:10,268] [INFO] [config.py:991:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
run_num = 0
[gdas.ipynb.zip](https://github.com/microsoft/DeepSpeed/files/9539457/gdas.ipynb.zip)
[2022-09-10 02:29:10,422] [INFO] [utils.py:827:see_memory_usage] memory usage before train_NN()
[2022-09-10 02:29:10,423] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.04 GB CA 0.04 GB Max_CA 0 GB
[2022-09-10 02:29:10,423] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,748] [INFO] [logging.py:68:log_dist] [Rank 0] step=1, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,863] [INFO] [logging.py:68:log_dist] [Rank 0] step=2, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,988] [INFO] [logging.py:68:log_dist] [Rank 0] step=3, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,988] [INFO] [timer.py:207:stop] 0/3, RunningAvgSamplesPerSec=64.29764247567839, CurrSamplesPerSec=64.29764247567839, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,106] [INFO] [logging.py:68:log_dist] [Rank 0] step=4, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:11,107] [INFO] [timer.py:207:stop] 0/4, RunningAvgSamplesPerSec=66.00525611771185, CurrSamplesPerSec=67.80604576253033, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,225] [INFO] [logging.py:68:log_dist] [Rank 0] step=5, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:11,225] [INFO] [timer.py:207:stop] 0/5, RunningAvgSamplesPerSec=66.55657751974118, CurrSamplesPerSec=67.68731983531258, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,355] [INFO] [logging.py:68:log_dist] [Rank 0] step=6, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:36:54,910] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=21.507489154176568, CurrSamplesPerSec=12.836937339152996, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:55,527] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:36:55,528] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=21.496126966776316, CurrSamplesPerSec=12.963147028644192, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:56,185] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:36:56,186] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=21.482932504893128, CurrSamplesPerSec=12.168760408265106, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:56,993] [INFO] [utils.py:827:see_memory_usage] memory usage after train_NN()
[2022-09-10 02:36:56,993] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.1 GB CA 0.1 GB Max_CA 0 GB
[2022-09-10 02:36:56,994] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.19 GB, percent = 33.0%
Finished train_NN()
@tjruwase Sorry that I had put
Unfortunately, that is not very useful. Can you please add
Attached: gdas.ipynb.zip
@tjruwase See the last column of the above ipynb code file, which contains a very long log pertaining to
@tjruwase If you scroll to the very bottom of the ipynb file, the log output seems to imply that the
@tjruwase Could I take it that DeepSpeed is unable to further optimize the gumbel function?

# self-defined initial NAS architecture, for supernet architecture edge weight training
def forward_edge(self, x):
    self.__freeze_f()
    self.__unfreeze_w()

    # Refer to GDAS equations (5) and (6)
    # if one_hot is already there, would summation be required given that all other entries are forced to 0 ?
    # It's not required, but you don't know which index is one-hot encoded 1.
    # https://pytorch.org/docs/stable/nn.functional.html#gumbel-softmax
    # See also https://github.com/D-X-Y/AutoDL-Projects/issues/10#issuecomment-916619163
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
    chosen_edge = torch.argmax(gumbel, dim=0)  # converts one-hot encoding into integer

    return chosen_edge
@buttercutter, I don't understand what
@tjruwase Sorry for the ambiguous wording, it should be

Note: This is not a corporate project, just a homebrew project, so I am not asking for deep involvement on your side. I just need to find out what is wrong with my own code, and if the scenario requires it, I might need some slight guidance from you in debugging the interface between my own code and DeepSpeed, or beyond.

[2022-09-10 03:20:36,847] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.21 GB, percent = 33.2%
Traceback (most recent call last):
File "gdas.py", line 962, in <module>
lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
File "gdas.py", line 783, in train_architecture
graph.forward(val_inputs, types="edge") # Lval(w*, alpha)
File "gdas.py", line 556, in forward
self.cells[c].forward(x, x1, x2, c, types=types)
File "gdas.py", line 442, in forward
self.nodes[n].forward(x, node_num=n, types=types) # Ltrain(w±, alpha)
File "gdas.py", line 320, in forward
y = self.connections[cc].forward(x, types)
File "gdas.py", line 277, in forward
edges_results = edges_results + self.edges[e].forward(x, types)
File "gdas.py", line 167, in forward
y_hat = self.forward_edge(x)
File "gdas.py", line 152, in forward_edge
gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch)
@tjruwase In my own understanding, gumbel-softmax is there to allow the gradient to backpropagate across a discrete domain.
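For reference, the line in the traceback, `ret = y_hard - y_soft.detach() + y_soft`, is the straight-through trick that makes this work: the forward pass uses the hard one-hot sample, while gradients flow through the soft relaxation. A minimal sketch of the idea (Gumbel noise omitted for brevity, names are illustrative only, not the gdas.py code):

import torch
import torch.nn.functional as F

logits = torch.randn(4, requires_grad=True)
y_soft = F.softmax(logits, dim=-1)                 # relaxed, differentiable distribution
index = y_soft.argmax(dim=-1)
y_hard = F.one_hot(index, num_classes=4).float()   # discrete one-hot sample
# Straight-through estimator: the forward value is y_hard,
# but the gradient flows through y_soft back to logits.
y = y_hard - y_soft.detach() + y_soft
loss = (y * torch.arange(4.0)).sum()               # any downstream loss that consumes the hard sample
loss.backward()
print(logits.grad)                                 # gradients reach logits despite the hard argmax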
@buttercutter, I think the issue is that ZeRO does not optimize the memory consumption of activations. Can you try running with a
@buttercutter, you might try activation checkpointing to address this if the smaller micro batch size works. Here are some docs
@tjruwase If I use

Traceback (most recent call last):
File "gdas.py", line 947, in <module>
ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
File "gdas.py", line 690, in train_NN
Ltrain = criterion(NN_output, NN_train_labels)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/loss.py", line 1166, in forward
label_smoothing=self.label_smoothing)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 3014, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (8) to match target batch_size (1).
@buttercutter, yes, this is not surprising, as it indicates a mismatch between the batch size passed to the client script via the command line and the batch size in the ds_config. In the previous case, it made sense to modify the ds_config value of
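A hedged sketch of keeping the two values in sync (`train_dataset` is a stand-in for the dataset already constructed in gdas.py):

import json
from torch.utils.data import DataLoader

with open("ds_config.json") as f:
    micro_batch = json.load(f)["train_micro_batch_size_per_gpu"]

# Using the same value here and in ds_config.json avoids the batch-size
# mismatch reported in the ValueError above.
train_loader = DataLoader(train_dataset, batch_size=micro_batch, shuffle=True, drop_last=True)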
@tjruwase ok, I modified the value of batch_size for both the ds_config as well as the client script to

However, I am confused as to why batch_size needs to be reduced in order to solve the CUDA out-of-memory error, when the primary purpose of the DeepSpeed package is to offload memory usage from GPU to CPU?
@buttercutter, thanks for the update. No, we are not solving the problem by reducing the batch_size; rather, we are trying to confirm whether the memory bloat is one that ZeRO is designed to solve. There are two major sources of memory consumption in model training:

1. Model states: parameters, gradients, and optimizer states.
2. Activations: the intermediate outputs saved during the forward pass for use in the backward pass; these grow with batch size.

ZeRO is designed to solve (1) but not (2). For (2), you need to use activation checkpointing like the links I shared earlier. Here are some next steps that could share more insight:
@tjruwase Regarding using checkpointing to trade compute for memory, should I only use the DeepSpeed version? And how would I perform checkpointing.configure() for this particular gumbel-max function? As for the DeepSpeed config json file, how shall I properly modify the following params?

"activation_checkpointing": {
    "partition_activations": false,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
@buttercutter, yes, I recommend DeepSpeed's activation checkpointing because it supports offloading the activation inputs to CPU memory. In terms of enabling it, there are two parts involved.
However, I think activation checkpointing is the last of the 3 steps that I proposed. In particular, how to effectively use it will depend on the findings from the first two steps, so I recommend completing those first. Also, activation checkpointing is not easy to implement, so I suggest first testing out the Megatron-DeepSpeed implementation that I referenced earlier to understand how it works.
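To see the two parts together, here is a minimal sketch, assuming the deepspeed.checkpointing configure()/checkpoint() API used in the Megatron-DeepSpeed examples; `cell` and `x` are hypothetical stand-ins (one NAS cell module and its input), not the gdas.py code:

import deepspeed

# Part 1: pick up the "activation_checkpointing" section of the DeepSpeed config.
deepspeed.checkpointing.configure(mpu_=None, deepspeed_config="ds_config.json")

# Part 2: wrap a block's forward so its activations are recomputed during the
# backward pass instead of being kept alive through the whole forward pass.
def run_cell(inp):
    return cell(inp)

out = deepspeed.checkpointing.checkpoint(run_cell, x)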
@tjruwase I used see_memory_usage() around the gumbel call; see the log below:

[2022-09-29 19:47:32,987] [INFO] [utils.py:827:see_memory_usage] memory usage before gumbel
[2022-09-29 19:47:32,988] [INFO] [utils.py:832:see_memory_usage] MA 0.61 GB Max_MA 0.61 GB CA 0.68 GB Max_CA 1 GB
[2022-09-29 19:47:32,988] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.44 GB, percent = 35.0%
[2022-09-29 19:47:33,142] [INFO] [utils.py:827:see_memory_usage] memory usage after gumbel
[2022-09-29 19:47:33,142] [INFO] [utils.py:832:see_memory_usage] MA 0.61 GB Max_MA 0.61 GB CA 0.68 GB Max_CA 1 GB
[2022-09-29 19:47:33,143] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.44 GB, percent = 35.0%
@tjruwase if I stick with
@tjruwase It is a bit strange that

What do you suggest then? See the following:

File "gdas.py", line 152, in forward_edge
gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch)
@tjruwase see the following log, which seems not that useful for further debugging
The log snippet suggests that this code passes the 1st time but fails the 2nd time. Is that correct?
@tjruwase Not correct, the log printout from

May I ask how I should modify the debug printout flow such that it only logs printouts from
I see, thanks for the clarification. I thought the 2nd was also a forward pass because of multiple

You don't need to restrict
Also, can you please share the memory usage for batch sizes 1, 2, and 4, so we can understand any pattern from increasing batch size? Thanks!
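A hedged sketch of how those numbers could be collected; `run_training_steps` is a hypothetical wrapper around a few iterations of the existing train_NN() loop with the DataLoader rebuilt for each batch size:

import torch

for bs in (1, 2, 4):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    run_training_steps(batch_size=bs)            # hypothetical: a few steps of train_NN()
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"batch_size={bs}: peak GPU memory allocated {peak_gb:.2f} GB")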
You are right about those 2

The reason I asked for

Could you advise about limiting
I googled, but there is no specific built-in PyTorch variable that can actually differentiate the forward and backward passes during runtime execution? Please correct me if I have missed anything.
Can you define your own global variable that you set/clear around the backward call?
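A minimal sketch of that suggestion; the flag and wrapper names are illustrative, while model_engine.backward()/step() are the calls already used in gdas.py:

IN_BACKWARD = False   # module-level flag, cleared by default

def backward_and_step(model_engine, loss):
    # Flip the flag only while the backward pass runs, so any logging inside
    # forward_edge() (e.g. see_memory_usage calls) can tell the two phases apart.
    global IN_BACKWARD
    IN_BACKWARD = True
    try:
        model_engine.backward(loss)
    finally:
        IN_BACKWARD = False
    model_engine.step()

Inside forward_edge(), a printout could then be guarded with `if not IN_BACKWARD:` so it only fires during the forward pass.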
I followed your suggestion on using a global variable; see the log below.

Here are the exact source files used to obtain the following log snippet:
Thanks for sharing the log. I have a few questions

gc.collect()
torch.cuda.empty_cache()
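For example, a sketch only, reusing the call sites already shown in the tracebacks above: the two calls could be placed between the two training phases to rule out allocator fragmentation before the failing Gumbel-softmax step.

import gc
import torch

ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)

gc.collect()                  # release unreferenced Python objects
torch.cuda.empty_cache()      # return cached, unused blocks to the CUDA driver

lval = train_architecture(graph=graph_, model_engine=model_engine_,
                          forward_pass_only=0, train_or_val='val')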
I have attached all the relevant source files here
If you check the modified gdas.py with your suggestion against the CUDA OOM error log, can I infer that it is not due to memory fragmentation, as you had suggested earlier? Please advise.
I cannot comment on the memory pattern for
Please suggest any possible solution in this particular case.
I think you misunderstood my point, so let me restate:
Hope that is helpful.
In this case then, I will proceed with the use of activation checkpointing. It seems that you are right about the difficulty of implementing a custom forward function with activation checkpointing, given that my current model architecture is based on a network architecture search approach, where there are multiple parallel edges (different types of NN operations) between two computation nodes. How would I start in this case?
@tjruwase could you give some guidance on how to properly wrap the forward function caller with the DeepSpeed activation checkpointing code?
Since a custom forward function with activation checkpointing is a bit difficult to implement in my code, would the gradient accumulation trick help instead? This
Yes, you can also use gradient accumulation, but it won't get the same throughput as a larger batch size. You should add
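For reference, a hedged sketch of the config change usually involved, using the standard DeepSpeed keys with illustrative values; with these settings the effective train_batch_size on a single GPU is 2 x 4 = 8:

ds_config_update = {
    "train_micro_batch_size_per_gpu": 2,   # what actually fits in GPU memory per step
    "gradient_accumulation_steps": 4,      # accumulate 4 micro-steps before each optimizer step
}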
@tjruwase The gradient accumulation trick is going to use up more memory, so it will not solve the OOM issue. I still need to code the custom forward function for enabling activation checkpointing.
If you look at the following actual forward function that I need to modify or customize for gradient checkpointing, you would notice that it does not really return something similar to your

So, how would I make use of the mpu.checkpoint() function in this particular case? Is torch.utils.checkpoint alright with checkpointing values other than losses?

# self-defined initial NAS architecture, for supernet architecture edge weight training
def forward_edge(self, x):
    self.__freeze_f()
    self.__unfreeze_w()

    # Refer to GDAS equations (5) and (6)
    # if one_hot is already there, would summation be required given that all other entries are forced to 0 ?
    # It's not required, but you don't know which index is one-hot encoded 1.
    # https://pytorch.org/docs/stable/nn.functional.html#gumbel-softmax
    # See also https://github.com/D-X-Y/AutoDL-Projects/issues/10#issuecomment-916619163
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
    chosen_edge = torch.argmax(gumbel, dim=0)  # converts one-hot encoding into integer

    return chosen_edge
@buttercutter, gradient checkpointing functions like

Here is another use of gradient checkpointing, for BERT; perhaps that could be helpful.

Another point to keep in mind is that gradient checkpointing is most useful when applied to
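As a hedged illustration of that point, with stand-in names rather than the gdas.py classes: torch.utils.checkpoint can return arbitrary tensors, not just losses, but it only saves memory when the wrapped block holds sizeable activations and at least one of its outputs requires grad, so wrapping a whole cell is more useful than wrapping the tiny forward_edge() alone.

import torch
from torch.utils.checkpoint import checkpoint

def run_cell(cell, x):
    # Re-executed during the backward pass; its intermediate activations
    # are not stored during the forward pass.
    return cell(x)

# `cell` is a stand-in for one NAS cell module and `x` for its input tensor.
out = checkpoint(run_cell, cell, x)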
We have analyzed this previously; feel free to point out anything I have missed, though. See #2302 (comment) and #2302 (comment)
Sorry, I may have forgotten critical findings along the way. If it is not too much trouble, it might still be worthwhile to measure memory usage of each operation in
@tjruwase it seems that the memory access pattern (the latest gdas.ipynb logs are attached here) has changed significantly since I last debugged a month ago. The traceback is now pointing to
May I know why this training code still gives a CUDA out-of-memory issue even after DeepSpeed is turned on?
See this for historical tracking purposes.