DeepSpeed still gives CUDA-out-of-memory issue #2302

Open
buttercutter opened this issue Sep 7, 2022 · 54 comments
Labels: question (Further information is requested), training
@buttercutter

May I know why this training code still gives a CUDA out-of-memory issue even after DeepSpeed is turned on?

[screenshot]

See this for historical tracking purposes.

buttercutter added the bug (Something isn't working) label on Sep 7, 2022
@tjruwase (Contributor) commented Sep 7, 2022

DeepSpeed's memory optimizations are enabled via ZeRO stages >= 1, and most of them are effective in multi-GPU cases. Your log shows that ZeRO is not enabled.
[screenshot]

Please see these tutorials

  1. https://www.deepspeed.ai/tutorials/zero/
  2. https://www.deepspeed.ai/tutorials/zero-offload/
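For illustration, a minimal sketch of a config that turns ZeRO on at stage 1; the field names follow the tutorials above, while the batch size and learning rate simply mirror the values used elsewhere in this thread:

    # Minimal sketch of writing a ds_config.json that enables ZeRO stage 1
    # (with ZeRO disabled, the stage is effectively 0 and none of the ZeRO
    # memory optimizations apply).
    import json

    ds_config = {
        "train_micro_batch_size_per_gpu": 8,
        "steps_per_print": 1,
        "optimizer": {"type": "AdamW", "params": {"lr": 0.05}},
        "zero_optimization": {
            "stage": 1  # >= 1 enables ZeRO's memory optimizations
        },
    }

    with open("ds_config.json", "w") as f:
        json.dump(ds_config, f, indent=4)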

@buttercutter (Author)

[2022-09-09 15:01:50,764] [INFO] [logging.py:68:log_dist] [Rank 0] step=1248, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:01:50,765] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=14.454011142357578, CurrSamplesPerSec=9.851701418075939, MemAllocated=0.08GB, MaxMemAllocated=0.08GB
[2022-09-09 15:01:51,547] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:01:51,547] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=14.449232505063256, CurrSamplesPerSec=10.233603997501561, MemAllocated=0.08GB, MaxMemAllocated=0.08GB
[2022-09-09 15:01:52,349] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:01:52,349] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=14.444045524931788, CurrSamplesPerSec=9.977597824784409, MemAllocated=0.08GB, MaxMemAllocated=0.08GB
Traceback (most recent call last):
  File "gdas.py", line 947, in <module>
    ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
  File "gdas.py", line 737, in train_NN
    torch.save(graph, path)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 379, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 589, in _save
    pickler.dump(obj)
AttributeError: Can't pickle local object 'instrument_w_nvtx.<locals>.wrapped_fn'
[6eb301421b63:00906] *** Process received signal ***
[6eb301421b63:00906] Signal: Segmentation fault (11)
[6eb301421b63:00906] Signal code: Address not mapped (1)
[6eb301421b63:00906] Failing at address: 0x7f4e4f16420d
[6eb301421b63:00906] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f4e52011980]
[6eb301421b63:00906] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f4e51c50775]
[6eb301421b63:00906] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f4e524bbe44]
[6eb301421b63:00906] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f4e51c51605]
[6eb301421b63:00906] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f4e524b9cb3]
[6eb301421b63:00906] *** End of error message ***

I have updated the ZeRO optimization config as follows; however, I am now getting a different error, related to the pickler, shown above.

{
    "train_micro_batch_size_per_gpu": 8,
    "steps_per_print": 1, 
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 0.05
        }
    },

    "zero_optimization": {
        "stage": 3,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 1e7,
        "stage3_param_persistence_threshold": 1e5,
        "reduce_bucket_size": 1e7,
        "sub_group_size": 1e9,
        "offload_optimizer": {
            "device": "cpu"
         },
        "offload_param": {
            "device": "cpu"
       }
   }
}
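The pickling failure happens because torch.save(graph, path) tries to pickle the whole module object, which after DeepSpeed initialization carries references to instrument_w_nvtx-wrapped functions. A minimal sketch of one common alternative, using the engine's own checkpoint API rather than pickling the wrapped module; the directory and tag names are made up, and this is an assumption rather than the fix adopted in this thread:

    # Sketch (assumption, not the fix applied in this thread): let the DeepSpeed
    # engine persist the training state instead of pickling the instrumented
    # module object with torch.save(graph, path).
    def save_with_deepspeed(model_engine_, ckpt_dir="checkpoints", tag="latest"):
        # save_checkpoint() writes model/optimizer/ZeRO partition state to disk
        # without pickling the wrapped Python object itself.
        model_engine_.save_checkpoint(ckpt_dir, tag)

    def load_with_deepspeed(model_engine_, ckpt_dir="checkpoints", tag="latest"):
        model_engine_.load_checkpoint(ckpt_dir, tag)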

@tjruwase (Contributor) commented Sep 9, 2022

@buttercutter, can you please try stage == 1? We can return to this stage 3 issue later.

@buttercutter (Author)

@tjruwase "stage": 1 resulted in the following out-of-memory issue

[2022-09-09 15:45:39,534] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=21.891309503059055, CurrSamplesPerSec=12.814082860076859, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-09 15:45:40,142] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:45:40,142] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=21.87964277609462, CurrSamplesPerSec=13.148494490509256, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-09 15:45:40,755] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:45:40,755] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=21.867808169940616, CurrSamplesPerSec=13.059329395683946, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
Finished train_NN()
Traceback (most recent call last):
  File "gdas.py", line 955, in <module>
    lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
  File "gdas.py", line 780, in train_architecture
    graph.forward(val_inputs, types="edge")  # Lval(w*, alpha)
  File "gdas.py", line 553, in forward
    self.cells[c].forward(x, x1, x2, c, types=types)
  File "gdas.py", line 439, in forward
    self.nodes[n].forward(x, node_num=n, types=types)  # Ltrain(w±, alpha)
  File "gdas.py", line 317, in forward
    y = self.connections[cc].forward(x, types)
  File "gdas.py", line 274, in forward
    edges_results = edges_results + self.edges[e].forward(x, types)
  File "gdas.py", line 164, in forward
    y_hat = self.forward_edge(x)
  File "gdas.py", line 149, in forward_edge
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[6eb301421b63:01205] *** Process received signal ***
[6eb301421b63:01205] Signal: Segmentation fault (11)
[6eb301421b63:01205] Signal code: Address not mapped (1)
[6eb301421b63:01205] Failing at address: 0x7f556779a20d
[6eb301421b63:01205] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f556a647980]
[6eb301421b63:01205] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f556a286775]
[6eb301421b63:01205] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f556aaf1e44]
[6eb301421b63:01205] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f556a287605]
[6eb301421b63:01205] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f556aaefcb3]
[6eb301421b63:01205] *** End of error message ***

@tjruwase (Contributor) commented Sep 9, 2022

@buttercutter, got it. This out-of-memory with stage 1 is quite strange because the model seems to train for a while, completing multiple iterations, before failing. The compute and memory pattern of model training is typically regular, so any out-of-memory issues should show up within one or two iterations. Is it possible that you have a memory leak? Can you please use the DeepSpeed memory profiler as discussed here? Thanks!

@buttercutter (Author)

@tjruwase From the code link you provided just above,

  1. What is forward_step_func? I suppose it is not simply the model wrapper, since it feeds in args.curriculum_learning?
  2. Do I actually need to use from megatron import get_timers? Since this is a timer, does it actually log the memory profile for each backpropagation path?

@tjruwase (Contributor)

  1. forward_step_func is just the forward pass call. Actually, it could be any function whose effect on memory you are interested in.
  2. No, you don't need the timers import from megatron. Sorry, I just noticed that see_memory_usage() is commented out in the code snippet.
    [screenshot]

Let's say you had a function foo, we can profile memory usage like below.

from deepspeed.runtime.utils import see_memory_usage

see_memory_usage(f'memory usage before foo', force=True)
foo(...)
see_memory_usage(f'memory usage after foo', force=True)

@buttercutter (Author) commented Sep 10, 2022

[phung@archlinux gdas]$ git status --short
 M ds_config.json
 M gdas.py
[phung@archlinux gdas]$ git diff gdas.py
diff --git a/gdas.py b/gdas.py
index 10c095c..e6a8b5c 100644
--- a/gdas.py
+++ b/gdas.py
@@ -13,6 +13,9 @@ import tensorflow as tf
 
 # import numpy as np
 
+VISUALIZER = 0
+DEBUG = 0
+
 # deepspeed zero offload https://www.deepspeed.ai/getting-started/
 # https://github.com/microsoft/DeepSpeed/issues/2029
 USE_DEEPSPEED = 1
@@ -21,8 +24,8 @@ if USE_DEEPSPEED:
     import argparse
     import deepspeed
 
-VISUALIZER = 0
-DEBUG = 0
+    from deepspeed.runtime.utils import see_memory_usage
+
 logdir = 'runs/gdas_experiment_1'
 
 if VISUALIZER:
@@ -944,7 +947,10 @@ if __name__ == "__main__":
     while not_converged:
         print("run_num = ", run_num)
 
+        see_memory_usage(f'memory usage before train_NN()', force=True)
         ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
+        see_memory_usage(f'memory usage after train_NN()', force=True)
+
         print("Finished train_NN()")
 
         if VISUALIZER or DEBUG:
[phung@archlinux gdas]$ 
[phung@archlinux gdas]$ git diff ds_config.json
diff --git a/ds_config.json b/ds_config.json
index 91943c3..4afc33d 100644
--- a/ds_config.json
+++ b/ds_config.json
@@ -1,10 +1,27 @@
 {
- "train_micro_batch_size_per_gpu": 8,
-"steps_per_print": 1, 
- "optimizer": {
+    "train_micro_batch_size_per_gpu": 8,
+    "steps_per_print": 1, 
+    "optimizer": {
         "type": "AdamW",
         "params": {
             "lr": 0.05
         }
-    }
+    },
+
+    "zero_optimization": {
+        "stage": 1,
+        "contiguous_gradients": true,
+        "stage3_max_live_parameters": 1e9,
+        "stage3_max_reuse_distance": 1e9,
+        "stage3_prefetch_bucket_size": 1e7,
+        "stage3_param_persistence_threshold": 1e5,
+        "reduce_bucket_size": 1e7,
+        "sub_group_size": 1e9,
+        "offload_optimizer": {
+            "device": "cpu"
+         },
+        "offload_param": {
+            "device": "cpu"
+       }
+   }
 }
[phung@archlinux gdas]$ 

@tjruwase With the above modifications to the gdas.py training source code as well as the DeepSpeed JSON config file, I get the following terminal output log, which is contained inside the attached gdas.ipynb.zip (https://github.com/microsoft/DeepSpeed/files/9539457/gdas.ipynb.zip). May I ask whether the following log gives any hints about the location of the memory leak, if there is one?

Loading extension module utils...
Time to load utils op: 0.338411808013916 seconds
Rank: 0 partition count [1] and sizes[(48330, False)] 
[2022-09-10 02:29:09,959] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2022-09-10 02:29:09,959] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.04 GB         CA 0.04 GB         Max_CA 0 GB 
[2022-09-10 02:29:09,960] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,114] [INFO] [utils.py:827:see_memory_usage] After initializing optimizer states
[2022-09-10 02:29:10,114] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.04 GB         CA 0.04 GB         Max_CA 0 GB 
[2022-09-10 02:29:10,115] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,115] [INFO] [stage_1_and_2.py:516:__init__] optimizer state initialized
[2022-09-10 02:29:10,261] [INFO] [utils.py:827:see_memory_usage] After initializing ZeRO optimizer
[2022-09-10 02:29:10,262] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.04 GB         CA 0.04 GB         Max_CA 0 GB 
[2022-09-10 02:29:10,262] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,267] [INFO] [config.py:987:print] DeepSpeedEngine configuration:
[2022-09-10 02:29:10,268] [INFO] [config.py:991:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}

run_num =  0
[2022-09-10 02:29:10,422] [INFO] [utils.py:827:see_memory_usage] memory usage before train_NN()
[2022-09-10 02:29:10,423] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.04 GB         CA 0.04 GB         Max_CA 0 GB 
[2022-09-10 02:29:10,423] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,748] [INFO] [logging.py:68:log_dist] [Rank 0] step=1, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,863] [INFO] [logging.py:68:log_dist] [Rank 0] step=2, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,988] [INFO] [logging.py:68:log_dist] [Rank 0] step=3, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,988] [INFO] [timer.py:207:stop] 0/3, RunningAvgSamplesPerSec=64.29764247567839, CurrSamplesPerSec=64.29764247567839, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,106] [INFO] [logging.py:68:log_dist] [Rank 0] step=4, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:11,107] [INFO] [timer.py:207:stop] 0/4, RunningAvgSamplesPerSec=66.00525611771185, CurrSamplesPerSec=67.80604576253033, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,225] [INFO] [logging.py:68:log_dist] [Rank 0] step=5, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:11,225] [INFO] [timer.py:207:stop] 0/5, RunningAvgSamplesPerSec=66.55657751974118, CurrSamplesPerSec=67.68731983531258, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,355] [INFO] [logging.py:68:log_dist] [Rank 0] step=6, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]

[2022-09-10 02:36:54,910] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=21.507489154176568, CurrSamplesPerSec=12.836937339152996, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:55,527] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:36:55,528] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=21.496126966776316, CurrSamplesPerSec=12.963147028644192, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:56,185] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:36:56,186] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=21.482932504893128, CurrSamplesPerSec=12.168760408265106, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:56,993] [INFO] [utils.py:827:see_memory_usage] memory usage after train_NN()
[2022-09-10 02:36:56,993] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.1 GB         CA 0.1 GB         Max_CA 0 GB 
[2022-09-10 02:36:56,994] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.19 GB, percent = 33.0%
Finished train_NN()

@buttercutter (Author)

@tjruwase Sorry, I put see_memory_usage() around the wrong function call in the previous message.
It should be around train_architecture() instead of train_NN().

[2022-09-10 03:20:34,729] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=22.113168856352097, CurrSamplesPerSec=13.503006889446912, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 03:20:35,340] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 03:20:35,340] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=22.100969229749683, CurrSamplesPerSec=13.097596801095447, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 03:20:35,937] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 03:20:35,938] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=22.089474650856374, CurrSamplesPerSec=13.399278091939252, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 03:20:36,715] [INFO] [utils.py:827:see_memory_usage] memory usage after train_NN()
[2022-09-10 03:20:36,716] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.1 GB         CA 0.1 GB         Max_CA 0 GB 
[2022-09-10 03:20:36,716] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.21 GB, percent = 33.2%
Finished train_NN()
[2022-09-10 03:20:36,847] [INFO] [utils.py:827:see_memory_usage] memory usage before train_architecture()
[2022-09-10 03:20:36,847] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.04 GB         CA 0.1 GB         Max_CA 0 GB 
[2022-09-10 03:20:36,847] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.21 GB, percent = 33.2%
Traceback (most recent call last):
  File "gdas.py", line 962, in <module>
    lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
  File "gdas.py", line 783, in train_architecture
    graph.forward(val_inputs, types="edge")  # Lval(w*, alpha)
  File "gdas.py", line 556, in forward
    self.cells[c].forward(x, x1, x2, c, types=types)
  File "gdas.py", line 442, in forward
    self.nodes[n].forward(x, node_num=n, types=types)  # Ltrain(w±, alpha)
  File "gdas.py", line 320, in forward
    y = self.connections[cc].forward(x, types)
  File "gdas.py", line 277, in forward
    edges_results = edges_results + self.edges[e].forward(x, types)
  File "gdas.py", line 167, in forward
    y_hat = self.forward_edge(x)
  File "gdas.py", line 152, in forward_edge
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[a31af820d2a8:01199] *** Process received signal ***
[a31af820d2a8:01199] Signal: Segmentation fault (11)
[a31af820d2a8:01199] Signal code: Address not mapped (1)
[a31af820d2a8:01199] Failing at address: 0x7f3db597220d
[a31af820d2a8:01199] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f3db8416980]
[a31af820d2a8:01199] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f3db8055775]
[a31af820d2a8:01199] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f3db88c0e44]
[a31af820d2a8:01199] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f3db8056605]
[a31af820d2a8:01199] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f3db88becb3]
[a31af820d2a8:01199] *** End of error message ***

@tjruwase (Contributor)

Unfortunately, that is not very useful. Can you please add "memory_breakdown": true to your JSON config?

@buttercutter (Author)

Attached: gdas.ipynb.zip

@tjruwase See the last cell of the attached ipynb file, which contains a very long log pertaining to "memory_breakdown": true.

@buttercutter (Author) commented Sep 11, 2022

@tjruwase If you scroll to the very bottom of the ipynb file, the log output seems to imply that the gumbel function might have contributed to the issue, although I am not quite sure how that is possible. Is there a way to work around this?

@buttercutter (Author)

@tjruwase Am I right to understand that DeepSpeed is unable to further optimize the gumbel function?

    # self-defined initial NAS architecture, for supernet architecture edge weight training
    def forward_edge(self, x):
        self.__freeze_f()
        self.__unfreeze_w()

        # Refer to GDAS equations (5) and (6)
        # if one_hot is already there, would summation be required given that all other entries are forced to 0 ?
        # It's not required, but you don't know, which index is one hot encoded 1.
        # https://pytorch.org/docs/stable/nn.functional.html#gumbel-softmax
        # See also https://github.com/D-X-Y/AutoDL-Projects/issues/10#issuecomment-916619163

        gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
        chosen_edge = torch.argmax(gumbel, dim=0)  # converts one-hot encoding into integer

        return chosen_edge

@tjruwase (Contributor)

@buttercutter, I don't understand what gumbel is doing. So, can you please clarify your expectations for DeepSpeed in this case, and what you mean by "further optimize"? Thanks!

@buttercutter (Author)

@tjruwase Sorry for the ambiguous wording; I meant memory offloading to the CPU rather than "further optimize".

Note: This is not a corporate project, just a homebrew project, so I am not asking for deep involvement on your side. I just need to find out what is wrong with my own code, and if the scenario requires it, I might need some slight guidance from you in debugging the interface between my own code and DeepSpeed, or beyond.

[2022-09-10 03:20:36,847] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.21 GB, percent = 33.2%
Traceback (most recent call last):
  File "gdas.py", line 962, in <module>
    lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
  File "gdas.py", line 783, in train_architecture
    graph.forward(val_inputs, types="edge")  # Lval(w*, alpha)
  File "gdas.py", line 556, in forward
    self.cells[c].forward(x, x1, x2, c, types=types)
  File "gdas.py", line 442, in forward
    self.nodes[n].forward(x, node_num=n, types=types)  # Ltrain(w±, alpha)
  File "gdas.py", line 320, in forward
    y = self.connections[cc].forward(x, types)
  File "gdas.py", line 277, in forward
    edges_results = edges_results + self.edges[e].forward(x, types)
  File "gdas.py", line 167, in forward
    y_hat = self.forward_edge(x)
  File "gdas.py", line 152, in forward_edge
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch) 

@buttercutter (Author)

@tjruwase In my own understanding, gumbel-softmax allows the gradient to backpropagate through a discrete sampling operation.

See the difference between equations (3) and (5)

[screenshot]

@tjruwase (Contributor)

@buttercutter, I think the issue is that ZeRO does not optimize the memory consumption of activations. Can you try running with a train_micro_batch_size_per_gpu of 1 to test this? Also, assuming you are now comfortable with using see_memory_usage(), can you add it just before the gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True) call?
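For concreteness, a minimal sketch of that instrumentation; gumbel_with_profiling is a made-up helper name, and x / tau stand in for the tensor and TAU_GUMBEL used in forward_edge():

    from deepspeed.runtime.utils import see_memory_usage
    import torch.nn.functional as F

    def gumbel_with_profiling(x, tau):
        # Log GPU/CPU memory immediately around the call that shows up in the
        # OOM stack trace; force=True makes see_memory_usage() actually print.
        see_memory_usage('memory usage before gumbel', force=True)
        out = F.gumbel_softmax(x, tau=tau, hard=True)
        see_memory_usage('memory usage after gumbel', force=True)
        return out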

@tjruwase (Contributor)

@buttercutter, you might try activation checkpointing to address this if the smaller micro batch size works. Here are some docs

  1. PyTorch: https://pytorch.org/docs/stable/checkpoint.html
  2. DeepSpeed: https://deepspeed.readthedocs.io/en/latest/activation-checkpointing.html

@buttercutter (Author)

@tjruwase If I use a train_micro_batch_size_per_gpu of 1, I get the following error, which was spotted, reported, and resolved earlier in a previous GitHub issue:

Traceback (most recent call last):
  File "gdas.py", line 947, in <module>
    ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
  File "gdas.py", line 690, in train_NN
    Ltrain = criterion(NN_output, NN_train_labels)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/loss.py", line 1166, in forward
    label_smoothing=self.label_smoothing)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 3014, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (8) to match target batch_size (1).

@tjruwase (Contributor)

@buttercutter, yes, this is not surprising, as it indicates a mismatch between the batch size passed to the client script via the command line and the batch size in the ds_config. In the previous case, it made sense to modify the ds_config value of 1 to match the command-line value of 8. But now, since we are trying to profile the memory behavior without encountering out-of-memory errors, it makes sense to modify the command-line value to 1 to match the ds_config. Does that make sense?

@buttercutter (Author)

@tjruwase OK, I modified the batch_size value to 1 in both the ds_config and the client script for testing purposes. So far, the GPU memory usage stays around 4 GB.

However, I am confused as to why the batch_size needs to be reduced in order to solve the CUDA out-of-memory error, when the primary purpose of the DeepSpeed package is to offload memory usage from the GPU to the CPU?

@tjruwase (Contributor)

@buttercutter, thanks for the update.

No, we are not solving the problem by reducing the batch_size; rather, we are trying to confirm whether the memory bloat is of a kind that ZeRO is designed to solve. There are two major sources of memory consumption in model training:

  1. Model state: parameters, gradients, optimizer state
  2. Activations due to inputs

ZeRO is designed to solve (1) but not (2). For (2), you need to use activation checkpointing, as in the links I shared earlier. Here are some next steps that could give more insight:

  1. Measure impact of batch size (activations) on memory by trying batch size 2 and 4
  2. Use see_memory_usage() to track memory utilization at vital points, such as before/after forward and before/after backward for the batch sizes that don't cause out-of-memory errors.
  3. Enable activation checkpointing to see the impact

@buttercutter (Author)

@tjruwase Regarding using checkpointing to trade compute for memory, should I use only the DeepSpeed version?

And how would I perform checkpointing.configure() for this particular gumbel-softmax call: gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)?

As for the DeepSpeed config JSON file, how should I properly modify the following params?

 "activation_checkpointing": {
    "partition_activations": false,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
    }

@tjruwase (Contributor)

@buttercutter, yes, I recommend DeepSpeed's activation checkpointing because it supports offloading the activation inputs to CPU memory. In terms of enabling it, there are two parts involved:

  1. Wrap the forward function of the client modeling code with deepspeed's checkpoint() as done in this file. In your case, I guess you want to wrap a caller of this.
  2. Configure the deepspeed config json by trying out different settings. I can help with that later.

However, I think activation checkpointing is the last of the 3 steps that I proposed. In particular, how to effectively use it will depend on the findings from the first two steps, so I recommend completing those first. Also, activation checkpointing is not easy to implement so I suggest first testing out the Megatron-DeepSpeed implementation that I referenced earlier to understand how it works.
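A rough sketch of part 1, assuming deepspeed.checkpointing.configure()/checkpoint() are used directly and that a cell's forward() is the memory-heavy call worth wrapping; checkpointed_cell_forward and cell are illustrative names, not gdas.py code:

    import deepspeed

    # Reads the "activation_checkpointing" section of the JSON config quoted
    # above; mpu_=None is fine as long as partition_activations stays false.
    deepspeed.checkpointing.configure(None, deepspeed_config="ds_config.json")

    def checkpointed_cell_forward(cell, x):
        def custom_forward(inp):
            # mirrors the non-checkpointed forward logic
            return cell.forward(inp)
        # Inputs are stashed, and the intermediate activations inside
        # cell.forward() are recomputed during backward instead of being kept
        # in GPU memory.
        return deepspeed.checkpointing.checkpoint(custom_forward, x)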

@buttercutter (Author) commented Sep 30, 2022

@tjruwase I used a batch_size of 4, and it also resulted in memory usage of around 4 GB according to see_memory_usage() around the gumbel-softmax function. What do you suggest in this particular case before proceeding to activation checkpointing?

[2022-09-29 19:47:32,987] [INFO] [utils.py:827:see_memory_usage] memory usage before gumbel
[2022-09-29 19:47:32,988] [INFO] [utils.py:832:see_memory_usage] MA 0.61 GB         Max_MA 0.61 GB         CA 0.68 GB         Max_CA 1 GB 
[2022-09-29 19:47:32,988] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.44 GB, percent = 35.0%
[2022-09-29 19:47:33,142] [INFO] [utils.py:827:see_memory_usage] memory usage after gumbel
[2022-09-29 19:47:33,142] [INFO] [utils.py:832:see_memory_usage] MA 0.61 GB         Max_MA 0.61 GB         CA 0.68 GB         Max_CA 1 GB 
[2022-09-29 19:47:33,143] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.44 GB, percent = 35.0%

@buttercutter (Author)

@tjruwase If I stick with a batch_size of 4, the training process runs smoothly without the CUDA out-of-memory issue so far, but of course the training time is much longer compared to a batch_size of 8, 16 or 32. So I suppose that is an incentive to use activation checkpointing in this particular case? Could you advise on this?

@buttercutter (Author) commented Oct 15, 2022

@tjruwase It is a bit strange that see_memory_usage() returned the same memory usage both before and after the gumbel_softmax() call.

What do you suggest then? See the following:

File "gdas.py", line 152, in forward_edge
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch) 

@buttercutter (Author)

@tjruwase See the following log, which does not seem very useful for further debugging:


[2022-11-09 04:45:52,214] [INFO] [utils.py:827:see_memory_usage] memory usage before variable ret
[2022-11-09 04:45:52,215] [INFO] [utils.py:832:see_memory_usage] MA 12.34 GB         Max_MA 12.34 GB         CA 13.71 GB         Max_CA 14 GB 
[2022-11-09 04:45:52,215] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.63 GB, percent = 36.5%
[2022-11-09 04:45:52,345] [INFO] [utils.py:827:see_memory_usage] memory usage after variable ret
[2022-11-09 04:45:52,346] [INFO] [utils.py:832:see_memory_usage] MA 12.34 GB         Max_MA 12.34 GB         CA 13.71 GB         Max_CA 14 GB 
[2022-11-09 04:45:52,346] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.63 GB, percent = 36.5%
[2022-11-09 04:45:52,484] [INFO] [utils.py:827:see_memory_usage] memory usage before variable ret
[2022-11-09 04:45:52,485] [INFO] [utils.py:832:see_memory_usage] MA 12.34 GB         Max_MA 12.34 GB         CA 13.71 GB         Max_CA 14 GB 
[2022-11-09 04:45:52,485] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.63 GB, percent = 36.5%
Traceback (most recent call last):
  File "gdas.py", line 960, in <module>
    lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
  File "gdas.py", line 785, in train_architecture
    graph.forward(val_inputs, types="edge")  # Lval(w*, alpha)
  File "gdas.py", line 558, in forward
    self.cells[c].forward(x, x1, x2, c, types=types)
  File "gdas.py", line 449, in forward
    self.nodes[n].forward(x2, node_num=n, types=types)  # Ltrain(w±, alpha)
  File "gdas.py", line 322, in forward
    y = self.connections[cc].forward(x, types)
  File "gdas.py", line 279, in forward
    edges_results = edges_results + self.edges[e].forward(x, types)
  File "gdas.py", line 169, in forward
    y_hat = self.forward_edge(x)
  File "gdas.py", line 154, in forward_edge
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1899, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.76 GiB total capacity; 12.34 GiB already allocated; 15.75 MiB free; 13.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[3a98c6920156:00675] *** Process received signal ***
[3a98c6920156:00675] Signal: Segmentation fault (11)
[3a98c6920156:00675] Signal code: Address not mapped (1)
[3a98c6920156:00675] Failing at address: 0x7f7953ad920d
[3a98c6920156:00675] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f7956783980]
[3a98c6920156:00675] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f79563c2775]
[3a98c6920156:00675] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f7956c2de44]
[3a98c6920156:00675] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f79563c3605]
[3a98c6920156:00675] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f7956c2bcb3]
[3a98c6920156:00675] *** End of error message ***

@tjruwase (Contributor) commented Nov 9, 2022

The log snippet suggests that this code passes the 1st time but fails the 2nd time. Is that correct?

@buttercutter (Author) commented Nov 10, 2022

@tjruwase Not correct; the log printouts from see_memory_usage() actually happened for both the forward pass and the backward pass.

May I ask how I should modify the debug printout flow so that see_memory_usage() logs output ONLY for the backward pass?

@tjruwase (Contributor)

I see, thanks for the clarification. I thought the 2nd was also a forward pass because of the multiple forward* references in the stack trace, as well as the softmax computation. So, does the backward pass also include a forward pass that causes this stack trace? I am trying to get a mental model of the execution flow, as the model is behaving differently from my experience.

You don't need to restrict see_memory_usage to the backward pass only. Also, can you share the equivalent log snippet for the smaller micro-batch size that runs without OOM?

@tjruwase (Contributor)

Also, can you please share the memory usage for batch sizes 1, 2, and 4, so we can understand any pattern from increasing batch size? Thanks!

@buttercutter (Author)

You are right that those 2 see_memory_usage() calls in the log snippet above (which uses a batch size of 256) were executed during the forward pass.

The reason I asked for see_memory_usage() to be only in the backward pass is that the logging printout takes a huge toll on the overall running time of the forward pass, and I am using limited compute units inside a Google Colab session, so extra runtime will cost money in this case.

Could you advise on limiting see_memory_usage() to execute only inside the backward pass? Even a smaller batch size of 2 or 4 will only worsen this extremely long runtime issue when see_memory_usage() is enabled for every forward pass.

@tjruwase (Contributor)

You can call see_memory_usage() with force set to True only for the backward pass.
[screenshot]

@buttercutter (Author)

I googled, but there seems to be no specific built-in PyTorch variable that can differentiate the forward and backward passes during runtime execution?

Please correct me if I missed anything.

@tjruwase (Contributor)

Can you define your own global variable that you set/clear around the backward call?
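A minimal sketch of that idea, assuming the usual model_engine.backward(loss) call from the DeepSpeed training loop; IN_BACKWARD and profiled_backward are illustrative names:

    from deepspeed.runtime.utils import see_memory_usage

    # Module-level flag flipped around the backward call, so that
    # see_memory_usage() only prints there.
    IN_BACKWARD = False

    def profiled_backward(model_engine, loss):
        global IN_BACKWARD
        IN_BACKWARD = True
        see_memory_usage('memory usage before backward', force=IN_BACKWARD)
        model_engine.backward(loss)
        see_memory_usage('memory usage after backward', force=IN_BACKWARD)
        IN_BACKWARD = False

    # Any see_memory_usage() call elsewhere (e.g. inside forward_edge()) can
    # pass force=IN_BACKWARD; with the flag cleared, the call returns without
    # logging.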

@buttercutter (Author) commented Nov 11, 2022

I followed your suggestion of using a global variable; see the log below.

Here are the exact source files used to obtain the following log snippet:

[2022-11-11 08:40:17,899] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.08 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:17,899] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:18,036] [INFO] [utils.py:827:see_memory_usage] Engine before step
[2022-11-11 08:40:18,037] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:18,037] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:18,049] [INFO] [logging.py:68:log_dist] [Rank 0] step=2498, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-11-11 08:40:18,050] [INFO] [timer.py:207:stop] 0/2498, RunningAvgSamplesPerSec=3.066608367950589, CurrSamplesPerSec=2.2201241199743493, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
[2022-11-11 08:40:18,191] [INFO] [utils.py:827:see_memory_usage] Engine after step
[2022-11-11 08:40:18,191] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:18,191] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:18,328] [INFO] [utils.py:827:see_memory_usage] Engine before forward
[2022-11-11 08:40:18,329] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:18,330] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:18,554] [INFO] [utils.py:827:see_memory_usage] Engine after forward
[2022-11-11 08:40:18,555] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:18,555] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:18,703] [INFO] [utils.py:827:see_memory_usage] Engine before backward
[2022-11-11 08:40:18,704] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:18,704] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:19,893] [INFO] [utils.py:827:see_memory_usage] Engine after backward
[2022-11-11 08:40:19,893] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.08 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:19,894] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:20,024] [INFO] [utils.py:827:see_memory_usage] Engine before step
[2022-11-11 08:40:20,025] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:20,025] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:20,036] [INFO] [logging.py:68:log_dist] [Rank 0] step=2499, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-11-11 08:40:20,037] [INFO] [timer.py:207:stop] 0/2499, RunningAvgSamplesPerSec=3.066099503151551, CurrSamplesPerSec=2.168111862121388, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
[2022-11-11 08:40:20,174] [INFO] [utils.py:827:see_memory_usage] Engine after step
[2022-11-11 08:40:20,175] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:20,175] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:20,308] [INFO] [utils.py:827:see_memory_usage] Engine before forward
[2022-11-11 08:40:20,309] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:20,309] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:20,531] [INFO] [utils.py:827:see_memory_usage] Engine after forward
[2022-11-11 08:40:20,532] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:20,532] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:20,666] [INFO] [utils.py:827:see_memory_usage] Engine before backward
[2022-11-11 08:40:20,667] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:20,667] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:21,834] [INFO] [utils.py:827:see_memory_usage] Engine after backward
[2022-11-11 08:40:21,835] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.08 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:21,835] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:21,971] [INFO] [utils.py:827:see_memory_usage] Engine before step
[2022-11-11 08:40:21,972] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:21,972] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:21,983] [INFO] [logging.py:68:log_dist] [Rank 0] step=2500, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-11-11 08:40:21,983] [INFO] [timer.py:207:stop] 0/2500, RunningAvgSamplesPerSec=3.0656260018592953, CurrSamplesPerSec=2.2124665882548116, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
[2022-11-11 08:40:22,127] [INFO] [utils.py:827:see_memory_usage] Engine after step
[2022-11-11 08:40:22,128] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:22,128] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
Finished train_NN()
Traceback (most recent call last):
  File "gdas.py", line 963, in <module>
    lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
  File "gdas.py", line 788, in train_architecture
    graph.forward(val_inputs, types="edge")  # Lval(w*, alpha)
  File "gdas.py", line 559, in forward
    self.cells[c].forward(x, x1, x2, c, types=types)
  File "gdas.py", line 449, in forward
    self.nodes[n].forward(x1, node_num=n, types=types) + \
  File "gdas.py", line 323, in forward
    y = self.connections[cc].forward(x, types)
  File "gdas.py", line 280, in forward
    edges_results = edges_results + self.edges[e].forward(x, types)
  File "gdas.py", line 170, in forward
    y_hat = self.forward_edge(x)
  File "gdas.py", line 155, in forward_edge
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1900, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.51 GiB already allocated; 3.75 MiB free; 13.72 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[6b0e6168b82c:01203] *** Process received signal ***
[6b0e6168b82c:01203] Signal: Segmentation fault (11)
[6b0e6168b82c:01203] Signal code: Address not mapped (1)
[6b0e6168b82c:01203] Failing at address: 0x7fbd6e68520d
[6b0e6168b82c:01203] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7fbd71530980]
[6b0e6168b82c:01203] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7fbd7116f775]
[6b0e6168b82c:01203] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7fbd719dae44]
[6b0e6168b82c:01203] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7fbd71170605]
[6b0e6168b82c:01203] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7fbd719d8cb3]
[6b0e6168b82c:01203] *** End of error message ***

@tjruwase (Contributor)

Thanks for sharing the log. I have a few questions:

  1. What is the batch size?
  2. It seems the training ran smoothly for 2500 steps without OOM. Is this correct? This is surprising because, in my experience, OOMs occur within the first 2 or 3 steps since the memory usage of DL training is quite stable.
  3. What is the meaning of this message that appears before the stack trace: Finished train_NN()
  4. Can you please add the following lines before each iteration? I am curious whether this is memory fragmentation.
        gc.collect()
        torch.cuda.empty_cache()
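A small sketch of where those two calls could live, assuming they are invoked once per iteration at the top of the training loop in gdas.py; the helper name is made up:

    import gc
    import torch

    def clear_cuda_caches():
        # Drop unreachable Python objects that may still pin CUDA tensors, then
        # return cached blocks to the allocator so fragmentation can be ruled out.
        gc.collect()
        torch.cuda.empty_cache()

    # Call clear_cuda_caches() once per iteration, e.g. at the top of the
    # `while not_converged:` loop before train_NN() / train_architecture().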

@buttercutter (Author)

I have attached all the relevant source files here

  1. batch size = 4

  2. Yes, ran smoothly for train_NN() before train_architecture(). Remember this is a bi-level optimization (See algorithm 1).

  3. Do we need to take care of the log snippet before Finished train_NN() since the issue is inside train_architecture() which is executed after Finished train_NN() ?

  4. Let me do as you suggested. Might I ask if this is to be put inside /usr/local/lib/python3.7/dist-packages/torch/nn/functional.py ? Is it to be put before or after ret = y_hard - y_soft.detach() + y_soft ?

@buttercutter (Author)

@tjruwase

If you look at the modified gdas.py incorporating your suggestion, together with the CUDA OOM error log, can I infer that it is not due to memory fragmentation, as you suspected earlier?

Please advise.

@tjruwase (Contributor)

Thanks for sharing the code and log. I agree that this is probably not a memory fragmentation issue. Also, you can remove the logging in train_NN() as that runs fine and the failure is in train_architecture().

Also, I think the following log that you collected from gumbel_softmax() might have the explanation; sorry for not realizing this earlier
[screenshot]

Let's also combine with the OOM message:
[screenshot]

There are two things to notice here:

  1. The Max_CA is 14GB before the code.
  2. The total GPU memory available to the process is 14.76GB, not the total 16GB.

This is already an OOM condition that was lucky not to have failed earlier, because any attempt to allocate even a small amount of intermediate memory, like 20MB in this case, is sufficient to cause a failure.

I don't yet see the cause of this high memory usage. But it seems to me that the memory profile of train_NN() is very different from train_architecture(), do you agree?

@buttercutter (Author)

I cannot comment on the memory pattern of train_architecture() because its execution path was terminated due to OOM in its first epoch.

train_NN() itself already consumed about 14GB of GPU memory, which is strange. Is it not the purpose of DeepSpeed to make sure that no function's execution path consumes this much GPU memory?

Please suggest any possible solution in this particular case.

@tjruwase (Contributor)

I think you misunderstood my point, so let me restate:

  1. The memory log contains 4 numbers: the memory consumed by the model is labelled MA, while the memory held by the PyTorch caching allocator is Max_CA. Please see here for further discussion of CUDA memory management.
  2. train_NN() consumes just 0.2GB; see the screenshot below of the log at the end of train_NN().

[screenshot]

3. The model memory consumption in `train_architecture()` just before the crash is `12.34GB`. Since this is happening during `forward`, it indicates activation memory consumption.

[screenshot]

4. As I said earlier, there are two major sources of memory consumption in model training. In this case, activation memory consumption seems to be the problem, and ZeRO is not designed for that. So, you need activation checkpointing. This is why it is important to understand the memory pattern of the execution.

Hope that is helpful.
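For reference on point 1, the four numbers roughly map onto PyTorch's allocator counters; a small sketch of how to read them directly (the snapshot helper is made up, and the exact arithmetic inside see_memory_usage() may differ slightly):

    import torch

    def snapshot(tag):
        gb = 1024 ** 3
        ma = torch.cuda.memory_allocated() / gb          # ~ MA: memory held by live tensors
        max_ma = torch.cuda.max_memory_allocated() / gb  # ~ Max_MA
        ca = torch.cuda.memory_reserved() / gb           # ~ CA: caching allocator, i.e. the
        max_ca = torch.cuda.max_memory_reserved() / gb   #   "reserved" figure in the OOM message
        print(f'{tag}: MA {ma:.2f} GB  Max_MA {max_ma:.2f} GB  CA {ca:.2f} GB  Max_CA {max_ca:.2f} GB')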

@buttercutter (Author) commented Nov 19, 2022

In this case then, I will proceed with the use of activation checkpointing.

It seems that you are right about the difficulty of implementing a custom forward function with activation checkpointing, given that my current model architecture is based on a network architecture search approach, where there are multiple parallel edges (different types of NN operations) between two computation nodes.

How would I start in this case?

tjruwase added the question (Further information is requested) label and removed the bug (Something isn't working) label on Dec 12, 2022
@buttercutter (Author) commented Dec 20, 2022

@tjruwase

Since a custom forward function with activation checkpointing is a bit difficult to implement in my code, would the gradient accumulation trick help instead?

This JAX code is modified from the original Hugging Face version:

        if len(accumulated_gradients) < training_args.gradient_accumulation_steps:
            accumulated_gradients.append(grad)
            new_state = state
        else:
            grad = jax.tree_multimap(lambda *x: jnp.sum(jnp.stack(x), axis=0), *accumulated_gradients)
            new_state = state.apply_gradients(grads=grad, dropout_rng=new_dropout_rng)
            accumulated_gradients = []

[screenshot]

@tjruwase (Contributor)

Yes, you can also use gradient accumulation, but it won't get the same throughput as a larger batch size. You should add gradient_accumulation_steps to ds_config, as discussed here.

@buttercutter (Author)

@tjruwase The gradient accumulation trick is going to use up more memory, so it will not solve the OOM issue.

I still need to code the custom forward function for enabling activation checkpointing.

@buttercutter (Author)

@tjruwase

If you look at the following actual forward function that I need to modify or customize for gradient checkpointing, you will notice that it does not really return something similar to your hidden_states, moe_losses tuple.

So, how would I make use of the mpu.checkpoint() function in this particular case?

Is torch.utils.checkpoint alright with checkpointing values other than losses?

     # self-defined initial NAS architecture, for supernet architecture edge weight training
     def forward_edge(self, x):
         self.__freeze_f()
         self.__unfreeze_w()
 
         # Refer to GDAS equations (5) and (6)
         # if one_hot is already there, would summation be required given that all other entries are forced to 0 ?
         # It's not required, but you don't know, which index is one hot encoded 1.
         # https://pytorch.org/docs/stable/nn.functional.html#gumbel-softmax
         # See also https://github.com/D-X-Y/AutoDL-Projects/issues/10#issuecomment-916619163
 
         gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
         chosen_edge = torch.argmax(gumbel, dim=0)  # converts one-hot encoding into integer
 
         return chosen_edge

@tjruwase (Contributor) commented Jan 8, 2023

@buttercutter, gradient checkpointing functions like mpu.checkpoint() and torch.utils.checkpoint are simply function wrappers and so can work with any input and return type. The key thing to observe in this example is how custom_forward() mimics the non-checkpointed forward logic.

Here is another use of gradient checkpointing, for BERT; perhaps that could be helpful.

Another point to keep in mind is that gradient checkpointing is most useful when applied to forward() functions that consume a lot of memory. So, I suggest first memory profiling the forward() functions in your code to see which ones would provide the most memory savings.
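A minimal sketch of the same pattern with torch.utils.checkpoint, showing that the wrapped function can return an arbitrary tensor rather than a loss; edge_op and checkpointed_edge_forward are illustrative stand-ins, not gdas.py code:

    import torch
    from torch.utils.checkpoint import checkpoint

    def checkpointed_edge_forward(edge_op: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
        def custom_forward(inp):
            # mirrors the non-checkpointed forward logic of one memory-heavy edge op
            return edge_op(inp)
        # Activations inside edge_op are recomputed during backward instead of
        # being stored; the return value is just a tensor, not a loss.
        return checkpoint(custom_forward, x)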

@buttercutter (Author)

Another point to keep in mind is that gradient checkpointing is most useful when applied to forward() functions that consume a lot of memory. So, I suggest first memory profiling the forward() functions in your code to see which ones would provide the most memory savings.

We have analyzed this previously; feel free to point out anything I have missed, though.

See #2302 (comment) and #2302 (comment)

@tjruwase (Contributor) commented Jan 9, 2023

Sorry, I may have forgotten critical findings along the way. If it is not too much trouble, it might still be worthwhile to measure the memory usage of each operation in forward_edge(). A finer-grained view of the memory behavior might give useful insights.

@buttercutter (Author) commented Jan 16, 2023

@tjruwase It seems that the memory access pattern (the latest gdas.ipynb logs are attached here) has changed significantly since I last debugged a month ago.

The traceback is now pointing to the autograd library instead of gumbel_softmax.
Could you advise?

[2023-01-15 19:09:46,662] [INFO] [logging.py:68:log_dist] [Rank 0] step=2495, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2023-01-15 19:09:46,663] [INFO] [timer.py:197:stop] 0/2495, RunningAvgSamplesPerSec=3.0126676443301603, CurrSamplesPerSec=2.1106517284551995, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
[2023-01-15 19:09:46,808] [INFO] [utils.py:827:see_memory_usage] Engine after step
[2023-01-15 19:09:46,808] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:46,809] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:46,956] [INFO] [utils.py:827:see_memory_usage] Engine before forward
[2023-01-15 19:09:46,957] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:46,957] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:47,178] [INFO] [utils.py:827:see_memory_usage] Engine after forward
[2023-01-15 19:09:47,179] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:47,179] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.04 GB, percent = 31.9%
[2023-01-15 19:09:47,316] [INFO] [utils.py:827:see_memory_usage] Engine before backward
[2023-01-15 19:09:47,317] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:47,317] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.04 GB, percent = 31.9%
[2023-01-15 19:09:48,492] [INFO] [utils.py:827:see_memory_usage] Engine after backward
[2023-01-15 19:09:48,493] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.08 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:48,493] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.04 GB, percent = 31.9%
[2023-01-15 19:09:48,636] [INFO] [utils.py:827:see_memory_usage] Engine before step
[2023-01-15 19:09:48,637] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:48,637] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.04 GB, percent = 31.9%
[2023-01-15 19:09:48,649] [INFO] [logging.py:68:log_dist] [Rank 0] step=2496, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2023-01-15 19:09:48,649] [INFO] [timer.py:197:stop] 0/2496, RunningAvgSamplesPerSec=3.0122015929276613, CurrSamplesPerSec=2.1738385007924554, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
[2023-01-15 19:09:48,802] [INFO] [utils.py:827:see_memory_usage] Engine after step
[2023-01-15 19:09:48,803] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:48,803] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.04 GB, percent = 31.9%
[2023-01-15 19:09:48,947] [INFO] [utils.py:827:see_memory_usage] Engine before forward
[2023-01-15 19:09:48,948] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:48,948] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.04 GB, percent = 31.9%
[2023-01-15 19:09:49,174] [INFO] [utils.py:827:see_memory_usage] Engine after forward
[2023-01-15 19:09:49,174] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:49,175] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.04 GB, percent = 31.9%
[2023-01-15 19:09:49,313] [INFO] [utils.py:827:see_memory_usage] Engine before backward
[2023-01-15 19:09:49,314] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:49,314] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.04 GB, percent = 31.9%
[2023-01-15 19:09:50,482] [INFO] [utils.py:827:see_memory_usage] Engine after backward
[2023-01-15 19:09:50,482] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.08 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:50,483] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:50,625] [INFO] [utils.py:827:see_memory_usage] Engine before step
[2023-01-15 19:09:50,626] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:50,627] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:50,638] [INFO] [logging.py:68:log_dist] [Rank 0] step=2497, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2023-01-15 19:09:50,638] [INFO] [timer.py:197:stop] 0/2497, RunningAvgSamplesPerSec=3.0117412688356078, CurrSamplesPerSec=2.180631479454341, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
[2023-01-15 19:09:50,789] [INFO] [utils.py:827:see_memory_usage] Engine after step
[2023-01-15 19:09:50,790] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:50,790] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:50,939] [INFO] [utils.py:827:see_memory_usage] Engine before forward
[2023-01-15 19:09:50,940] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:50,940] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:51,168] [INFO] [utils.py:827:see_memory_usage] Engine after forward
[2023-01-15 19:09:51,168] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:51,168] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:51,307] [INFO] [utils.py:827:see_memory_usage] Engine before backward
[2023-01-15 19:09:51,308] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:51,308] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:52,476] [INFO] [utils.py:827:see_memory_usage] Engine after backward
[2023-01-15 19:09:52,477] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.08 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:52,477] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.04 GB, percent = 31.9%
[2023-01-15 19:09:52,623] [INFO] [utils.py:827:see_memory_usage] Engine before step
[2023-01-15 19:09:52,624] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:52,624] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.04 GB, percent = 31.9%
[2023-01-15 19:09:52,635] [INFO] [logging.py:68:log_dist] [Rank 0] step=2498, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2023-01-15 19:09:52,635] [INFO] [timer.py:197:stop] 0/2498, RunningAvgSamplesPerSec=3.0112720428617563, CurrSamplesPerSec=2.168382273493112, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
[2023-01-15 19:09:52,783] [INFO] [utils.py:827:see_memory_usage] Engine after step
[2023-01-15 19:09:52,784] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:52,784] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:52,935] [INFO] [utils.py:827:see_memory_usage] Engine before forward
[2023-01-15 19:09:52,936] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:52,936] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:53,165] [INFO] [utils.py:827:see_memory_usage] Engine after forward
[2023-01-15 19:09:53,166] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:53,166] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:53,311] [INFO] [utils.py:827:see_memory_usage] Engine before backward
[2023-01-15 19:09:53,311] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:53,311] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:54,490] [INFO] [utils.py:827:see_memory_usage] Engine after backward
[2023-01-15 19:09:54,491] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.08 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:54,491] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:54,634] [INFO] [utils.py:827:see_memory_usage] Engine before step
[2023-01-15 19:09:54,635] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:54,635] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:54,645] [INFO] [logging.py:68:log_dist] [Rank 0] step=2499, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2023-01-15 19:09:54,646] [INFO] [timer.py:197:stop] 0/2499, RunningAvgSamplesPerSec=3.0107880862910275, CurrSamplesPerSec=2.148806074829796, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
[2023-01-15 19:09:54,799] [INFO] [utils.py:827:see_memory_usage] Engine after step
[2023-01-15 19:09:54,800] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:54,800] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:54,951] [INFO] [utils.py:827:see_memory_usage] Engine before forward
[2023-01-15 19:09:54,952] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:54,952] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:55,177] [INFO] [utils.py:827:see_memory_usage] Engine after forward
[2023-01-15 19:09:55,178] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:55,178] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:55,318] [INFO] [utils.py:827:see_memory_usage] Engine before backward
[2023-01-15 19:09:55,318] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:55,319] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:56,486] [INFO] [utils.py:827:see_memory_usage] Engine after backward
[2023-01-15 19:09:56,487] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.08 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:56,487] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:56,630] [INFO] [utils.py:827:see_memory_usage] Engine before step
[2023-01-15 19:09:56,630] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:56,630] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
[2023-01-15 19:09:56,641] [INFO] [logging.py:68:log_dist] [Rank 0] step=2500, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2023-01-15 19:09:56,641] [INFO] [timer.py:197:stop] 0/2500, RunningAvgSamplesPerSec=3.0103234047778877, CurrSamplesPerSec=2.1729161539995805, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
[2023-01-15 19:09:56,788] [INFO] [utils.py:827:see_memory_usage] Engine after step
[2023-01-15 19:09:56,789] [INFO] [utils.py:828:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2023-01-15 19:09:56,789] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 4.05 GB, percent = 31.9%
Finished train_NN()
Traceback (most recent call last):
  File "gdas.py", line 988, in <module>
    lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
  File "gdas.py", line 853, in train_architecture
    Lval.backward(retain_graph=True)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.41 GiB already allocated; 3.75 MiB free; 13.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f44bdd54280>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/cpu_adam.py", line 108, in __del__
TypeError: 'NoneType' object is not callable
[151da56746f4:01186] *** Process received signal ***
[151da56746f4:01186] Signal: Segmentation fault (11)
[151da56746f4:01186] Signal code: Address not mapped (1)
[151da56746f4:01186] Failing at address: 0x7fe247f7320d
[151da56746f4:01186] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7fe24aa18980]
[151da56746f4:01186] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7fe24a657775]
[151da56746f4:01186] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7fe24aec2e44]
[151da56746f4:01186] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7fe24a658605]
[151da56746f4:01186] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7fe24aec0cb3]
[151da56746f4:01186] *** End of error message ***
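
For reference, here is a minimal, self-contained sketch of what I suspect is happening (it is not taken from gdas.py; report, w, x, and retained_losses are made-up names). Calling backward with retain_graph=True while also keeping a reference to the loss holds every saved activation alive, so the allocator's "allocated" number climbs on every iteration. If something similar is happening in train_architecture(), it would explain why the traceback reports 13.41 GiB already allocated while DeepSpeed's MemAllocated counter above stays near 0.02 GB:

import torch

def report(tag):
    # Query the CUDA caching allocator directly; these numbers are independent
    # of the MemAllocated/MaxMemAllocated values printed by DeepSpeed above.
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={alloc:.1f} MiB, reserved={reserved:.1f} MiB")

device = "cuda"
w = torch.randn(1024, 1024, device=device, requires_grad=True)

retained_losses = []
for step in range(5):
    x = torch.randn(1024, 1024, device=device)
    loss = (x @ w).pow(2).mean()
    retained_losses.append(loss)      # holding the loss keeps its whole graph alive
    loss.backward(retain_graph=True)  # retain_graph=True also prevents graph freeing
    report(f"after backward, step {step}")

Wrapping the real Lval.backward(retain_graph=True) call in train_architecture() with the same two allocator queries should show whether the graph is being retained across validation steps. Setting the allocator option that the error message itself suggests (e.g. PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128, where 128 is just an example value) only mitigates fragmentation; it would not help if the graph really is growing.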
