🐛 Bug

I just started a training run for a full fine-tune of the DeepSeek-Coder-V2-Lite-Base MoE model (16B params, 2.4B active) on an 8x80GB A100 machine, and the LLM Studio UI is saying it's going to take nearly 3 days to finish. I have about 65K training pairs; for comparison, it takes 1.5-2 hours to train Llama 3 7B (full fine-tune) and maybe 16 hours to train Llama 3 70B. Any ideas on what might be going on? I know vLLM needed a patch to run the model; I'm not sure if there are optimizations that haven't landed in Torch yet that would make it run more quickly.

Below is my nvidia-smi output during the run.
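One way to narrow this down outside of LLM Studio is to time the bare Hugging Face forward pass on a single GPU and compare it against a dense model of similar active size on the same hardware; if the MoE forward alone is already several times slower, the bottleneck is likely in the model's custom modeling code rather than in the Studio training pipeline. A minimal sketch, assuming the deepseek-ai/DeepSeek-Coder-V2-Lite-Base repo id and bf16 on one 80GB A100:

```python
# Minimal sketch, not the LLM Studio training loop: time the bare HF forward
# pass on one GPU to see whether the MoE forward itself is the slow part.
# Assumes the deepseek-ai/DeepSeek-Coder-V2-Lite-Base repo id; the model ships
# custom modeling code, so trust_remote_code=True is required.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Base"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")
model.eval()

# Identical prompts so no padding is needed.
batch = tok(["def quicksort(arr):\n" * 32] * 4, return_tensors="pt").to("cuda")

with torch.no_grad():
    model(**batch)  # warm-up
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(10):
        model(**batch)
    torch.cuda.synchronize()
print(f"avg forward time: {(time.time() - start) / 10 * 1000:.1f} ms")
```

Running the same probe against a dense Llama 3 checkpoint gives a rough baseline for how much of the 3-day estimate is attributable to the model implementation itself.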
I can confirm the same observation. Did you try whether a single GPU behaves differently?
In general, they use custom modeling code for the model that is not directly integrated into HF, and MoE models frequently have their hiccups in terms of runtime as well.
If you have experience with their models, I'd be happy for some investigation and contributions.
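On the MoE runtime point, one common culprit in non-fused implementations is dispatching tokens to their experts one at a time in Python instead of grouping them per expert. The toy comparison below is purely illustrative (it is not DeepSeek's actual modeling code) but shows the kind of gap this can produce:

```python
# Illustrative only -- NOT DeepSeek's modeling code. Shows why per-token expert
# dispatch (many tiny matmuls) is much slower than grouping tokens per expert.
import time
import torch

tokens, hidden, n_experts, top_k = 512, 2048, 64, 6
x = torch.randn(tokens, hidden, device="cuda", dtype=torch.bfloat16)
experts = [torch.nn.Linear(hidden, hidden, device="cuda", dtype=torch.bfloat16)
           for _ in range(n_experts)]
topk_idx = torch.randint(0, n_experts, (tokens, top_k), device="cuda")

def naive(x, topk_idx):
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):          # Python loop over tokens
        for e in topk_idx[t].tolist():   # one tiny matmul per (token, expert)
            out[t] += experts[e](x[t])
    return out

def grouped(x, topk_idx):
    out = torch.zeros_like(x)
    flat_expert = topk_idx.flatten()
    flat_token = torch.arange(x.shape[0], device=x.device).repeat_interleave(top_k)
    for e in range(n_experts):           # one large matmul per expert
        sel = flat_token[flat_expert == e]
        if sel.numel():
            out.index_add_(0, sel, experts[e](x[sel]))
    return out

for fn in (naive, grouped):
    fn(x, topk_idx); torch.cuda.synchronize()   # warm-up
    start = time.time()
    fn(x, topk_idx); torch.cuda.synchronize()
    print(f"{fn.__name__}: {(time.time() - start) * 1000:.1f} ms")
```

Whether the custom DeepSeek code falls into this pattern would need to be checked in its remote modeling file, but it is the first place I would look for the slowdown.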