[BUG] Training DeepSeek-Coder-V2-Lite-Base MOE inordinately slow #764

Open
tmostak opened this issue Jun 26, 2024 · 1 comment
Labels
type/bug Bug in code

Comments

tmostak commented Jun 26, 2024

🐛 Bug

I just started a training run for a full fine-tune of the DeepSeek-Coder-V2-Lite-Base MoE model (16B parameters, 2.4B active) on an 8×80GB A100 machine, and the LLM Studio UI says it will take nearly 3 days to finish on my roughly 65K training pairs. For comparison, on the same data a full fine-tune of Llama 3 8B takes 1.5-2 hours and Llama 3 70B takes maybe 16 hours. Any ideas on what might be going on? I know vLLM needed a patch to run this model, so I'm not sure whether there are optimizations that haven't yet landed in Torch that would make it run more quickly.

Below is my nvidia-smi output during the run.

(base) ubuntu@207-211-184-180:~$ nvidia-smi
Wed Jun 26 23:23:09 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:08:00.0 Off |                    0 |
| N/A   51C    P0             202W / 400W |  52743MiB / 81920MiB |     94%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:09:00.0 Off |                    0 |
| N/A   48C    P0             214W / 400W |  53183MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:0A:00.0 Off |                    0 |
| N/A   44C    P0             151W / 400W |  53149MiB / 81920MiB |     46%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:0B:00.0 Off |                    0 |
| N/A   48C    P0             202W / 400W |  53529MiB / 81920MiB |     38%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  | 00000000:0C:00.0 Off |                    0 |
| N/A   49C    P0             153W / 400W |  53513MiB / 81920MiB |     79%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  | 00000000:0D:00.0 Off |                    0 |
| N/A   44C    P0             211W / 400W |  53087MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  | 00000000:0E:00.0 Off |                    0 |
| N/A   47C    P0             215W / 400W |  53567MiB / 81920MiB |     73%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  | 00000000:0F:00.0 Off |                    0 |
| N/A   50C    P0              98W / 400W |  52923MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1407823      C   ...nvs/h2o_llm_grid_search/bin/python3    52730MiB |
|    1   N/A  N/A   1407824      C   ...nvs/h2o_llm_grid_search/bin/python3    53170MiB |
|    2   N/A  N/A   1407825      C   ...nvs/h2o_llm_grid_search/bin/python3    53136MiB |
|    3   N/A  N/A   1407826      C   ...nvs/h2o_llm_grid_search/bin/python3    53516MiB |
|    4   N/A  N/A   1407827      C   ...nvs/h2o_llm_grid_search/bin/python3    53500MiB |
|    5   N/A  N/A   1407828      C   ...nvs/h2o_llm_grid_search/bin/python3    53074MiB |
|    6   N/A  N/A   1407829      C   ...nvs/h2o_llm_grid_search/bin/python3    53554MiB |
|    7   N/A  N/A   1407830      C   ...nvs/h2o_llm_grid_search/bin/python3    52910MiB |
+---------------------------------------------------------------------------------------+
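For what it's worth, one way to narrow down where the time goes (a rough sketch, not something measured in this report; the Hugging Face model id, prompt, and single-step setup are illustrative) is to profile one forward/backward pass on a single GPU with torch.profiler and see which CUDA ops dominate:

```python
# Hypothetical single-step profile (illustrative sketch, not from this report).
# Shows which CUDA ops dominate one forward/backward pass of the MoE model.
import torch
from torch.profiler import profile, ProfilerActivity
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Base"  # assumed HF model id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # DeepSeek-V2 ships custom modeling code
).cuda()

# Tiny dummy batch; real training batches will stress the experts differently.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda")
labels = inputs["input_ids"].clone()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = model(**inputs, labels=labels).loss
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```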
tmostak added the type/bug label on Jun 26, 2024
Collaborator

psinger commented Jul 11, 2024

I can confirm the same observation. Did you try whether a single GPU behaves differently?

In general, the model relies on custom code that is not directly integrated into HF, and MoE models frequently have their hiccups in terms of runtime as well.

If you have experience with their models, we'd be happy about investigations and contributions.
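For the single-GPU comparison, a quick sanity check outside of LLM Studio (a sketch only; batch size, sequence length, and step count are made up) is to time a few plain forward/backward steps and compare tokens/sec against a dense model such as Llama 3 8B:

```python
# Hypothetical timing loop (illustrative): measure raw tokens/sec for a few
# forward/backward steps on one GPU, so MoE vs. dense models can be compared.
import time
import torch
from transformers import AutoModelForCausalLM

model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Base"  # swap in a dense model to compare
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda()

batch_size, seq_len, n_steps = 1, 1024, 5  # made-up settings
input_ids = torch.randint(
    0, model.config.vocab_size, (batch_size, seq_len), device="cuda"
)

torch.cuda.synchronize()
start = time.time()
for _ in range(n_steps):
    loss = model(input_ids=input_ids, labels=input_ids).loss
    loss.backward()
    model.zero_grad(set_to_none=True)
torch.cuda.synchronize()

elapsed = time.time() - start
print(f"{batch_size * seq_len * n_steps / elapsed:.1f} tokens/sec (fwd+bwd)")
```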
