Distributed Training in Ludwig #3662

msmmpts · 2023-09-25T05:55:57Z

msmmpts
Sep 25, 2023

HI,

I was attempting to run distributed training on a kubernetes pod with 4 * NVIDIA T4's.

Here are my observations :

About 2.2 GB of memory was utilised by all the four GPU's
Only one GPU was actively being utilised

Can anyone advise how do we get all GPU's to be utilised?

`qlora_fine_tuning_config = yaml.safe_load(
"""
model_type: llm
base_model: meta-llama/Llama-2-7b-chat-hf

input_features:
  - name: instruction
    type: text


output_features:
  - name: output
    type: text

preprocessing:
  split_probabilities: [0.85, 0.10, 0.05]

prompt:
  template: >-
    Below is an instruction that describes a task, paired with an input
    that provides further context. Write a response that appropriately
    completes the request.

    ### Instruction: {instruction}

    ### Input: {input}

    ### Response:

generation:
  temperature: 0.1
  max_new_tokens: 512

adapter:
  type: lora

quantization:
  bits: 4

backend:
  type: ray

trainer:
  type: finetune
  use_gpu: True 
  epochs: 5
  batch_size: 8
  eval_batch_size: 4
  gradient_accumulation_steps: 16
  learning_rate: 0.0001
  optimizer:
    type: adam
    params:
      eps: 1.e-8
      betas:
        - 0.9
        - 0.999
      weight_decay: 0
  learning_rate_scheduler:
    warmup_fraction: 0.03
    reduce_on_plateau: 0
"""

)`

Thanks in advance

tgaddair · 2023-09-25T16:49:48Z

tgaddair
Sep 25, 2023
Maintainer

Hey @msmmpts, there are a couple things to point out.

First is that it looks like the Ray backend is not being selected for some reason. Can you share the command you're using to run the training job? Are you using the Python API or the Ludwig CLI? The reason why I assume this is the case is because we currently have some logic that will raise an error if you try to use the Ray backend with quantization configured, so it's surprising to me that things didn't fail.

We're actually working on a fix for this this week (cc @arnavgarg1) that will enable you to use data parallelism with quantized LLMs. That should address the issue you're seeing here where only on GPU is being used at a time.

2 replies

tgaddair Sep 25, 2023
Maintainer

I can confirm I get the expected error when trying to train using quantization and the ray backend:

Traceback (most recent call last):
  File "/home/ray/train_code_alpaca.py", line 99, in <module>
    model = LudwigModel(config=qlora_fine_tuning_config, logging_level=logging.INFO)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/api.py", line 313, in __init__
    self.config_obj = ModelConfig.from_dict(self._user_config)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/schema/model_types/base.py", line 141, in from_dict
    config_obj: ModelConfig = schema.load(config)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/marshmallow_dataclass/__init__.py", line 741, in load
    return clazz(**all_loaded)
  File "<string>", line 18, in __init__
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/schema/model_types/base.py", line 83, in __post_init__
    get_config_check_registry().check_config(self)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/config_validation/checks.py", line 48, in check_config
    check_fn(config)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ludwig/config_validation/checks.py", line 614, in check_llm_quantization_backend_incompatibility
    raise ConfigValidationError(f"LLM with quantization requires the 'local' backend, found: '{backend_type}'")
ludwig.error.ConfigValidationError: LLM with quantization requires the 'local' backend, found: 'ray'
command terminated with exit code 1

msmmpts Sep 26, 2023
Author

Hi @tgaddair ,

Thanks for your response.

We were using the Python API for running this.

The screenshot that was posted correspond to the one without utilisation of ray backend and hence it might show that it worked.

We were trying out with ray being at the back end but the build was failing.

Since this support will be enabled with a future fix, this issue should be sorted out.

Should we add any other config apart from in the yaml config for multi-GPU support?

backend:
  type: ray

and run this

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Distributed Training in Ludwig #3662

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 2 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Distributed Training in Ludwig #3662

msmmpts Sep 25, 2023

Replies: 1 comment · 2 replies

tgaddair Sep 25, 2023 Maintainer

tgaddair Sep 25, 2023 Maintainer

msmmpts Sep 26, 2023 Author

msmmpts
Sep 25, 2023

Replies: 1 comment 2 replies

tgaddair
Sep 25, 2023
Maintainer

tgaddair Sep 25, 2023
Maintainer

msmmpts Sep 26, 2023
Author