-
Hey @msmmpts, there are a couple of things to point out. First, it looks like the Ray backend is not being selected for some reason. Can you share the command you're using to run the training job? Are you using the Python API or the Ludwig CLI? The reason I assume this is the case is that we currently have some logic that raises an error if you try to use the Ray backend with quantization. We're actually working on a fix for this this week (cc @arnavgarg1) that will enable you to use data parallelism with quantized LLMs. That should address the issue you're seeing here where only one GPU is being used at a time.
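For reference, here is a minimal sketch of how the Ray backend can be requested explicitly in the config. This is only a sketch: the exact keys under `backend` should be checked against the Ludwig docs for your version, and the worker count is an assumption for a single node with 4 GPUs.

```python
import yaml

# Sketch only: requesting the Ray backend explicitly in the Ludwig config.
# num_workers = 4 is an assumption (one data-parallel worker per T4).
config_with_ray_backend = yaml.safe_load(
    """
model_type: llm
base_model: meta-llama/Llama-2-7b-chat-hf
backend:
  type: ray
  trainer:
    num_workers: 4
"""
)
```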
-
Hi,
I was attempting to run distributed training on a Kubernetes pod with 4 NVIDIA T4s.
Here are my observations:
Can anyone advise how we can get all GPUs to be utilised?
```python
import yaml

qlora_fine_tuning_config = yaml.safe_load(
    """
model_type: llm
base_model: meta-llama/Llama-2-7b-chat-hf
# ... (rest of the config omitted here)
"""
)
```
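For context, this is roughly how that config is run through the Python API (a sketch only; the dataset path `train.csv` is a placeholder, not from my actual setup):

```python
from ludwig.api import LudwigModel

# Sketch: training with the config above via the Ludwig Python API.
# "train.csv" is a placeholder dataset path.
model = LudwigModel(config=qlora_fine_tuning_config)
results = model.train(dataset="train.csv")
```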
Thanks in advance