train.py script crashes when using HQQ #37

Open
rationalism opened this issue Mar 24, 2024 · 3 comments
@rationalism

Here's the command I ran:

python train.py \
    --model_name meta-llama/Llama-2-70b-hf \
    --batch_size 1 \
    --context_length 1024 \
    --precision bf16 \
    --train_type hqq_lora \
    --use_gradient_checkpointing true \
    --use_cpu_offload false \
    --dataset alpaca \
    --reentrant_checkpointing true \
    --log_to wandb \
    --gradient_accumulation_steps 8 \
    --lr_scheduler linear \
    --verbose false \
    --lora_rank 16 \
    --no_sync true

This crashes with the following stack trace:

Creating model 0
Loading model 0
Loading & Quantizing Model Shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [03:02<00:00, 12.18s/it]
Model created 0 0.067 GB
LoRA layers added 0 0.067 GB
Wrapping model w/ FSDP 0
Traceback (most recent call last):
  File "/home/alyssa/lm_fun/fsdp_qlora/train.py", line 953, in <module>
    def main(
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/fastcore/script.py", line 125, in call_parse
    return _f()
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/fastcore/script.py", line 119, in _f
    return tfunc(**merge(args, args_from_prog(func, xtra)))
  File "/home/alyssa/lm_fun/fsdp_qlora/train.py", line 1026, in main
    mp.spawn(fsdp_main,
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 158, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
    fn(i, *args)
  File "/home/alyssa/lm_fun/fsdp_qlora/train.py", line 703, in fsdp_main
    model = FSDP(
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 477, in __init__
    _auto_wrap(
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
    _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  [Previous line repeated 1 more time]
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
    return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
    return wrapper_cls(module, **kwargs)
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 503, in __init__
    _init_param_handle_from_module(
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 548, in _init_param_handle_from_module
    _materialize_with_param_init_fn(
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 851, in _materialize_with_param_init_fn
    param_init_fn(module)
  File "/home/alyssa/lm_fun/fsdp_qlora/train.py", line 713, in <lambda>
    param_init_fn=lambda module: module.to_empty(device=torch.device("cuda"), recurse=False)
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/hqq/core/quantize.py", line 485, in to_empty
    return self.cuda(device)
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/hqq/core/quantize.py", line 419, in cuda
    self.W_q.data, self.meta = Quantizer.cuda(self.W_q.data, self.meta, device)
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/hqq/core/quantize.py", line 220, in cuda
    return Quantizer.to_inplace(W_q, meta, device=device)
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/hqq/core/quantize.py", line 181, in to_inplace
    W_q = W_q.to(device).contiguous()
NotImplementedError: Cannot copy out of meta tensor; no data!
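
The failure is in FSDP's param_init_fn step: train.py materializes meta-device modules with module.to_empty(), but hqq's HQQLinear overrides to_empty() to call .cuda(), which tries to copy W_q out of a meta tensor that has no data. Below is a minimal standalone sketch of the difference (illustrative only, using a plain nn.Linear rather than the HQQ layers, runnable on CPU):

import torch
import torch.nn as nn

# A module built on the meta device has parameter shapes but no storage.
lin = nn.Linear(4, 4, device="meta")

# nn.Module.to_empty() allocates fresh, uninitialized storage on the target
# device, so it works for meta tensors; recurse=False mirrors the call in
# train.py's param_init_fn.
lin.to_empty(device="cpu", recurse=False)
print(lin.weight.device)  # cpu

# By contrast, copying a meta tensor (roughly what the overridden HQQ path
# ends up doing via W_q.to(device).contiguous()) raises the error above.
w = torch.empty(4, 4, device="meta")
try:
    w.to("cpu").contiguous()
except NotImplementedError as e:
    print(e)  # Cannot copy out of meta tensor; no data!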

@Xynonners

Same issue here.

@catid

catid commented Apr 23, 2024

I think this is why the README says: "Pin commit to 72b2b641aadc44a7ded6b243915f90df3b3be385 for FSDP compatibility, until to_empty() method is fixed."
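
For reference, pinning to that commit would look something like this (assuming hqq is installed from the mobiusml/hqq GitHub repository; adjust the URL if your install source differs):

pip install git+https://github.com/mobiusml/hqq.git@72b2b641aadc44a7ded6b243915f90df3b3be385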

@Xynonners

> I think this is why the README says: "Pin commit to 72b2b641aadc44a7ded6b243915f90df3b3be385 for FSDP compatibility, until to_empty() method is fixed."

Seems to be fixed in commit 9e8928f.
