
convert-hf-to-ggml.py CUDA out of memory #35

Open

Alex20129 opened this issue Mar 1, 2024 · 0 comments
I've tried to convert a model from HF to GGML format:

python3 convert-hf-to-ggml.py ../starcoderbase_int8

and got an error:

Loading model:  ../starcoderbase_int8
Loading checkpoint shards:   ...
Traceback (most recent call last):
  File "/home/alex/starcoder/starcoder.cpp/convert-hf-to-ggml.py", line 58, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16 if use_f16 else torch.float32, low_cpu_mem_usage=True, trust_remote_code=True, offload_state_dict=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2901, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3258, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 725, in _load_state_dict_into_meta_model
    set_module_quantized_tensor_to_device(model, param_name, param_device, value=param, fp16_statistics=fp16_statistics)
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/bitsandbytes.py", line 109, in set_module_quantized_tensor_to_device
    new_value = value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 576.00 MiB (GPU 0; 10.90 GiB total capacity; 9.21 GiB already allocated; 568.69 MiB free; 9.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
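The traceback suggests setting max_split_size_mb. As far as I can tell that only mitigates allocator fragmentation and wouldn't help when the weights simply don't fit in 11 GiB, but for reference this is the kind of invocation it is hinting at (the 128 value is an arbitrary choice on my part):

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
python3 convert-hf-to-ggml.py ../starcoderbase_int8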

Next, I tried to force it to run on the CPU:

export CUDA_VISIBLE_DEVICES=""
python3 convert-hf-to-ggml.py ../starcoderbase_int8

Then, I got this:

Loading model:  ../starcoderbase_int8
Traceback (most recent call last):
  File "/home/alex/starcoder/starcoder.cpp/convert-hf-to-ggml.py", line 58, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16 if use_f16 else torch.float32, low_cpu_mem_usage=True, trust_remote_code=True, offload_state_dict=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2370, in from_pretrained
    raise RuntimeError("No GPU found. A GPU is needed for quantization.")
RuntimeError: No GPU found. A GPU is needed for quantization.

For me, the main reason to go with the GGML implementation is that I can't fit the model in my GPU. I thought I could perform both the conversion and the inference using only the CPU and system RAM. Am I doing something specific wrong, or have I misunderstood this in general?
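For reference, here is roughly what I expected to work, sketched against the original unquantized checkpoint rather than the int8 one. The "../starcoderbase" path is hypothetical, and my reading that the int8 bitsandbytes weights require a GPU just to be loaded is an assumption based on the error above:

# Sketch only: load the non-quantized starcoderbase checkpoint on CPU
# in fp16, then run the converter against it. The int8 checkpoint seems
# to need CUDA because transformers refuses to load bitsandbytes
# quantized weights without a GPU ("A GPU is needed for quantization").
import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "../starcoderbase"  # hypothetical unquantized checkpoint
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype=torch.float16,   # fp16 weights in system RAM, no CUDA needed
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)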
