not run #135
Comments
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

#Load the model
model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_1bitgs8_hqq'

#Define the device before using it
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

#Move the model to the selected device
model.to(device)

#Setup Inference Mode
tokenizer.add_bos_token = False

#Optional: torch compile for faster inference
model = torch.compile(model)  # You might want to enable this for potential speedup

def chat_processor(chat, max_new_tokens=100, do_sample=True, device='cuda'):

#Now you can call the function:
results = chat_processor("What is the solution to x^2 - 1 = 0", max_new_tokens=100, device=device)

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
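Note that this snippet calls model.to(device) and sets tokenizer.add_bos_token without ever creating model or tokenizer; the loading step was presumably dropped from the paste. A minimal version of that step, assuming the same hqq.engine.hf API shown later in this thread:

import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

model_id  = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_1bitgs8_hqq'
model     = HQQModelForCausalLM.from_quantized(model_id, adapter='adapter_v0.1.lora')  #loads the pre-quantized weights
tokenizer = AutoTokenizer.from_pretrained(model_id)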
Colab T4
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

#Device configuration
device = 'cuda' if torch.cuda.is_available() else 'cpu'

#Load the quantized model
quantized_model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_1bitgs8_hqq'

#Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(quantized_model_id)

def debug_tensor_devices(inputs):

def chat_processor(chat, current_model, current_tokenizer, max_new_tokens=100, do_sample=True, device=device):

#Test with explicit error handling
question = "What is 2 + 2?"

Fetching 9 files: 100%
Starting chat_processor with device: cuda
Input tensor devices:
Generation parameters devices:
User: What is 2 + 2?
Error during processing:
Full traceback:
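The debug_tensor_devices helper above is pasted without a body. A minimal sketch of what such a device-debugging helper might look like (the body below is illustrative, not the poster's original code):

import torch

def debug_tensor_devices(inputs):
    # Print the device of every tensor in a dict of model inputs;
    # useful for tracking down cpu/cuda mismatches before calling generate().
    for name, value in inputs.items():
        if torch.is_tensor(value):
            print(f"{name}: {value.device}")

Calling it on the tokenizer output, e.g. debug_tensor_devices(tokenizer(question, return_tensors="pt")), shows which tensors are still on the CPU.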
What is the solution?
All ideas for 1-bit, 2-bit never work.
Hi, sorry, I don't understand. What is the problem exactly?
Which versions are needed, where is the code used, and does it run on a Colab T4?
It runs very well in Colab T4!
mobiuslabsgmbh/Llama-3-8b-instruct_2bitgs64_hqq never runs.
Hi, can you please explain what the problem is?
I uploaded the Colab pages; the models work for Llama 2 and for the kvv model. The problem is with the Llama 3 model, and I don't know where the problem is.
Is there complete Python code to run HQQ models without library and version problems? With the Llama 2 model it worked because of the specific library versions, and the code was complete.
Fetching 9 files: 0%| | 0/9 [00:00<?, ?it/s]
Try this:

import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import *
from hqq.utils.patching import *
from hqq.utils.generation_hf import HFGenerator
#Load the model
###################################################
model_id = 'mobiuslabsgmbh/Llama-3-8b-instruct_2bitgs64_hqq'
model = AutoHQQHFModel.from_quantized(model_id, cache_dir='.', compute_dtype=torch.float16, adapter='adapter_v0.1.lora')
tokenizer = AutoTokenizer.from_pretrained(model_id)
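To generate with the model once it is loaded, the snippets later in this thread use HFGenerator (already imported above). A short sketch along those lines (prompt and max_new_tokens are placeholders):

gen = HFGenerator(model, tokenizer, max_new_tokens=256, do_sample=True, compile="partial").warmup()  #Faster generation, but warm-up takes a while
gen.generate("What is the result of the following addition operation 34+67?", print_tokens=True)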
I have updated the doc for all the models. Basically, the old models need the following versions in order to work:
The newer models would use
Thank you, I will try and let you know the results.
It worked, thank you. But the problem is in the back-end. I have uploaded a Colab T4 page for you; if possible, I will complete the code that you sent. I am working on Colab T4, which does not support flash attention.
from hqq.utils.patching import prepare_for_inference
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)

100%|██████████| 225/225 [00:00<00:00, 8980.70it/s]
https://github.com/werruww/hqq-/blob/main/succ_hqq.ipynb

import torch

#Load the model
from hqq.utils.patching import prepare_for_inference
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)

gen = HFGenerator(model, tokenizer, max_new_tokens=5, do_sample=True, compile="partial").warmup() #Faster generation, but warm-up takes a while
gen.generate("What is the result of the following addition operation 34+67?", print_tokens=True)
from hqq.utils.patching import prepare_for_inference
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)

I made this part like this so it doesn't cause problems. What do you think?
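For context, a sketch of the backend choices involved here (the names come from the hqq package; on a T4, which lacks flash-attention support, the pure PyTorch backends are the safe choice):

from hqq.core.quantize import HQQLinear, HQQBackend

#Pure PyTorch dequantization path: works on any GPU, including Colab's T4.
HQQLinear.set_backend(HQQBackend.PYTORCH)

#The same path wrapped in torch.compile, as used in the comment above.
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)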
16m 57s on Colab T4
import torch

#Load the model
patch_linearlayers(model, patch_add_quant_config,
model.eval();

#Use optimized inference kernels

#Generate
gen.generate("Write an essay about large language models", print_tokens=True)

Warning: failed to import the Marlin backend. Check if marlin is correctly installed if you want to use the Marlin backend (https://github.com/IST-DASLab/marlin).
import torch

#Load the model
from hqq.utils.patching import prepare_for_inference
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)

#Warmup
import transformers

def chat_processor(chat, max_new_tokens=100, do_sample=True):

################################################################################################
Setting
import torch

#Load the model
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)

#Warmup
def chat_processor(chat, max_new_tokens=100, do_sample=True):

################################################################################################

It works but its answers are very bad.
Thanks for reporting. I can indeed reproduce the issue and just made a fix. Please use the new code here: https://huggingface.co/mobiuslabsgmbh/Llama-3-8b-instruct_2bitgs64_hqq
import torch

#Settings
#Load the model
#Use optimized inference kernels

#Generate
gen.generate("Write an essay about large language models", print_tokens=True)

/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
It runs very well, thank you mobicham.
Colab T4
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
#Load the model
model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_1bitgs8_hqq'
model = HQQModelForCausalLM.from_quantized(model_id, adapter='adapter_v0.1.lora')
tokenizer = AutoTokenizer.from_pretrained(model_id)
#Setup Inference Mode
tokenizer.add_bos_token = False
tokenizer.add_eos_token = False
if not tokenizer.pad_token: tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.config.use_cache = True
model.eval();
#Optional: torch compile for faster inference
model = torch.compile(model)
#Streaming Inference
import torch, transformers
from threading import Thread
def chat_processor(chat, max_new_tokens=100, do_sample=True, device='cuda'):
tokenizer.use_default_system_prompt = False
streamer = transformers.TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
outputs = chat_processor("What is the solution to x^2 - 1 = 0", max_new_tokens=1000).to(cuda)
Exception in thread Thread-17 (generate):
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2215, in generate
result = self._sample(
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 3206, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1190, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 921, in forward
position_embeddings = self.rotary_emb(hidden_states, position_ids)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 158, in forward
freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
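This RuntimeError means the encoded prompt tensors and the model weights ended up on different devices (cpu vs cuda:0); note also that cuda is not a defined variable in chat_processor(...).to(cuda). A minimal sketch of the usual fix, moving the inputs onto the model's device before calling generate (the prompt handling below is illustrative and reuses the model and tokenizer loaded above, not the model card's chat template):

import torch

prompt = "What is the solution to x^2 - 1 = 0"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  #keep inputs on the same device as the model

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1000, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))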