
quantization: process tensors on meta device directly, maybe implement CPU quantization (if it is easy) #1111

Open
t-vi opened this issue Sep 6, 2024 · 4 comments · May be fixed by #1190


t-vi commented Sep 6, 2024

Currently, BitsAndBytesLinearQuant4bit always calls bitsandbytes.functional.quantize_4bit when quantizing a submodule's weight. This is somewhat touchy for CPU tensors, because quantize_4bit only works on GPU tensors, and it is outright wasteful for meta tensors, where all we would need are the right output shapes.

def quantize_weight(self, w):
    # todo: revisit staying on CPU when bnb supports it
    if w.device.type == "meta":
        # meta tensors carry no data; materialize a zero tensor on the GPU just to get shapes
        w_work = torch.zeros_like(w, device="cuda")
    elif w.device.type != "cuda":
        # quantize_4bit only supports GPU tensors, so copy CPU weights over first
        with torch.no_grad():
            w_work = w.to("cuda")
    else:
        w_work = w
    return bitsandbytes.functional.quantize_4bit(w_work, quant_type="nf4")

t-vi added the good first issue and transforms labels on Sep 6, 2024

tombawor commented Sep 8, 2024

@t-vi
Should we reshape the meta result into a two-dimensional torch.uint8 tensor, like the GPU result?
Should we use PyTorch's 8-bit quantization for the CPU path?


tombawor commented Sep 9, 2024

Should we implement a dedicated QuantState class for meta and CPU, so we return the tensor along with its corresponding quantization state as we do for GPU?


t-vi commented Sep 10, 2024

Hi @tombawor, thank you for your interest.

I don't think we need a new class, just functions to complement bitsandbytes.functional.quantize_4bit(w, quant_type="nf4") for meta and cpu inputs (returning a tensor on w.device and a quant state whose tensors are on w.device).
Ideally, the quantize_weight function should have exactly the same inputs and outputs, except that all tensors stay on the device they are on, i.e. the same shapes and quant state as if we had called bitsandbytes.functional.quantize_4bit.
We could also offer it to bitsandbytes if they're interested.
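A rough sketch of what the meta-device complement could look like, assuming the packed layout that current bitsandbytes uses for 4-bit quantization (a ((n + 1) // 2, 1) torch.uint8 tensor plus blockwise absmax) and that bitsandbytes.functional.QuantState can be constructed directly; the helper name quantize_4bit_meta and the blocksize default are illustrative, not an existing API:

import torch
import bitsandbytes

def quantize_4bit_meta(w: torch.Tensor, quant_type: str = "nf4", blocksize: int = 64):
    # Hypothetical helper: produce outputs with the same shapes/dtypes as
    # bitsandbytes.functional.quantize_4bit, but without any data movement,
    # so everything stays on the meta device.
    n = w.numel()
    # quantize_4bit packs two 4-bit values per byte into a ((n + 1) // 2, 1) uint8 tensor
    packed = torch.empty(((n + 1) // 2, 1), dtype=torch.uint8, device=w.device)
    # one absmax entry per block of `blocksize` elements
    n_blocks = (n + blocksize - 1) // blocksize
    absmax = torch.empty((n_blocks,), dtype=torch.float32, device=w.device)
    # the real quantize_4bit also stores the nf4 code table in the state; omitted here
    quant_state = bitsandbytes.functional.QuantState(
        absmax=absmax,
        shape=w.shape,
        blocksize=blocksize,
        quant_type=quant_type,
        dtype=w.dtype,
    )
    return packed, quant_state

quantize_weight could then dispatch on w.device.type: this sketch for meta inputs, a CPU analogue that actually computes values once one exists, and the unchanged bitsandbytes.functional.quantize_4bit for CUDA inputs.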

@tombawor

There's a multi-backend effort under way for bitsandbytes, currently in an alpha release.
This is the CPU implementation from bitsandbytes.

tombawor linked a pull request on Sep 23, 2024 that will close this issue.