
quantization: process tensors on meta device directly, maybe implement CPU quantization (if it is easy) #1111

Open
t-vi opened this issue Sep 6, 2024 · 4 comments · May be fixed by #1190


t-vi commented Sep 6, 2024

Currently, BitsAndBytesLinearQuant4bit always calls bitsandbytes.functional.quantize_4bit when quantizing a submodule's weight. This is somewhat touchy for CPU tensors, because quantize_4bit only works on GPU tensors, and it is outright wasteful for meta tensors, where all we would need are the right output shapes.

def quantize_weight(self, w):
    # todo: revisit staying on CPU when bnb supports it
    if w.device.type == "meta":
        # meta tensors carry no data; materialize a zero tensor on the GPU just to get shapes
        w_work = torch.zeros_like(w, device="cuda")
    elif w.device.type != "cuda":
        # quantize_4bit only supports GPU tensors, so copy CPU weights over first
        with torch.no_grad():
            w_work = w.to("cuda")
    else:
        w_work = w
    return bitsandbytes.functional.quantize_4bit(w_work, quant_type="nf4")

t-vi added the good first issue and transforms labels on Sep 6, 2024

tombawor commented Sep 8, 2024

@t-vi
Should we reshape the meta result into a two-dimensional torch.uint8 tensor, like the GPU result?
Should we use PyTorch's 8-bit quantization for the CPU path?


tombawor commented Sep 9, 2024

Should we implement a dedicated QuantState class for meta and CPU, so we return the tensor along with its corresponding quantization state as we do for GPU?


t-vi commented Sep 10, 2024

Hi @tombawor, thank you for your interest.

I don't think we need a new class, just functions to complement bitsandbytes.functional.quantize_4bit(w, quant_type="nf4") for meta and cpu inputs (returning a tensor on w.device and a quant state whose tensors are on w.device).
Ideally, the quantize_weight function should have exactly the same inputs and outputs, except that all tensors stay on the device they are on, i.e. the same shapes and quant state as if we had called bitsandbytes.functional.quantize_4bit.
We could also offer it to bitsandbytes if they're interested.
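A rough sketch of what the meta-device complement could look like, assuming the packed layout that current bitsandbytes uses for 4-bit quantization (a ((n + 1) // 2, 1) torch.uint8 tensor plus blockwise absmax) and that bitsandbytes.functional.QuantState can be constructed directly; the helper name quantize_4bit_meta and the blocksize default are illustrative, not an existing API:

import torch
import bitsandbytes

def quantize_4bit_meta(w: torch.Tensor, quant_type: str = "nf4", blocksize: int = 64):
    # Hypothetical helper: produce outputs with the same shapes/dtypes as
    # bitsandbytes.functional.quantize_4bit, but without any data movement,
    # so everything stays on the meta device.
    n = w.numel()
    # quantize_4bit packs two 4-bit values per byte into a ((n + 1) // 2, 1) uint8 tensor
    packed = torch.empty(((n + 1) // 2, 1), dtype=torch.uint8, device=w.device)
    # one absmax entry per block of `blocksize` elements
    n_blocks = (n + blocksize - 1) // blocksize
    absmax = torch.empty((n_blocks,), dtype=torch.float32, device=w.device)
    # the real quantize_4bit also stores the nf4 code table in the state; omitted here
    quant_state = bitsandbytes.functional.QuantState(
        absmax=absmax,
        shape=w.shape,
        blocksize=blocksize,
        quant_type=quant_type,
        dtype=w.dtype,
    )
    return packed, quant_state

quantize_weight could then dispatch on w.device.type: this sketch for meta inputs, a CPU analogue that actually computes values once one exists, and the unchanged bitsandbytes.functional.quantize_4bit for CUDA inputs.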

@tombawor

There's a multi-backend effort under way for bitsandbytes, currently in an alpha release.
This is the CPU implementation from bitsandbytes.

tombawor linked a pull request on Sep 23, 2024 that will close this issue.