[Questions]: How to implement NF4/NF2 matmul kernel function? #286

Open
llCurious opened this issue Jan 24, 2024 · 1 comment

Comments

@llCurious

Hi @TimDettmers,

The paper shows that you quantize the weights to 2/4 bits using the NF format. I wonder how you handle the input activations (denoted as x). Is x also quantized to 2/4 bits?

If not, do you dequantize the quantized weights back to a float format and perform the matmul with float kernels?
That would seem to slow down inference, despite the lower memory footprint.
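For reference, here is a minimal sketch (in plain PyTorch, not the actual bitsandbytes CUDA kernel) of what blockwise NF4 dequantization amounts to: each 4-bit code indexes a fixed 16-entry codebook, and every block of weights is rescaled by its stored absmax. The `dequantize_nf4` helper and the rounded codebook values below are illustrative assumptions, not the library's API.

```python
import torch

# Approximate NF4 codebook from the QLoRA paper: 16 quantile levels of a
# standard normal distribution, normalized to the range [-1, 1].
NF4_CODEBOOK = torch.tensor([
    -1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0000,
     0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7230, 1.0000,
])

def dequantize_nf4(codes: torch.Tensor, absmax: torch.Tensor, block_size: int = 64) -> torch.Tensor:
    """codes: flat tensor of 4-bit indices (values 0..15), length divisible by block_size.
    absmax: per-block scale, one entry per block of `block_size` weights."""
    values = NF4_CODEBOOK[codes.long()]                         # code -> quantile value
    values = values.view(-1, block_size) * absmax.view(-1, 1)   # rescale each block
    return values.view(-1)
```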

@neoweasley

The paper "QLORA: Efficient Finetuning of Quantized LLMs" says "In practice, this means whenever a QLORA weight tensor is used, we dequantize the tensor to BFloat16, and then perform a matrix multiplication in 16-bit." As far as i'm concerned, NF16 and Double Quantization may slow down the training and inference efficiency, which means this work sacrifices time and performance for memory. However, this paper reports "24hours of finetuning on a single GPU", and I wonder how.
