NF4 quantization of linear layers without LoRA applied #1119
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1119
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 0eb4ad7 with merge base b317c8f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
cc @msaroufim
Clarification question: does this mean that `quantize_base` is the boolean with which one specifies the "Q" in QLoRA, and that it is applied globally to all projections? (Versus how LoRA can be applied to just a subset of the q, k, v projections and not the output projection.)
EDIT: setting `lora_attn_modules` active across all 4 modules uses
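For concreteness, here is a minimal sketch of how the flag is typically passed; the builder name and keyword arguments are assumed from torchtune's LoRA builders and are illustrative, not taken from this PR:

```python
# Illustrative sketch only -- builder name/kwargs assumed, not part of this diff.
from torchtune.models.llama2 import lora_llama2_7b

model = lora_llama2_7b(
    lora_attn_modules=["q_proj", "v_proj"],  # LoRA adapters on a subset of projections
    lora_rank=8,
    lora_alpha=16,
    quantize_base=True,  # single switch: NF4-quantize the base linear weights
)
```

In other words, `lora_attn_modules` chooses where adapters go, while `quantize_base` is one boolean for base-weight quantization; this PR extends that quantization to linear layers that get no LoRA adapter at all.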
Rest of PR looks fine; we should implement FrozenNF4Linear as cleanly as possible though
self.weight.requires_grad_(False)
self.nf4_weight = to_nf4(self.weight.data)
# re-register self.weight as the nf4 weight, so that the nf4 weight
# shows up as expected in .parameters, state_dict, etc.
self.weight = torch.nn.Parameter(self.nf4_weight, requires_grad=False)
Could we avoid accessing `.data` of `self.weight`?
I'm curious if there's a way to use swap_tensors here to accomplish this more cleanly. https://pytorch.org/docs/stable/generated/torch.utils.swap_tensors.html cc @mikaylagawarecki any ideas?
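Roughly the kind of thing swap_tensors could look like here; this is only a sketch, assuming `torch.utils.swap_tensors` and torchao's `to_nf4` compose this way, and it is not taken from this PR:

```python
# Sketch: swap the module's fp weight for an NF4 Parameter in place, instead of
# reassigning self.weight or touching .data. Assumes torch.utils.swap_tensors
# and torchao's to_nf4; illustrative only.
import torch
from torchao.dtypes.nf4tensor import to_nf4

def swap_in_nf4_weight(linear: torch.nn.Linear) -> None:
    nf4_param = torch.nn.Parameter(to_nf4(linear.weight), requires_grad=False)
    # swap_tensors exchanges the two tensors' payloads, so linear.weight now holds
    # the NF4 data and shows up as such in .parameters() and state_dict()
    torch.utils.swap_tensors(linear.weight, nf4_param)
```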
@winglian Could you explain the rationale for the `.data` access?
Yeah can we do something similar to what's currently done in LoRALinear for this?
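For reference, the construction-time pattern being pointed at might look roughly like this; it is a sketch mirroring the spirit of LoRALinear's `quantize_base` handling, with names and signatures assumed rather than copied:

```python
# Sketch: quantize once at construction time and register the NF4 weight directly,
# so no already-registered Parameter needs requires_grad_/.data manipulation.
# Class name and constructor args are illustrative.
import torch
from torch import nn
from torchao.dtypes.nf4tensor import to_nf4

class FrozenNF4LinearSketch(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, dtype: torch.dtype = torch.bfloat16):
        super().__init__()
        weight = nn.Linear(in_dim, out_dim, bias=False, dtype=dtype).weight
        self.weight = nn.Parameter(to_nf4(weight), requires_grad=False)
```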
This is simply reverting the code that was deleted in https://github.com/pytorch/torchtune/pull/658/files#diff-74b0d911936ebd1d0e216004577afed27b84b6bfdff9c6a9a1a28f6fac054850L45. Removing the `.data` does seem to work as well, so I've checked in that change.
Also, what test cases can we write to verify functionality?
Overall the changes look reasonable to me. Re @janeyx99's testing comment, a couple things I think we should add here:
(1) a unit test for nf4_linear (can probably look at #465 for ideas)
(2) some kind of e2e test. I am thinking: run one of our QLoRA recipes before and after the change, confirm that (a) the peak memory reduces as expected, and (b) we see no regression in eval metrics on the resulting fine-tuned checkpoint.
It's also likely that this will break some of our existing QLoRA tests. So e.g. the values here would need to be updated
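On (1), a sketch of the shape such a unit test could take; the import path and constructor arguments are assumptions, not this PR's exact API:

```python
# Hypothetical unit-test sketch for FrozenNF4Linear: check the output shape and
# that the base weight stays frozen. Import path and constructor args assumed.
import torch
from torchtune.modules.low_precision import FrozenNF4Linear  # path assumed

def test_frozen_nf4_linear():
    torch.manual_seed(0)
    layer = FrozenNF4Linear(64, 128, device="cpu", dtype=torch.bfloat16)
    x = torch.randn(2, 64, dtype=torch.bfloat16)
    out = layer(x)
    assert out.shape == (2, 128)
    assert not layer.weight.requires_grad  # base weight must stay frozen
```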
Context
What is the purpose of this PR? Is it to
Please link to any issues this PR addresses. #1093
Changelog
Reverts #658 to bring back FrozenNF4Linear. When `quantize_base` is set to true, all base weights for linear layers are quantized, even if they do not have LoRA applied to them.
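A rough spot-check of the new behavior; the builder call and module attribute paths are assumed from torchtune's llama2 LoRA components, so treat this as a sketch rather than a test shipped with this PR:

```python
# Sketch: with quantize_base=True, projections WITHOUT LoRA should now also carry
# NF4 base weights. Builder name, kwargs, and attribute paths are assumptions.
from torchao.dtypes.nf4tensor import NF4Tensor
from torchtune.models.llama2 import lora_llama2_7b

model = lora_llama2_7b(
    lora_attn_modules=["q_proj", "v_proj"],  # no LoRA on k_proj / output_proj
    lora_rank=8,
    lora_alpha=16,
    quantize_base=True,
)
attn = model.layers[0].attn
assert isinstance(attn.q_proj.weight, NF4Tensor)       # LoRA-wrapped, NF4 base weight
assert isinstance(attn.output_proj.weight, NF4Tensor)  # no LoRA; NF4 base after this PR
```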
Test plan
Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these, just ask and we will happily help.)
- install and run pre-commit hooks (`pre-commit install`)
- run unit tests via `pytest tests`
- run recipe tests via `pytest tests -m integration_test`