This PR aims to add support for lazy model initialization. This is one of the two steps to lower the CPU memory usage for quantized models. Quantization is currently implemented by replacing regular linear layers with quantized linear layers. Without lazy init, the full-precision model that exists before the replacement causes a huge peak memory usage, making both training and inference hard to run on commodity hardware even with aggressive quantization. For example, the 4-bit 13B model, which theoretically needs only 6.5GB of memory and fits comfortably in any mainstream PC, currently requires 52GB of memory (full-precision model plus full-precision checkpoint); and the 4-bit 70B model, which theoretically needs 35GB of memory and fits in two 3090s, currently requires 280GB of memory, which is only feasible on some expensive HEDT and server platforms.
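These figures follow from simple per-parameter arithmetic (the breakdown below is my assumption: 2-byte full-precision weights plus an equally sized checkpoint copy held in memory, versus 0.5 bytes per parameter at 4-bit):

```python
# Back-of-the-envelope check of the figures above, assuming 2-byte (bf16)
# full-precision weights plus an equally sized checkpoint copy in memory.
GB = 1e9
print(13e9 * 0.5 / GB, 13e9 * (2 + 2) / GB)  # 6.5  52.0   (13B: 4-bit vs. full precision)
print(70e9 * 0.5 / GB, 70e9 * (2 + 2) / GB)  # 35.0 280.0  (70B: 4-bit vs. full precision)
```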
With lazy init, the model creation steps become: (1) create a placeholder model without allocating any actual storage, (2) replace layers with quantized ones, and (3) instantiate all tensors. This way, we do not need to manually re-implement a quantized version of each (current or future) model, and only the amount of storage needed after quantization is ever actually allocated.
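Step (1) relies on PyTorch's meta device: parameters created there carry shapes and dtypes but no data, so even a huge full-precision skeleton costs essentially nothing. A quick illustration (not code from this PR):

```python
import torch
import torch.nn as nn

# Parameters created under the meta device have shapes/dtypes but no storage,
# so building the full-precision "skeleton" consumes essentially no memory.
with torch.device("meta"):
    layer = nn.Linear(8192, 8192)

print(layer.weight.is_meta)  # True
print(layer.weight.shape)    # torch.Size([8192, 8192])
# layer.weight has no backing data; reading its values is not possible yet.
```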
However, supporting lazy init turns out to be a complicated task, as PyTorch essentially provides no good way to decouple model creation from weight initialization at the moment. Although tensors can be created as meta tensors, there seems to be no reliable way to initialize them afterwards: the fairscale layers tend to initialize their weights in `__init__` and simply do not provide a separate method to initialize the weights after creation; and even though most PyTorch built-in layers do provide `reset_parameters` methods as of v2.0.1, these usually do not support custom initialization (e.g., LoRA needs zero init, but `torch.nn.Linear.reset_parameters` always initializes the weights randomly from a uniform distribution).
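To make the LoRA case concrete: what we need is a module-owned `reset_parameters` that knows about its own custom initialization, roughly like the hypothetical `LoraLinear` below (illustrative only, not the actual layer in this repo):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoraLinear(nn.Linear):
    """Illustrative LoRA linear whose reset_parameters knows the custom
    (zero) initialization its extra low-rank weights require."""

    def __init__(self, in_features, out_features, lora_rank=16, **kwargs):
        super().__init__(in_features, out_features, **kwargs)
        self.lora_a = nn.Parameter(torch.empty(lora_rank, in_features))
        self.lora_b = nn.Parameter(torch.empty(out_features, lora_rank))
        self.reset_lora_parameters()

    def reset_lora_parameters(self):
        nn.init.kaiming_uniform_(self.lora_a, a=math.sqrt(5))
        nn.init.zeros_(self.lora_b)  # zero init: the LoRA branch starts as a no-op

    def reset_parameters(self):
        # Re-run after materializing meta tensors: the base weight gets the
        # usual uniform init, the LoRA factors get the init LoRA actually needs.
        super().reset_parameters()
        if hasattr(self, "lora_a"):  # nn.Linear.__init__ calls this before lora_a exists
            self.reset_lora_parameters()

    def forward(self, x):
        return super().forward(x) + F.linear(F.linear(x, self.lora_a), self.lora_b)
```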
Facing such a dilemma, I am trying to follow the lazy-init implementation of PyTorch FSDP: rely on the `reset_parameters` method of each module that directly manages parameters and buffers, with the heavy lifting left to implementing `reset_parameters` for every module we use that does not already have a working one. After the change, the model creation process is supposed to look roughly like the following:
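In pseudocode (`build_lazily` and `quantize_fn` are placeholder names, not the final API):

```python
import torch

def build_lazily(model_cls, model_args, quantize_fn):
    # (1) Create the model on the meta device: shapes only, no storage.
    with torch.device("meta"):
        model = model_cls(model_args)

    # (2) Swap regular linear layers for quantized ones while everything is
    #     still meta, so only post-quantization sizes will ever be allocated.
    quantize_fn(model)  # should also create the new layers on the meta device

    # (3) Materialize: allocate real (empty) storage, then let every module
    #     that directly owns parameters/buffers initialize itself, FSDP-style.
    #     (A real implementation would skip modules whose weights were already
    #     loaded for real, e.g., the visual backbone.)
    model = model.to_empty(device="cpu")
    for module in model.modules():
        owns_state = next(module.parameters(recurse=False), None) is not None \
            or next(module.buffers(recurse=False), None) is not None
        if owns_state and hasattr(module, "reset_parameters"):
            module.reset_parameters()
    return model
```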
Following this plan, the proposed code change is roughly organized into the following parts:

1. Give each module we use that lacks one (e.g., the fairscale layers and our custom layers) a working `reset_parameters` method.
2. Extend `default_tensor_type` to support meta tensor creation (a sketch is included at the end of this description). Disable meta tensor creation around visual backbones (for loading their weights; it may be problematic for large vision models, which we may discuss later).
3. Add the new model creation logic (create the model as meta, replace layers, then materialize and initialize everything via `reset_parameters`).
4. Refactor the model code so that every module used has `reset_parameters` implemented; change the training / inference entry scripts to use the new model creation logic.

This PR is going to involve an extensive code refactor and needs thorough testing, so it is marked as a draft for now.
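For reference, the `default_tensor_type` extension mentioned in item 2 could look roughly like the sketch below; the current helper's exact signature may differ, the `meta` flag is the proposed addition, and `Transformer` / `create_visual_backbone` in the usage comments are placeholders:

```python
import contextlib
import torch

@contextlib.contextmanager
def default_tensor_type(dtype=torch.float32, device="cuda", meta=False):
    """Illustrative version of the helper: besides the existing dtype/device
    handling, a `meta` flag routes all tensor creation to the meta device."""
    old_dtype = torch.get_default_dtype()
    torch.set_default_dtype(dtype)
    try:
        # torch.device(...) works as a context manager since PyTorch 2.0 and
        # redirects factory functions (torch.empty, nn.Linear weights, ...).
        with torch.device("meta" if meta else device):
            yield
    finally:
        torch.set_default_dtype(old_dtype)

# Usage: build the language model lazily, but keep the visual backbone real.
# with default_tensor_type(dtype=torch.bfloat16, meta=True):
#     model = Transformer(args)                 # meta parameters, no storage yet
# ...inside the model, around the visual backbone:
# with default_tensor_type(dtype=torch.bfloat16, meta=False):
#     self.visual = create_visual_backbone()    # real weights, loaded eagerly
```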