Partial Loading PR1: Tidy ModelCache #7492

RyanJDick · 2024-12-23T16:46:33Z

Summary

This PR tidies up the model cache code in preparation for further refactoring to support partial loading of models onto the GPU. These code changes should not change the functional behavior in any way.

Changes:

Remove the ModelCacheBase class. ModelCache is the only implementation, so there is no benefit to the separate abstract class.
Split CacheRecord and CacheStats out into their own files.
Remove the ModelLocker class. This extra layer of indirection was not providing any benefit. Locking is now done directly with the ModelCache.
Tidy up relative imports that were contributing to circular import issues.
Pull the 'submodel' concern out of the ModelCache. The ModelCache should not need to be aware of the model manager submodel system.
Delete unused properties from the ModelCache (e.g. .lazy_offloading and .storage_device).
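
As a rough illustration of the shape these changes aim for (not the actual InvokeAI code), the sketch below keeps cache-key construction outside the cache and puts locking directly on the cache. The names `ModelCache`, `CacheRecord`, `get_model_cache_key`, `lock` and `unlock` come from this PR, but every signature, field and the `key:submodel` format here are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


def get_model_cache_key(model_key: str, submodel_type: Optional[str] = None) -> str:
    # Assumed key format. The point is only that key construction happens
    # outside the cache, so the cache never needs to know about submodels.
    if submodel_type is None:
        return model_key
    return f"{model_key}:{submodel_type}"


@dataclass
class CacheRecord:
    key: str
    model: Any
    locks: int = 0  # number of callers currently using the model

    @property
    def is_locked(self) -> bool:
        return self.locks > 0


class ModelCache:
    """Toy cache: locking is done on the cache itself, with no ModelLocker layer."""

    def __init__(self) -> None:
        self._records: Dict[str, CacheRecord] = {}

    def put(self, key: str, model: Any) -> None:
        self._records[key] = CacheRecord(key=key, model=model)

    def get(self, key: str) -> CacheRecord:
        return self._records[key]

    def lock(self, record: CacheRecord) -> None:
        record.locks += 1

    def unlock(self, record: CacheRecord) -> None:
        if record.locks == 0:
            raise RuntimeError(f"Record {record.key!r} is not locked")
        record.locks -= 1


cache = ModelCache()
key = get_model_cache_key("flux-dev", "transformer")
cache.put(key, model=object())
record = cache.get(key)
```

A caller would wrap model use in `cache.lock(record)` / `cache.unlock(record)`; a locked record is one the cache should not offload.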

QA Instructions

I ran smoke tests with a variety of SD1, SDXL and FLUX models. No change to behavior is expected.

Merge Plan

Checklist

The PR has a short but descriptive title, suitable for a changelog
Tests added / updated (if applicable)
Documentation added / updated (if applicable)
Updated What's New copy (if doing a release after this PR)

…rom CPU to GPU (#7494)

## Summary

This PR adds utilities to support partial loading of models from CPU to GPU. The new utilities are not yet being used by the ModelCache, so there should be no functional behavior changes in this PR.

Detailed changes:

- Add autocast modules that are designed to wrap common `torch.nn.Module`s and enable them to run with automatic device casting. E.g. a linear layer on the CPU can be executed with an input tensor on the GPU by streaming the weights to the GPU at runtime.
- Add unit tests for the aforementioned autocast modules to verify that they work for all supported quantization formats (GGUF, BnB NF4, BnB LLM.int8()).
- Add `CachedModelWithPartialLoad` and `CachedModelOnlyFullLoad` classes to manage partial loading at the model level.

## Alternative Implementations

Several options were explored for supporting inference on partially-loaded models. The pros/cons of the explored options are summarized here for reference. In the end, wrapper modules were selected as the best overall solution for our use case.

Option 1: Re-implement the .forward() methods of modules to add support for device conversions

- This is the option implemented in this PR.
- This approach is the most manual of the three, but as a result offers the broadest compatibility with unusual model types. It is manual in that we have to explicitly add support for all module types that we wish to support. Fortunately, the list of foundational module types is relatively small (e.g. the current set of implemented layers covers all but 0.04 MB of the full FLUX model).

Option 2: Implement a custom Tensor type that casts tensors to a `target_device` each time the tensor is used

- This approach has the nice property that it is injected at the tensor level, and the model does not need to be modified in any way.
- One challenge with this approach is handling interactions with other custom tensor types (e.g. GGMLTensor). This problem is solvable, but definitely introduces a layer of complexity. (There are likely to also be some similar issues with interactions with the BnB quantization, but I didn't get as far as testing BnB.)

Option 3: Override the `__torch_function__` dispatch calls globally and cast all params to the execution device.

- This approach is nice and simple: just apply a global context manager and all operations will happen on the compute device regardless of the device of the participating tensors.
- Challenges:
  - Overriding the `__torch_function__` dispatch calls introduces some overhead even if the tensors are already on the correct device.
  - It is difficult to manage the autocasting context manager. E.g. it is tempting to apply it to the model's `.forward(...)` method, but we use some models with non-standard entrypoints. And we don't want to end up with nested autocasting context managers.
  - BnB applies quantization side effects when a param is moved to the GPU - this interacts in unexpected ways with a global context manager.

## QA Instructions

Most of the changes in this PR should not impact active code, and thus should not cause any changes to behavior. The main risks come from bumping the bitsandbytes dependency and some minor modifications to the bitsandbytes quantization code.

- [x] Regression test bitsandbytes NF4 quantization
- [x] Regression test bitsandbytes LLM.int8() quantization
- [x] Regression test on MacOS (to ensure that there are no lingering bitsandbytes import errors)

I also tested the new utilities for inference on full models in another branch to validate that there were not major issues. This functionality will be tested more thoroughly in a future PR.

## Merge Plan

- [x] #7492 should be merged first so that the target branch can be updated to main.

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
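
The wrapper-module idea from the #7494 summary above (a layer whose weights live on the CPU being run against an input on the GPU) can be sketched without torch. The example below is a toy model only: `ToyTensor`, `ToyLinear` and `AutocastToyLinear` are hypothetical stand-ins, and "moving" a tensor just changes a device tag. It is meant to show the control flow of Option 1, not the real autocast modules.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyTensor:
    """Hypothetical stand-in for a tensor: values plus a device tag."""

    values: Tuple[float, ...]
    device: str

    def to(self, device: str) -> "ToyTensor":
        # Returns a copy tagged with the target device (no-op if already there).
        return self if self.device == device else ToyTensor(self.values, device)


class ToyLinear:
    """Single-output linear layer that refuses to mix devices."""

    def __init__(self, weight: ToyTensor, bias: ToyTensor) -> None:
        self.weight = weight
        self.bias = bias

    def forward(self, x: ToyTensor) -> ToyTensor:
        if not (x.device == self.weight.device == self.bias.device):
            raise RuntimeError("device mismatch")
        y = sum(w * v for w, v in zip(self.weight.values, x.values)) + self.bias.values[0]
        return ToyTensor((y,), x.device)


class AutocastToyLinear(ToyLinear):
    """Re-implements forward() to cast parameters to the input's device."""

    def forward(self, x: ToyTensor) -> ToyTensor:
        # Temporary copies are "streamed" to the input device on each call;
        # the stored parameters stay where they were.
        weight = self.weight.to(x.device)
        bias = self.bias.to(x.device)
        return ToyLinear(weight, bias).forward(x)


weight = ToyTensor((1.0, 2.0), device="cpu")
bias = ToyTensor((0.5,), device="cpu")
x = ToyTensor((3.0, 4.0), device="cuda")

autocast_layer = AutocastToyLinear(weight, bias)
out = autocast_layer.forward(x)
```

The plain `ToyLinear` raises on the same input, while the autocast variant returns a result on the input's device and leaves its stored weights on the CPU.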

…s, 3) model patching (#7500)

## Summary

This PR is the third in a sequence of PRs working towards support for partial loading of models onto the compute device (for low-VRAM operation). This PR updates the LoRA patching code so that the following features can cooperate fully:

- Partial loading of weights onto the GPU
- Quantized layers / weights
- Model patches (e.g. LoRA)

Note that this PR does not yet enable partial loading. It adds support in the model patching code so that partial loading can be enabled in a future PR.

## Technical Design Decisions

The layer patching logic has been integrated into the custom layers (via `CustomModuleMixin`) rather than keeping it in a separate set of wrapper layers, as before. This has the following advantages:

- It makes it easier to calculate the modified weights on the fly and then reuse the normal forward() logic.
- In the future, it makes it possible to pass original parameters that have been cast to the device down to the LoRA calculation without having to re-cast (but the current implementation hasn't fully taken advantage of this yet).

## Known Limitations

1. I haven't fully solved device management for patch types that require the original layer value to calculate the patch. These aren't very common, and are not compatible with some quantized layers, so I'm leaving this for the future if there's demand.
2. There is a small speed regression for models that have CPU bottlenecks. This seems to be caused by slightly slower method resolution on the custom layer sub-classes. The regression does not show up on larger models, like FLUX, that are almost entirely GPU-limited. I think this small regression is tolerable, but if we decide that it's not, then the slowdown can easily be reclaimed by optimizing other CPU operations (e.g. if we only sent every 2nd progress image, we'd see a much more significant speedup).

## Related Issues / Discussions

- #7492
- #7494

## QA Instructions

Speed tests:

- Vanilla SD1 speed regression
  - Before: 3.156s (8.78 it/s)
  - After: 3.54s (8.35 it/s)
- Vanilla SDXL speed regression
  - Before: 6.23s (4.46 it/s)
  - After: 6.45s (4.31 it/s)
- Vanilla FLUX speed regression
  - Before: 12.02s (2.27 it/s)
  - After: 11.91s (2.29 it/s)

LoRA tests with default configuration:

- [x] SD1: A handful of LoRA variants
- [x] SDXL: A handful of LoRA variants
- [x] flux non-quantized: multiple lora variants
- [x] flux bnb-quantized: multiple lora variants
- [x] flux ggml-quantized: multiple lora variants
- [x] flux non-quantized: FLUX control LoRA
- [x] flux bnb-quantized: FLUX control LoRA
- [x] flux ggml-quantized: FLUX control LoRA

LoRA tests with sidecar patching forced:

- [x] SD1: A handful of LoRA variants
- [x] SDXL: A handful of LoRA variants
- [x] flux non-quantized: multiple lora variants
- [x] flux bnb-quantized: multiple lora variants
- [x] flux ggml-quantized: multiple lora variants
- [x] flux non-quantized: FLUX control LoRA
- [x] flux bnb-quantized: FLUX control LoRA
- [x] flux ggml-quantized: FLUX control LoRA

Other:

- [x] Smoke testing of IP-Adapter, ControlNet

All tests repeated on:

- [x] cuda
- [x] cpu (only test SD1, because larger models are prohibitively slow)
- [x] mps (skipped FLUX tests, because my Mac doesn't have enough memory to run them in a reasonable amount of time)

## Merge Plan

No special instructions.

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
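
The first design advantage in the #7500 description above (calculate the modified weights on the fly, then reuse the normal forward logic) can be sketched in plain Python. This is a toy under stated assumptions: `PatchMixin`, `ToyLinear` and the helper functions are hypothetical names, the patch is a standard low-rank LoRA update `W + scale * (up @ down)`, and none of it reflects the real `CustomModuleMixin` API.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    # Plain nested-loop matrix product: (n x k) @ (k x m) -> (n x m).
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add_scaled(weight: Matrix, delta: Matrix, scale: float) -> Matrix:
    # Builds a new matrix; the original weight is never mutated.
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


class ToyLinear:
    def __init__(self, weight: Matrix) -> None:
        self.weight = weight

    def _linear(self, weight: Matrix, x: List[float]) -> List[float]:
        return [sum(w * v for w, v in zip(row, x)) for row in weight]

    def forward(self, x: List[float]) -> List[float]:
        return self._linear(self.weight, x)


class PatchMixin:
    """Hypothetical mixin: applies low-rank patches at call time."""

    def add_patch(self, up: Matrix, down: Matrix, scale: float) -> None:
        self.__dict__.setdefault("_patches", []).append((up, down, scale))

    def forward(self, x: List[float]) -> List[float]:
        weight = self.weight
        for up, down, scale in getattr(self, "_patches", []):
            weight = add_scaled(weight, matmul(up, down), scale)
        # Reuse the layer's normal linear logic with the patched weight.
        return self._linear(weight, x)


class PatchedToyLinear(PatchMixin, ToyLinear):
    pass


layer = PatchedToyLinear([[1.0, 0.0], [0.0, 1.0]])
unpatched_output = layer.forward([3.0, 4.0])
layer.add_patch(up=[[1.0], [0.0]], down=[[0.0, 2.0]], scale=0.5)
patched_output = layer.forward([3.0, 4.0])
```

Because the patched weight is rebuilt on each call, the stored weight stays untouched, which is the property that lets patching coexist with weights that are quantized or only partially resident on the device.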

…#7522)

## Summary

This is an unplanned fix between PR3 and PR4 in the sequence of partial loading (i.e. low-VRAM) PRs. This PR restores the 'Current Workaround' documented in #7513. In other words, to work around a flaw in the model cache API, this fix allows models to be loaded into VRAM _even if_ they have been dropped from the RAM cache.

This PR also adds an info log each time that this workaround is hit. In a future PR (#7509), we will eliminate the places in the application code that are capable of triggering this condition.

## Related Issues / Discussions

- #7492
- #7494
- #7500
- #7513

## QA Instructions

- Set RAM cache limit to a small value. E.g. `ram: 4`
- Run FLUX text-to-image with the full T5 encoder, which exceeds 4GB. This will trigger the error condition.
- Before the fix, this test configuration would cause a `KeyError`. After the fix, we should see an info-level log explaining that the condition was hit, but that generation should continue successfully.

## Merge Plan

No special instructions.

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
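
The before/after behaviour described in #7522 above can be illustrated with a toy cache. This is only a sketch of the failure mode and the workaround as described: `ToyModelCache`, `lock_strict` and the log message are hypothetical, and the real cache API may differ.

```python
import logging
from dataclasses import dataclass
from typing import Any, Dict

logger = logging.getLogger("toy_model_cache")


@dataclass
class CacheRecord:
    key: str
    model: Any
    locks: int = 0


class ToyModelCache:
    def __init__(self) -> None:
        self._records: Dict[str, CacheRecord] = {}

    def put(self, key: str, model: Any) -> CacheRecord:
        record = CacheRecord(key=key, model=model)
        self._records[key] = record
        return record

    def drop(self, key: str) -> None:
        # Stands in for the RAM cache evicting a model to stay under its limit.
        self._records.pop(key, None)

    def lock_strict(self, record: CacheRecord) -> None:
        # "Before" behaviour: assumes the record is still in the RAM cache.
        self._records[record.key].locks += 1  # KeyError if it was dropped

    def lock(self, record: CacheRecord) -> None:
        # "After" behaviour: the caller still holds the record, so proceed
        # and leave an info-level log that the workaround was hit.
        if record.key not in self._records:
            logger.info("Model %r was dropped from the RAM cache; locking it anyway.", record.key)
        record.locks += 1


cache = ToyModelCache()
record = cache.put("t5-encoder", model=object())
cache.drop("t5-encoder")  # evicted before the caller got around to locking it
cache.lock(record)
```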

## Summary

This PR adds support for partial loading of models onto the GPU. This enables models to run with much lower peak VRAM requirements (e.g. full FLUX dev with 8GB of VRAM).

The partial loading feature is enabled behind a new config flag: `enable_partial_loading=True`. This flag defaults to `False`.

**Note about performance:**

The `ram` and `vram` config limits are still applied when `enable_partial_loading=True` is set. This can result in significant slowdowns compared to the 'old' behaviour. Consider the case where the VRAM limit is set to `vram=0.75` (GB) and we are trying to run an 8GB model. When `enable_partial_loading=False`, we attempt to load the entire model into VRAM, and if it fits (no OOM error) then it will run at full speed. When `enable_partial_loading=True`, since we have the option to partially load the model, we will only load 0.75 GB into VRAM and leave the remaining 7.25 GB in RAM. This will cause inference to be much slower than before. To work around this, it is important that your `ram` and `vram` configs are carefully tuned.

In a future PR, we will add the ability to dynamically set the RAM/VRAM limits based on the available memory / VRAM.

## Related Issues / Discussions

- #7492
- #7494
- #7500

## QA Instructions

Tests with `enable_partial_loading=True`, `vram=2`, on CUDA device:

For all tests, we expect model memory to stay below 2 GB. Peak working memory will be higher.

- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Tests with `enable_partial_loading=True`, and hack to force all models to load 10%, on CUDA device:

- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Tests with `enable_partial_loading=False`, `vram=30`:

We expect no change in behaviour when `enable_partial_loading=False`.

- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Other platforms:

- [x] No change in behavior on MPS, even if `enable_partial_loading=True`.
- [x] No change in behavior on CPU-only systems, even if `enable_partial_loading=True`.

## Merge Plan

- [x] Merge #7500 first, and change the target branch to main

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
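
The performance note in the partial-loading description above comes down to simple arithmetic, sketched here. `split_model_memory` is a hypothetical helper written for this example; it only models how many GB of model weights would sit in VRAM versus RAM, and ignores working memory and OOM handling.

```python
from typing import Tuple


def split_model_memory(
    model_gb: float, vram_limit_gb: float, enable_partial_loading: bool
) -> Tuple[float, float]:
    """Return (GB of weights placed in VRAM, GB of weights left in RAM)."""
    if not enable_partial_loading:
        # Old behaviour: attempt to load the whole model, regardless of the limit.
        return model_gb, 0.0
    in_vram = min(model_gb, vram_limit_gb)
    return in_vram, model_gb - in_vram


# The example from the note: an 8 GB model with vram=0.75.
partial = split_model_memory(8.0, 0.75, enable_partial_loading=True)
full = split_model_memory(8.0, 0.75, enable_partial_loading=False)
```

With partial loading enabled, only 0.75 GB lands in VRAM and 7.25 GB stays in RAM, which is why an overly conservative `vram` value that was harmless before can now slow inference considerably.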

## Summary

This PR enables RAM/VRAM cache size limits to be determined dynamically based on availability.

**Config Changes**

This PR modifies the app configs in the following ways:

- A new `device_working_mem_gb` config was added. This is the amount of non-model working memory to keep available on the execution device (i.e. GPU) when using dynamic cache limits. It defaults to 3GB.
- The `ram` and `vram` configs now default to `None`. If these configs are set, they will take precedence over the dynamic limits. **Note: Some users may have previously overridden the `ram` and `vram` values in their `invokeai.yaml`. They will need to remove these configs to enable the new dynamic limit feature.**

**Working Memory**

In addition to the new `device_working_mem_gb` config described above, memory-intensive operations can estimate the amount of working memory that they will need and request it from the model cache. This is currently applied to the VAE decoding step for all models. In the future, we may apply this to other operations as we work out which ops tend to exceed the default working memory reservation.

**Mitigations for #7513**

This PR includes some mitigations for the issue described in #7513. Without these mitigations, it would occur with higher frequency when dynamic RAM limits are used and the RAM is close to maxed-out.

## Limitations / Future Work

- Only _models_ can be offloaded to RAM to conserve VRAM. I.e. if VAE decoding requires more working VRAM than available, the best we can do is keep the full model on the CPU, but we will still hit an OOM error. In the future, we could detect this ahead of time and switch to running inference on the CPU for those ops.
- There is often a non-negligible amount of VRAM 'reserved' by the torch CUDA allocator, but not used by any allocated tensors. We may be able to tune the torch CUDA allocator to work better for our use case. Reference: https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf
- There may be some ops that require high working memory that haven't been updated to request extra memory yet. We will update these as we uncover them.
- If a model is 'locked' in VRAM, it won't be partially unloaded if a later model load requests extra working memory. This should be uncommon, but I can think of cases where it would matter.

## Related Issues / Discussions

- #7492
- #7494
- #7500
- #7505

## QA Instructions

Run a variety of models near the cache limits to ensure that model switching works properly for the following configurations:

- [x] CUDA, `enable_partial_loading=true`, all other configs default (i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=true`, CPU and CUDA memory reserved in another process so there is limited RAM/VRAM remaining, all other configs default (i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=false`, all other configs default (i.e. dynamic memory limits)
- [x] CUDA, ram/vram limits set (these should take precedence over the dynamic limits)
- [x] MPS, all other default (i.e. dynamic memory limits)
- [x] CPU, all other default (i.e. dynamic memory limits)

## Merge Plan

- [x] Merge #7505 first and change target branch to main

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
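
The precedence rule in the dynamic-limits description above (an explicit `vram` value wins; otherwise the limit is derived from what is available, minus a working-memory reservation) can be sketched as follows. The formula is an assumption made for illustration: the description does not spell out how the dynamic limit is computed, and `resolve_vram_cache_limit_gb`, `free_device_mem_gb` and `requested_working_mem_gb` are hypothetical names.

```python
from typing import Optional


def resolve_vram_cache_limit_gb(
    configured_vram_gb: Optional[float],
    free_device_mem_gb: float,
    device_working_mem_gb: float = 3.0,
    requested_working_mem_gb: float = 0.0,
) -> float:
    """Pick the VRAM budget for model weights, in GB."""
    if configured_vram_gb is not None:
        # An explicit `vram` setting takes precedence over the dynamic limit.
        return configured_vram_gb
    # Assumed rule: reserve the larger of the default working memory and
    # whatever a memory-intensive op (e.g. VAE decode) has asked for.
    reserve_gb = max(device_working_mem_gb, requested_working_mem_gb)
    return max(0.0, free_device_mem_gb - reserve_gb)


dynamic_limit = resolve_vram_cache_limit_gb(None, free_device_mem_gb=24.0)
explicit_limit = resolve_vram_cache_limit_gb(8.0, free_device_mem_gb=24.0)
vae_decode_limit = resolve_vram_cache_limit_gb(
    None, free_device_mem_gb=24.0, requested_working_mem_gb=5.0
)
```

Under these assumptions, a 24 GB device with default settings leaves 21 GB for model weights, and a user who still has `vram: 8` in `invokeai.yaml` stays pinned at 8 GB until that entry is removed.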

github-actions bot added the api, python, backend, services, python-tests and docs labels Dec 23, 2024

RyanJDick mentioned this pull request Dec 23, 2024

Partial Loading PR2: Add utils to support partial loading of models from CPU to GPU #7494

Merged

8 tasks

RyanJDick marked this pull request as ready for review December 23, 2024 23:32

RyanJDick requested review from lstein, blessedcoolant, brandonrising, hipsterusername and psychedelicious as code owners December 23, 2024 23:32

hipsterusername approved these changes Dec 24, 2024

RyanJDick enabled auto-merge (rebase) December 24, 2024 14:21

RyanJDick disabled auto-merge December 24, 2024 14:22

RyanJDick added 9 commits December 24, 2024 14:23

Rip out ModelLockerBase.

e48dee4

Move CacheRecord out to its own file.

ce11a19

Move CacheStats to its own file.

83ea642

Remove ModelCacheBase.

e0bfa61

Rename model_cache_default.py -> model_cache.py.

d30a9ce

Pull get_model_cache_key(...) out of ModelCache. The ModelCache should not be concerned with implementation details like the submodel_type.

a7c7299

Move lock(...) and unlock(...) logic from ModelLocker to the ModelCache and make a bunch of ModelCache properties/methods private.

a39bcf7

Get rid of ModelLocker. It was an unnecessary layer of indirection.

7dc3e0f

(minor) Add TODO comment regarding the location of get_model_cache_key().

55b13c1

RyanJDick force-pushed the ryan/model-offload-1-tidy branch from 6b2a3b2 to 55b13c1 December 24, 2024 14:23

RyanJDick merged commit d3916db into main Dec 24, 2024
15 checks passed

RyanJDick deleted the ryan/model-offload-1-tidy branch December 24, 2024 14:30

RyanJDick mentioned this pull request Dec 29, 2024

Partial Loading PR3: Integrate 1) partial loading, 2) quantized models, 3) model patching #7500

Merged

24 tasks

RyanJDick mentioned this pull request Dec 31, 2024

Partial Loading PR4: Enable partial loading (behind config flag) #7505

Merged

25 tasks

This was referenced Jan 2, 2025

Partial Loading PR5: Dynamic cache ram/vram limits #7509

Merged

Partial Loading PR 3.5: Fix pre-mature model drops from the RAM cache #7522

Merged
