
Load single-file checkpoints directly without conversion #6510

Merged: 19 commits merged into main from lstein/feat/load-one-file on Jun 27, 2024

Conversation

@lstein (Collaborator) commented Jun 12, 2024

Summary

This PR takes advantage of the new diffusers from_single_file() method to load single-file checkpoints (e.g. .safetensors files) directly. It completely removes the convert cache directory, avoiding model duplication and conserving the user's disk space. Single-file loaders for main models, VAEs, and controlnets have been implemented.
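
For reference, a minimal sketch of the underlying diffusers calls (paths are placeholders; InvokeAI's model manager wraps these in its own loader classes rather than invoking them directly like this):

```python
# Minimal sketch of the diffusers single-file loading API this PR builds on.
# Paths are placeholders; InvokeAI's model manager wraps these calls in its
# own loader classes rather than invoking them directly like this.
from diffusers import AutoencoderKL, ControlNetModel, StableDiffusionPipeline

# An entire SD pipeline, loaded from one .safetensors checkpoint:
pipe = StableDiffusionPipeline.from_single_file("v1-5-pruned-emaonly.safetensors")

# Standalone VAE and controlnet checkpoints load the same way:
vae = AutoencoderKL.from_single_file("vae-ft-mse-840000-ema-pruned.safetensors")
cnet = ControlNetModel.from_single_file("control_v11p_sd15_canny.safetensors")
```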

Manual conversion from single files to diffusers directories is still supported. As before, this is a one-way operation which deletes the original .safetensors file (if it is in the models directory).

Loading is fast, with no discernible difference in rendering performance. The main drawback of single-file loading is that the entire pipeline is loaded into RAM at once, causing a surge in RAM usage compared to loading a single diffusers submodel. However, once the pipeline is in RAM, its submodels are cached individually, and subsequent load calls return the cached submodel for as long as it remains in the cache. There may be a performance penalty if the user has set the RAM cache size too small to hold a whole pipeline; under those circumstances a single-file pipeline will be re-read from disk each time it is needed.
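
Conceptually, the caching behavior works something like this (a simplified, hypothetical sketch; none of these names are InvokeAI's actual cache API):

```python
# Simplified, hypothetical sketch of per-submodel caching after a
# single-file load; none of these names are InvokeAI's real API.
from types import SimpleNamespace

def load_whole_pipeline(path: str) -> SimpleNamespace:
    """Stand-in for a from_single_file() call that loads everything at once."""
    return SimpleNamespace(unet="<unet>", vae="<vae>", text_encoder="<te>")

class SubmodelCache:
    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], object] = {}

    def get(self, path: str, submodel: str) -> object:
        if (path, submodel) not in self._entries:
            # A single-file load pulls the whole pipeline into RAM at once
            # (the RAM surge), but each component is then cached separately,
            # so later requests for any submodel are served from the cache.
            pipe = load_whole_pipeline(path)
            for name in ("unet", "vae", "text_encoder"):
                self._entries[(path, name)] = getattr(pipe, name)
        return self._entries[(path, submodel)]
```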

It might be a good idea for the manager to produce a warning message when the cache size is too small to hold a whole pipeline. Let me know if you think this should be implemented.

Related Issues / Discussions

See the Discord thread starting at: https://discord.com/channels/1020123559063990373/1049495067846524939/1239774642449813514

QA Instructions

  1. Be aware that this PR will delete the models/.convert_cache directory, migrate the database schema to v12, and upgrade the config schema to 4.0.2. These migrations remove the convert cache directory and the convert_cache size configuration option. They do not remove the convert_cache_dir configuration option, because the database migration needs to know where the user stored the converted files in order to delete them (see the sketch after this list).
  2. Confirm that single-file model checkpoints and VAEs load properly.
  3. Confirm that the "Convert" button in the model manager UI works as expected.
  4. Confirm that no issues arise when switching among models or setting the ram configuration option to a low value.
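
For reviewers curious about the mechanics of step 1, the migration must read the old config value before deleting the directory, roughly like this (a hypothetical sketch; the names do not match the PR's actual migration code):

```python
# Hypothetical sketch of the v12 migration described in step 1 above;
# function and attribute names are illustrative, not the PR's actual code.
import shutil
from pathlib import Path

def migrate_to_v12(config) -> None:
    # convert_cache_dir must still exist in the config at migration time,
    # which is why this PR keeps that option while dropping convert_cache.
    cache_dir = Path(config.convert_cache_dir)  # e.g. models/.convert_cache
    if cache_dir.exists():
        shutil.rmtree(cache_dir)
```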

Merge Plan

Squash merge when approved.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)

@github-actions bot added labels api, python, backend, services, python-tests on Jun 12, 2024
@RyanJDick (Collaborator) left a comment

I did a first pass over the code. I still need to find some time to test it all.


The issue that you called out where the DB migration requires the convert_cache_path config value is quite awkward. Generally, this will be an issue anytime a migration depends on a config value.

A few general strategies come to mind:

  1. If a migration references a config value, then to be rigorous we should never remove that config value. Of course, we could remove it as part of a hard cutoff at some point in the future.
  2. Use our current migration pattern to handle both config / DB migrations in the same place. I.e. it would become more of an app migration system rather than a DB migration system. We already do some file cleanup in our migrations, so it's not a huge stretch to handle config migrations there as well. I haven't looked at the implications for the app launch logic though.
  3. Instruct users on how to manually remove their convert cache rather than doing it in a migration.

What do you think about all of this? I think I saw a PR at one point where you were exploring config migration patterns, so you have probably spent more time thinking about this than me.

@lstein (Collaborator, Author) commented Jun 15, 2024

What do you think about all of this? I think I saw a PR at one point where you were exploring config migration patterns, so you have probably spent more time thinking about this than me.

At one point I had written a system for registering config migrations using decorators, in which each migration step was executed by a separate function decorated with its from and to versions. However, psychedelicious and I were unable to agree on the architecture and I withdrew the PR. The details are tedious, but see the conversation at PR #6243.
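
Roughly, such a decorator-registered migration system looks like this (an illustrative sketch; the names and signatures are invented here, not taken from PR #6243):

```python
# Illustrative sketch of a decorator-registered config migration system;
# names and signatures are invented here, not taken from PR #6243.
from typing import Callable

MIGRATIONS: dict[str, tuple[str, Callable[[dict], dict]]] = {}

def config_migration(from_version: str, to_version: str):
    """Register a migration step keyed by the version it migrates from."""
    def decorator(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        MIGRATIONS[from_version] = (to_version, fn)
        return fn
    return decorator

@config_migration(from_version="4.0.1", to_version="4.0.2")
def drop_convert_cache(config: dict) -> dict:
    config.pop("convert_cache", None)
    return config

def migrate(config: dict, version: str, target: str) -> dict:
    # Chain registered steps until the config reaches the target version.
    while version != target:
        version, step = MIGRATIONS[version]
        config = step(config)
    return config
```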

@psychedelicious (Collaborator):

Maybe we should handle config migration in the DB migrator, since it's doing more than DB migrations now? I'd prefer to work through that separately from this PR, though.

@RyanJDick (Collaborator) left a comment

As I review again, I'm realizing that we don't really have a strategy with config version numbers. It seems like it was in-sync with app versions at one point. Now it's neither in-sync nor following semver.

Given this, plus the awkwardness with handling convert_cache_dir, maybe we should just leave both convert_cache and convert_cache_dir as deprecated fields and figure out a proper solution in a future PR. (I'm also not opposed to removing convert_cache now - it doesn't make the situation any worse.)

Two review comments on invokeai/app/services/config/config_default.py (outdated, resolved)
@lstein (Collaborator, Author) commented Jun 18, 2024

As I review again, I'm realizing that we don't really have a strategy with config version numbers. It seems like it was in-sync with app versions at one point. Now it's neither in-sync nor following semver.

Given this plus the awkwardness with handling convert_cache_dir, maybe we should just leave both convert_cache and convert_cache_dir as deprecated fields, and figure out a proper solution in a future PR. (I'm also not opposed to removing convert_cache now - it doesn't make the situation any worse.)

I agree. We need to work out a system for coordinating the app, config file, and database schema version numbers. We should probably implement a single migration system that is adapted for config schema, db schema and root filesystem migrations.

What would happen if we have a rule that we use the app version for the config and db schemas? The main complication would be that each time we make a breaking schema change, we have to update the app version within the PR. Given that PRs are merged at different times and in different orders, this would be a management headache.

@lstein lstein requested a review from RyanJDick June 18, 2024 17:45
@lstein lstein enabled auto-merge (squash) June 19, 2024 14:26
@RyanJDick (Collaborator):

I did a bunch of manual testing. Here are the results.

Test plan

  1. Before upgrade, I installed the following single-file models:
  2. I used all of the models listed above to cause them to be converted to diffusers format and written to the convert cache.
  3. I added convert_cache_dir and convert_cache explicitly to the config file to test how they are handled in the migration.
  4. Upgraded to this branch.
  5. Confirmed that the config file is updated as expected.
  6. Confirmed that the convert cache was deleted.
  7. Confirmed that all of the single-file models still work.
  8. Checked for reproducibility by re-running with the same params as before the upgrade (see issues below).
  9. Tested for RAM cache thrashing by experimenting with different cache sizes. As expected, there is some thrashing at low cache sizes. Anecdotally, at 4 GB there was a lot of thrashing; at 7.5 GB things ran pretty smoothly. I didn't try to get more granular than that.
  10. Deleted and re-added the single-file models after the upgrade to confirm that new single-file models can be added without issues.
  11. Manually ran a conversion on each of the single-file models.
  12. Confirmed that the converted models still work.

Issues encountered that were not introduced by this PR:

  • I tried to test with this SDXL VAE: fix-fp16-errors-sdxl-lower-memory-use-sdxl-vae-fp16-fix-by-madebyollin. But it was incorrectly recognized as an SD1.5 VAE by the model probe. (I suppose that's expected, since the SD1.5 and SDXL VAEs share the same model architecture.) This might be fixed by [MM] Add support for probing and loading SDXL VAE checkpoint files #6524 (comment), which was merged after I ran these tests.
  • I tried to test with this SDXL ControlNet: ttplanetSDXLControlnet_v20Fp16. But, it failed conversion. Upon inspection, the weights appear to already be in diffusers format, but we attempted to convert them again.
  • I tried to test with this SDXL ControlNet: qrpatternSdxl_v01256512eSdxl. But, model import failed with error: Unable to determine model type for /home/ryanjdick3/Downloads/qrpatternSdxl_v01256512eSdxl.safetensors.
  • Clicking "Use All" does not clear the currently-selected VAE if the correct VAE is the default VAE.
  • I installed the SDXL IP-Adapter from the "Starter Models" section of the models tab. I still got this warning the first time that I used it: The image encoder required by this IP Adapter (ip_adapter_sd_image_encoder) is not installed. Downloading and installing now. This may take a while..
  • We allow users to click "Convert" for all model types, but then we fail with The model with key xxx is not a main checkpoint model. We should just hide the button if it's not supported.

Issues introduced by this PR:

There is a reproducibility problem with some models before/after this PR.
Here are two images generated with the same configs and the CyberRealistic_V4.1_FP16 model (one of the starter models).

Before: [image]

After: [image]

This issue did not seem to affect all models.

@RyanJDick (Collaborator):

I just noticed that we are still on diffusers==0.27.2 in this branch. In 0.28.0, diffusers did a "Massive Refactor of from_single_file" (https://github.com/huggingface/diffusers/releases/tag/v0.28.0). I feel like we should be building this branch against >=0.28.0 - especially since we are going to want to bump diffusers soon to get SD3 support.

@RyanJDick (Collaborator):

What would happen if we have a rule that we use the app version for the config and db schemas? The main complication would be that each time we make a breaking schema change, we have to update the app version within the PR. Given that PRs are merged at different times and in different orders, this would be a management headache.

Agreed. For the reason you've described, I think at the very least we'd want an internal app version separate from the app release version. There might be reasons for further splitting which systems get versioned together, but I'll save that discussion for another place.

@lstein (Collaborator, Author) commented Jun 21, 2024

I just noticed that we are still on diffusers==0.27.2 in this branch. In 0.28.0, diffusers did a "Massive Refactor of from_single_file" (https://github.com/huggingface/diffusers/releases/tag/v0.28.0). I feel like we should be building this branch against >=0.28.0 - especially since we are going to want to bump diffusers soon to get SD3 support.

There are some significant changes involved in moving to 0.28.0, and these have gone into PR #6512 for separate review. Now that you've done all this testing I'm not comfortable including them in this PR. For example, with 0.28 and higher, the ControlNetModel defined in invokeai.backend.util.hotfixes can no longer include FromOriginalControlNetMixin as a base class because of an apparent dangling import internal to the diffusers package. In the SD3 PR, I removed this base class without apparent consequences, but this needs more extensive testing.

@lstein force-pushed the lstein/feat/load-one-file branch from 3dbcf9a to 5c8cf99 on June 21, 2024 02:28
@lstein (Collaborator, Author) commented Jun 21, 2024

Issues encountered that were not introduced by this PR:
I tried to test with this SDXL VAE: fix-fp16-errors-sdxl-lower-memory-use-sdxl-vae-fp16-fix-by-madebyollin But, it was incorrectly recognized as a SD1.5 VAE by the model probe. (I suppose that's expected since the SD1.5 and SDXL VAE have the same model architecture.) This might be fixed by #6524 (comment), which was merged after I ran these tests.

With that PR, this SDXL VAE is recognized and loaded. However, it looks washed out and I don't know if there's something wrong with the model or with the loading. I also tried with the older diffusers conversion method, and got the same washed out result, so I tend to think it is a model problem.

I tried to test with this SDXL ControlNet: ttplanetSDXLControlnet_v20Fp16. But, it failed conversion. Upon inspection, the weights appear to already be in diffusers format, but we attempted to convert them again.

Yes, it is in diffusers format. To load it we would need to put it into a folder, rename it diffusion_pytorch_model.safetensors and add a config file to the directory. This probably has to be done in the installer. Do you think it's worth it?
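
Concretely, the installer would need to do something like the following (a hypothetical sketch; the paths are placeholders, and the config.json would have to be sourced from a matching stock controlnet):

```python
# Hypothetical sketch of restructuring a diffusers-format single-file
# controlnet into a loadable folder; paths are placeholders.
import shutil
from pathlib import Path

src = Path("ttplanetSDXLControlnet_v20Fp16.safetensors")
dest = Path("models/sdxl/controlnet/ttplanetSDXLControlnet_v20Fp16")
dest.mkdir(parents=True, exist_ok=True)
shutil.move(str(src), str(dest / "diffusion_pytorch_model.safetensors"))
# A config.json matching the architecture would also need to be written
# alongside the weights, e.g. copied from a stock SDXL controlnet repo.
```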

I tried to test with this SDXL ControlNet: qrpatternSdxl_v01256512eSdxl. But, model import failed with error: Unable to determine model type for /home/ryanjdick3/Downloads/qrpatternSdxl_v01256512eSdxl.safetensors.

This seems to be a controlnet format that I haven't run into yet. It will need a bit of new backend support: e.g. https://github.com/kohya-ss/ControlNet-LLLite-ComfyUI. There are a total of five LLLite controlnet models on civitai. It looks kinda nice. Do you think we should implement support, or wait until someone makes a feature request?

Clicking "Use All" does not clear the currently-selected VAE if the correct VAE is the default VAE.

I noticed that sometimes I didn't get the right VAE when remixing. Thanks for figuring out the way to reproduce the problem. I'll make this a bug report.

I installed the SDXL IP-Adapter from the "Starter Models" section of the models tab. I still got this warning the first time that I used it: The image encoder required by this IP Adapter (ip_adapter_sd_image_encoder) is not installed. Downloading and installing now. This may take a while..

I was unable to reproduce this. However, I think there is a problem at the frontend level in which the human-readable name and description of starter models are not passed to the installer, so you get the automatically-inferred model name and description instead.

We allow users to click "Convert" for all model types, but then we fail with The model with key xxx is not a main checkpoint model. We should just hide the button if it's not supported.

Indeed. I'll make this a bug report.

@RyanJDick RyanJDick disabled auto-merge June 21, 2024 12:53
@lstein (Collaborator, Author) commented Jun 21, 2024

@RyanJDick Are there any remaining issues?

@lstein lstein requested a review from RyanJDick June 21, 2024 19:39
@lstein (Collaborator, Author) commented Jun 21, 2024

What would happen if we have a rule that we use the app version for the config and db schemas? The main complication would be that each time we make a breaking schema change, we have to update the app version within the PR. Given that PRs are merged at different times and in different orders, this would be a management headache.

Agreed. For the reason you've described, I think at the very least we'd want an internal app version separate from the app release version. There might be reasons for further splitting which systems get versioned together, but I'll save that discussion for another place.

I think the coordinated config, filesystem and db migration system needs its own PR.

@lstein lstein merged commit 3e0fb45 into main Jun 27, 2024
14 checks passed
@lstein lstein deleted the lstein/feat/load-one-file branch June 27, 2024 21:31