
Load single-file checkpoints directly without conversion #6510

Merged: 19 commits merged into main from lstein/feat/load-one-file on Jun 27, 2024

Conversation

@lstein (Collaborator) commented Jun 12, 2024

Summary

This PR takes advantage of the new diffusers from_single_file() method to load single-file checkpoints (e.g. .safetensors files) directly. It completely removes the convert cache directory, avoiding model duplication and conserving the user's disk space. Single-file loaders for main models, VAEs, and controlnets have been implemented.
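
For reference, a minimal sketch of the underlying diffusers calls (paths are placeholders; InvokeAI's model manager wraps these in its own loader classes rather than invoking them directly like this):

```python
# Minimal sketch of the diffusers single-file loading API this PR builds on.
# Paths are placeholders; InvokeAI's model manager wraps these calls in its
# own loader classes rather than invoking them directly like this.
from diffusers import AutoencoderKL, ControlNetModel, StableDiffusionPipeline

# An entire SD pipeline, loaded from one .safetensors checkpoint:
pipe = StableDiffusionPipeline.from_single_file("v1-5-pruned-emaonly.safetensors")

# Standalone VAE and controlnet checkpoints load the same way:
vae = AutoencoderKL.from_single_file("vae-ft-mse-840000-ema-pruned.safetensors")
cnet = ControlNetModel.from_single_file("control_v11p_sd15_canny.safetensors")
```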

Manual conversion from single files to diffusers directories is still supported. As before, this is a one-way operation which deletes the original .safetensors file (if it is in the models directory).

Loading is fast, with no discernible difference in rendering performance. The main drawback of single-file loading is that the entire pipeline is loaded into RAM at once, causing a surge in RAM usage compared to loading a single diffusers submodel. However, once the pipeline is in RAM, its submodels are cached individually, and subsequent load calls return the cached submodel for as long as it remains in the cache. There may be a performance penalty if the user has set the RAM cache size too small to hold a whole pipeline; under those circumstances a single-file pipeline will be re-read from disk each time it is needed.
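
Conceptually, the caching behavior works something like this (a simplified, hypothetical sketch; none of these names are InvokeAI's actual cache API):

```python
# Simplified, hypothetical sketch of per-submodel caching after a
# single-file load; none of these names are InvokeAI's real API.
from types import SimpleNamespace

def load_whole_pipeline(path: str) -> SimpleNamespace:
    """Stand-in for a from_single_file() call that loads everything at once."""
    return SimpleNamespace(unet="<unet>", vae="<vae>", text_encoder="<te>")

class SubmodelCache:
    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], object] = {}

    def get(self, path: str, submodel: str) -> object:
        if (path, submodel) not in self._entries:
            # A single-file load pulls the whole pipeline into RAM at once
            # (the RAM surge), but each component is then cached separately,
            # so later requests for any submodel are served from the cache.
            pipe = load_whole_pipeline(path)
            for name in ("unet", "vae", "text_encoder"):
                self._entries[(path, name)] = getattr(pipe, name)
        return self._entries[(path, submodel)]
```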

It might be a good idea for the manager to produce a warning message when the cache size is too small to hold a whole pipeline. Let me know if you think this should be implemented.

Related Issues / Discussions

See the Discord thread starting at: https://discord.com/channels/1020123559063990373/1049495067846524939/1239774642449813514

QA Instructions

  1. Be aware that this PR will delete the models/.convert_cache directory, migrate the database schema to v12, and upgrade the config schema to 4.0.2. These migrations remove the convert cache directory and the convert_cache size configuration option. They do not remove the convert_cache_dir configuration option, because the database migration needs to know where the user stored the converted files in order to delete them (see the sketch after this list).
  2. Confirm that single-file model checkpoints and VAEs load properly.
  3. Confirm that the "Convert" button in the model manager UI works as expected.
  4. Confirm that no issues arise when switching among models or setting the ram configuration option to a low value.
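
For reviewers curious about the mechanics of step 1, the migration must read the old config value before deleting the directory, roughly like this (a hypothetical sketch; the names do not match the PR's actual migration code):

```python
# Hypothetical sketch of the v12 migration described in step 1 above;
# function and attribute names are illustrative, not the PR's actual code.
import shutil
from pathlib import Path

def migrate_to_v12(config) -> None:
    # convert_cache_dir must still exist in the config at migration time,
    # which is why this PR keeps that option while dropping convert_cache.
    cache_dir = Path(config.convert_cache_dir)  # e.g. models/.convert_cache
    if cache_dir.exists():
        shutil.rmtree(cache_dir)
```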

Merge Plan

Squash merge when approved.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)

@github-actions bot added labels api, python, backend, services, python-tests on Jun 12, 2024
@RyanJDick (Collaborator) left a comment

I did a first pass over the code. I still need to find some time to test it all.


The issue that you called out where the DB migration requires the convert_cache_path config value is quite awkward. Generally, this will be an issue anytime a migration depends on a config value.

A few general strategies come to mind:

  1. If a migration references a config value, then to be rigorous we should never remove that config value. Of course, we could remove it as part of a hard cutoff at some point in the future.
  2. Use our current migration pattern to handle both config / DB migrations in the same place. I.e. it would become more of an app migration system rather than a DB migration system. We already do some file cleanup in our migrations, so it's not a huge stretch to handle config migrations there as well. I haven't looked at the implications for the app launch logic though.
  3. Instruct users on how to manually remove their convert cache rather than doing it in a migration.

What do you think about all of this? I think I saw a PR at one point where you were exploring config migration patterns, so you have probably spent more time thinking about this than me.

@lstein (Collaborator, Author) commented Jun 15, 2024

What do you think about all of this? I think I saw a PR at one point where you were exploring config migration patterns, so you have probably spent more time thinking about this than me.

At one point I had written a system for registering config migrations using decorators, in which each migration step was executed by a separate function decorated with its from and to versions. However, psychedelicious and I were unable to agree on the architecture and I withdrew the PR. The details are tedious, but see the conversation at PR #6243.
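
Roughly, such a decorator-registered migration system looks like this (an illustrative sketch; the names and signatures are invented here, not taken from PR #6243):

```python
# Illustrative sketch of a decorator-registered config migration system;
# names and signatures are invented here, not taken from PR #6243.
from typing import Callable

MIGRATIONS: dict[str, tuple[str, Callable[[dict], dict]]] = {}

def config_migration(from_version: str, to_version: str):
    """Register a migration step keyed by the version it migrates from."""
    def decorator(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        MIGRATIONS[from_version] = (to_version, fn)
        return fn
    return decorator

@config_migration(from_version="4.0.1", to_version="4.0.2")
def drop_convert_cache(config: dict) -> dict:
    config.pop("convert_cache", None)
    return config

def migrate(config: dict, version: str, target: str) -> dict:
    # Chain registered steps until the config reaches the target version.
    while version != target:
        version, step = MIGRATIONS[version]
        config = step(config)
    return config
```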

@psychedelicious (Collaborator):

Maybe we should handle config migration in the DB migrator, since it's doing more than DB migrations now? I'd prefer to work through that separately from this PR, though.

@RyanJDick (Collaborator) left a comment

As I review again, I'm realizing that we don't really have a strategy with config version numbers. It seems like it was in-sync with app versions at one point. Now it's neither in-sync nor following semver.

Given this, plus the awkwardness with handling convert_cache_dir, maybe we should just leave both convert_cache and convert_cache_dir as deprecated fields and figure out a proper solution in a future PR. (I'm also not opposed to removing convert_cache now - it doesn't make the situation any worse.)

Two review comments on invokeai/app/services/config/config_default.py (outdated, resolved)
@lstein (Collaborator, Author) commented Jun 18, 2024

As I review again, I'm realizing that we don't really have a strategy with config version numbers. It seems like it was in-sync with app versions at one point. Now it's neither in-sync nor following semver.

Given this plus the awkwardness with handling convert_cache_dir, maybe we should just leave both convert_cache and convert_cache_dir as deprecated fields, and figure out a proper solution in a future PR. (I'm also not opposed to removing convert_cache now - it doesn't make the situation any worse.)

I agree. We need to work out a system for coordinating the app, config file, and database schema version numbers. We should probably implement a single migration system that is adapted for config schema, db schema and root filesystem migrations.

What would happen if we have a rule that we use the app version for the config and db schemas? The main complication would be that each time we make a breaking schema change, we have to update the app version within the PR. Given that PRs are merged at different times and in different orders, this would be a management headache.

@lstein lstein requested a review from RyanJDick June 18, 2024 17:45
@lstein lstein enabled auto-merge (squash) June 19, 2024 14:26
@RyanJDick (Collaborator):

I did a bunch of manual testing. Here are the results.

Test plan

  1. Before upgrade, I installed the following single-file models:
  2. I used all of the models listed above to cause them to be converted to diffusers format and written to the convert cache.
  3. I added convert_cache_dir and convert_cache explicitly to the config file to test how they are handled in the migration.
  4. Upgraded to this branch.
  5. Confirmed that the config file is updated as expected.
  6. Confirmed that the convert cache was deleted.
  7. Confirmed that all of the single-file models still work.
  8. Checked for reproducibility by re-running with the same params as before the upgrade (see issues below).
  9. Tested for RAM cache thrashing by experimenting with different cache sizes. As expected, there is some thrashing at low cache sizes. Anecdotally, at 4 GB there was a lot of thrashing; at 7.5 GB things ran pretty smoothly. I didn't try to get more granular than that.
  10. Deleted and re-added the single-file models after the upgrade to confirm that new single-file models can be added without issues.
  11. Manually ran a conversion on each of the single-file models.
  12. Confirmed that the converted models still work.

Issues encountered that were not introduced by this PR:

  • I tried to test with this SDXL VAE: fix-fp16-errors-sdxl-lower-memory-use-sdxl-vae-fp16-fix-by-madebyollin. But it was incorrectly recognized as an SD1.5 VAE by the model probe. (I suppose that's expected, since the SD1.5 and SDXL VAEs share the same model architecture.) This might be fixed by [MM] Add support for probing and loading SDXL VAE checkpoint files #6524 (comment), which was merged after I ran these tests.
  • I tried to test with this SDXL ControlNet: ttplanetSDXLControlnet_v20Fp16. But, it failed conversion. Upon inspection, the weights appear to already be in diffusers format, but we attempted to convert them again.
  • I tried to test with this SDXL ControlNet: qrpatternSdxl_v01256512eSdxl. But, model import failed with error: Unable to determine model type for /home/ryanjdick3/Downloads/qrpatternSdxl_v01256512eSdxl.safetensors.
  • Clicking "Use All" does not clear the currently-selected VAE if the correct VAE is the default VAE.
  • I installed the SDXL IP-Adapter from the "Starter Models" section of the models tab. I still got this warning the first time that I used it: The image encoder required by this IP Adapter (ip_adapter_sd_image_encoder) is not installed. Downloading and installing now. This may take a while..
  • We allow users to click "Convert" for all model types, but then we fail with The model with key xxx is not a main checkpoint model. We should just hide the button if it's not supported.

Issues introduced by this PR:

There is a reproducibility problem with some models before/after this PR.
Here are two images generated with the same configs and the CyberRealistic_V4.1_FP16 model (one of the starter models).

Before: [image]

After: [image]

This issue did not seem to affect all models.

@RyanJDick (Collaborator):

I just noticed that we are still on diffusers==0.27.2 in this branch. In 0.28.0, diffusers did a "Massive Refactor of from_single_file" (https://github.com/huggingface/diffusers/releases/tag/v0.28.0). I feel like we should be building this branch against >=0.28.0 - especially since we are going to want to bump diffusers soon to get SD3 support.

@RyanJDick (Collaborator):

What would happen if we have a rule that we use the app version for the config and db schemas? The main complication would be that each time we make a breaking schema change, we have to update the app version within the PR. Given that PRs are merged at different times and in different orders, this would be a management headache.

Agreed. For the reason you've described, I think at the very least we'd want an internal app version separate from the app release version. There might be reasons for further splitting which systems get versioned together, but I'll save that discussion for another place.

@lstein (Collaborator, Author) commented Jun 21, 2024

I just noticed that we are still on diffusers==0.27.2 in this branch. In 0.28.0, diffusers did a "Massive Refactor of from_single_file" (https://github.com/huggingface/diffusers/releases/tag/v0.28.0). I feel like we should be building this branch against >=0.28.0 - especially since we are going to want to bump diffusers soon to get SD3 support.

There are some significant changes involved in moving to 0.28.0, and these have gone into PR #6512 for separate review. Now that you've done all this testing I'm not comfortable including them in this PR. For example, with 0.28 and higher, the ControlNetModel defined in invokeai.backend.util.hotfixes can no longer include FromOriginalControlNetMixin as a base class because of an apparent dangling import internal to the diffusers package. In the SD3 PR, I removed this base class without apparent consequences, but this needs more extensive testing.

@lstein force-pushed the lstein/feat/load-one-file branch from 3dbcf9a to 5c8cf99 on June 21, 2024 02:28
@lstein (Collaborator, Author) commented Jun 21, 2024

Issues encountered that were not introduced by this PR:
I tried to test with this SDXL VAE: fix-fp16-errors-sdxl-lower-memory-use-sdxl-vae-fp16-fix-by-madebyollin But, it was incorrectly recognized as a SD1.5 VAE by the model probe. (I suppose that's expected since the SD1.5 and SDXL VAE have the same model architecture.) This might be fixed by #6524 (comment), which was merged after I ran these tests.

With that PR, this SDXL VAE is recognized and loaded. However, it looks washed out and I don't know if there's something wrong with the model or with the loading. I also tried with the older diffusers conversion method, and got the same washed out result, so I tend to think it is a model problem.

I tried to test with this SDXL ControlNet: ttplanetSDXLControlnet_v20Fp16. But, it failed conversion. Upon inspection, the weights appear to already be in diffusers format, but we attempted to convert them again.

Yes, it is in diffusers format. To load it we would need to put it into a folder, rename it diffusion_pytorch_model.safetensors and add a config file to the directory. This probably has to be done in the installer. Do you think it's worth it?
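
Concretely, the installer would need to do something like the following (a hypothetical sketch; the paths are placeholders, and the config.json would have to be sourced from a matching stock controlnet):

```python
# Hypothetical sketch of restructuring a diffusers-format single-file
# controlnet into a loadable folder; paths are placeholders.
import shutil
from pathlib import Path

src = Path("ttplanetSDXLControlnet_v20Fp16.safetensors")
dest = Path("models/sdxl/controlnet/ttplanetSDXLControlnet_v20Fp16")
dest.mkdir(parents=True, exist_ok=True)
shutil.move(str(src), str(dest / "diffusion_pytorch_model.safetensors"))
# A config.json matching the architecture would also need to be written
# alongside the weights, e.g. copied from a stock SDXL controlnet repo.
```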

I tried to test with this SDXL ControlNet: qrpatternSdxl_v01256512eSdxl. But, model import failed with error: Unable to determine model type for /home/ryanjdick3/Downloads/qrpatternSdxl_v01256512eSdxl.safetensors.

This seems to be a controlnet format that I haven't run into yet. It will need a bit of new backend support: e.g. https://github.com/kohya-ss/ControlNet-LLLite-ComfyUI. There are a total of five LLLite controlnet models on civitai. It looks kinda nice. Do you think we should implement support, or wait until someone makes a feature request?

Clicking "Use All" does not clear the currently-selected VAE if the correct VAE is the default VAE.

I noticed that sometimes I didn't get the right VAE when remixing. Thanks for figuring out the way to reproduce the problem. I'll make this a bug report.

I installed the SDXL IP-Adapter from the "Starter Models" section of the models tab. I still got this warning the first time that I used it: The image encoder required by this IP Adapter (ip_adapter_sd_image_encoder) is not installed. Downloading and installing now. This may take a while..

I was unable to reproduce this. However, I think there is a problem at the frontend level in which the human-readable name and description of starter models are not passed to the installer, so you get the automatically-inferred model name and description instead.

We allow users to click "Convert" for all model types, but then we fail with The model with key xxx is not a main checkpoint model. We should just hide the button if it's not supported.

Indeed. I'll make this a bug report.

@RyanJDick RyanJDick disabled auto-merge June 21, 2024 12:53
@lstein (Collaborator, Author) commented Jun 21, 2024

@RyanJDick Are there any remaining issues?

@lstein lstein requested a review from RyanJDick June 21, 2024 19:39
@lstein (Collaborator, Author) commented Jun 21, 2024

What would happen if we have a rule that we use the app version for the config and db schemas? The main complication would be that each time we make a breaking schema change, we have to update the app version within the PR. Given that PRs are merged at different times and in different orders, this would be a management headache.

Agreed. For the reason you've described, I think at the very least we'd want an internal app version separate from the app release version. There might be reasons for further splitting which systems get versioned together, but I'll save that discussion for another place.

I think the coordinated config, filesystem and db migration system needs its own PR.

@lstein lstein merged commit 3e0fb45 into main Jun 27, 2024
14 checks passed
@lstein lstein deleted the lstein/feat/load-one-file branch June 27, 2024 21:31