Recover gracefully from VRAM out of memory errors #5793
Closed
What type of PR is this? (check all applicable)
Have you discussed this change with the InvokeAI team?
Have you updated all relevant documentation?
Description
At least on my system, if the model manager runs out of VRAM while moving a model into the GPU, the partially loaded model gets stuck in VRAM and can't easily be removed. This makes the model unusable and ties up precious VRAM.
I encountered this when playing with large language models on the same system, but I suspect it will also happen if a video game is running. I tried various approaches to recover from this state, including clearing the VRAM cache, deleting the model object, and running garbage collection (roughly as sketched below), but without success.
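For reference, the recovery attempts were along these lines (a sketch, not InvokeAI code; `model` stands in for the stuck model object), and none of them released the partially loaded weights:

```python
import gc
import torch

# `model` stands in for the partially copied model object held by the cache.
# Dropping the reference, collecting garbage, and emptying the allocator cache
# is the usual way to hand VRAM back, but it did not help in this case.
model = None                 # drop the Python reference (equivalent to `del model`)
gc.collect()                 # collect any now-unreachable tensors
torch.cuda.empty_cache()     # return cached, unallocated VRAM blocks to the driver
```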
This PR avoids the issue by checking for sufficient free VRAM before trying to move a model onto a CUDA GPU. If there is not enough room, it raises a `torch.cuda.OutOfMemoryError`, and the message is propagated to the front end. If more VRAM becomes available later, invocations will begin to work again.

Note: this pull request is against `main`. The model manager code has changed a bit, so I'm making a separate PR for `next`.
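As an illustration of the idea (a minimal sketch with hypothetical names, not the code in this PR), the pre-move check amounts to comparing an estimate of the model's size against the free memory reported by `torch.cuda.mem_get_info()` and failing early:

```python
import torch

def move_model_to_gpu(model: torch.nn.Module, device: torch.device) -> torch.nn.Module:
    """Illustrative sketch only: refuse to start copying a model onto a CUDA
    device unless the reported free VRAM can plausibly hold it."""
    if device.type == "cuda":
        # Rough size estimate: bytes occupied by parameters and buffers.
        needed = sum(p.numel() * p.element_size() for p in model.parameters())
        needed += sum(b.numel() * b.element_size() for b in model.buffers())

        free, _total = torch.cuda.mem_get_info(device)
        if needed > free:
            # Failing up front leaves the model intact in RAM instead of
            # half-copied into VRAM; the error propagates to the frontend.
            raise torch.cuda.OutOfMemoryError(
                f"Not enough free VRAM on {device}: need ~{needed / 2**30:.2f} GiB, "
                f"have {free / 2**30:.2f} GiB free."
            )
    return model.to(device)
```

Because the failure happens before the copy starts, nothing is left half-resident in VRAM, so a later invocation can simply retry once memory is available.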
Related Tickets & Documents
QA Instructions, Screenshots, Recordings
Launch the InvokeAI web service together with another application that uses a lot of GPU VRAM. For my testing, I used ollama with a large model loaded. Run a generation and confirm it fails with an out-of-memory error. Repeat a few times; you should get the same error each time. Now kill the other application to free up VRAM and try to generate an image. It should work!
Merge Plan
Can merge when approved.
Added/updated tests?
No, tests have not been included.
[optional] Are there any post deployment tasks we need to perform?