fix(backend): mps should not use non_blocking
#6549
Merged
Conversation
We can get black outputs when moving tensors from CPU to MPS. It appears MPS to CPU is fine. See:

- pytorch/pytorch#107455
- https://discuss.pytorch.org/t/should-we-set-non-blocking-to-true/38234/28

Changes:

- Add properties for each device on `TorchDevice` as a convenience.
- Add `get_non_blocking` static method on `TorchDevice`. This utility takes a torch device and returns the flag to be used for `non_blocking` when moving a tensor to the device provided.
- Update model patching and caching APIs to use this new utility.

Fixes: #6545
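A minimal sketch of what such a helper could look like, assuming only what the description above states; the actual `TorchDevice` implementation in the repository may differ in its details:

```python
import torch


class TorchDevice:
    """Illustrative sketch of the utility described above.

    The real class also exposes convenience properties for each device type;
    only the non_blocking helper is sketched here.
    """

    @staticmethod
    def get_non_blocking(to_device: torch.device) -> bool:
        """Return the non_blocking flag to use when moving a tensor to `to_device`.

        Non-blocking transfers onto MPS can intermittently produce black outputs,
        so the flag is only enabled when the target device is not MPS.
        """
        return False if to_device.type == "mps" else True


# Hypothetical call site: replaces a hard-coded non_blocking=True.
# tensor = tensor.to(device, non_blocking=TorchDevice.get_non_blocking(device))
```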
psychedelicious requested review from lstein, blessedcoolant, brandonrising, RyanJDick and hipsterusername as code owners
June 27, 2024 09:18
github-actions bot added the python (PRs that change python files) and backend (PRs that change backend files) labels
Jun 27, 2024
RyanJDick
approved these changes
Jun 27, 2024
The changes make sense to me.

I ran the following tests:

- On CUDA
  - Generated a handful of images with LoRA and IP-Adapter. Everything looked good. I confirmed that `non_blocking=True` was still being applied.
- On MPS
  - Confirmed that I could reproduce the problem on `main`.
  - Upgraded to this branch. Generated 5 images with LoRA + IP-Adapter. No issues.
psychedelicious added a commit that referenced this pull request
Jul 15, 2024
In #6490 we enabled non-blocking torch device transfers throughout the model manager's memory management code. When using this torch feature, torch attempts to wait until the tensor transfer has completed before allowing any access to the tensor. Theoretically, that should make this a safe feature to use.

This provides a small performance improvement but causes race conditions in some situations. Specific platforms/systems are affected, and complicated data dependencies can make this unsafe:

- Intermittent black images on MPS devices - reported on discord and #6545, fixed with special handling in #6549.
- Intermittent OOMs and black images on a P4000 GPU on Windows - reported in #6613, fixed in this commit.

On my system, I haven't experienced any issues with generation, but targeted testing of non-blocking ops did expose a race condition when moving tensors from CUDA to CPU. One workaround is to use torch streams with manual sync points. Our application logic is complicated enough that this would be a lot of work and feels ripe for edge cases and missed spots. Much safer is to fully revert non-blocking - which is what this change does.
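As an illustration of the race described in the commit message (not code from this repository): a `non_blocking=True` copy from CUDA into pinned CPU memory returns before the transfer finishes, so the host must synchronize before reading the destination tensor.

```python
import torch

if torch.cuda.is_available():
    src = torch.randn(4096, 4096, device="cuda")
    # non_blocking only takes effect (and only races) when the CPU tensor is pinned.
    dst = torch.empty(4096, 4096, pin_memory=True)

    dst.copy_(src, non_blocking=True)
    # The DMA transfer may still be in flight here; reading `dst` now is a race
    # and can observe stale or garbage values.
    torch.cuda.synchronize()  # manual sync point; after this, `dst` is safe to read
    print(dst.sum())
```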
Summary

We can get black outputs when moving tensors from CPU to MPS. It appears MPS to CPU is fine. See:

- pytorch/pytorch#107455
- https://discuss.pytorch.org/t/should-we-set-non-blocking-to-true/38234/28

Changes:

- Add properties for each device on `TorchDevice` as a convenience.
- Add `get_non_blocking` static method on `TorchDevice`. This utility takes a torch device and returns the flag to be used for `non_blocking` when moving a tensor to the device provided.
- Update model patching and caching APIs to use this new utility.

Related Issues / Discussions

Fixes: #6545

QA Instructions

For both MPS and CUDA:

Merge Plan

We have pagination merged into `main` but aren't ready for that to be released. Once this fix is tested and merged, we will probably want to create a `v4.2.5post1` branch off the `v4.2.5` tag, cherry-pick the fix, and do a release from the hotfix branch.

Checklist