
fix(backend): revert non-blocking device transfer #6624

Merged 1 commit into main on Jul 15, 2024

Conversation

@psychedelicious (Collaborator) commented Jul 15, 2024

Summary

In #6490 we enabled non-blocking torch device transfers throughout the model manager's memory management code. When this feature is used, torch is supposed to wait until the tensor transfer has completed before allowing any access to the tensor. In theory, that should make it safe to use.
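For context, a non-blocking transfer looks something like the sketch below. This is a generic illustration of the torch API, not the model manager's actual code; the tensor shapes and names are made up.

import torch

# Generic illustration only - not the model manager's actual code.
# Pinned (page-locked) host memory is what allows the copy to overlap
# with other host-side work instead of blocking on it.
if torch.cuda.is_available():
    cpu_tensor = torch.randn(1024, 1024, pin_memory=True)

    # Returns immediately; the copy is queued on the current CUDA stream.
    gpu_tensor = cpu_tensor.to("cuda", non_blocking=True)

    # Work submitted on the same stream is ordered after the copy, so this
    # sum sees the completed data - the case torch handles automatically.
    print(gpu_tensor.sum())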

This provides a small performance improvement, but it causes race conditions in some situations. Specific platforms/systems are affected, and complicated data dependencies can make this unsafe:

- Intermittent black images on MPS devices - reported on Discord and #6545, fixed with special handling in #6549.
- Intermittent OOMs and black images on a P4000 GPU on Windows - reported in #6613, fixed in this commit.

On my system, I haven't experienced any issues with generation, but targeted testing of non-blocking ops did expose a race condition when moving tensors from CUDA to CPU.

One workaround is to use torch streams with manual sync points. Our application logic is complicated enough that this would be a lot of work and feels ripe for edge cases and missed spots.
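For reference, the stream-based workaround would look roughly like the sketch below. This is a hypothetical illustration of the approach, not code from this PR:

import torch

# Hypothetical sketch of the stream-based workaround - not code from this PR.
# Transfers run on a dedicated stream, and every producer and consumer has to
# be synchronized against that stream by hand.
if torch.cuda.is_available():
    copy_stream = torch.cuda.Stream()
    x = torch.randn(1024, 1024, device="cuda")

    # The copy stream must wait for whatever produced x on the default stream...
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        x_cpu = x.to("cpu", non_blocking=True)

    # ...and anything that reads x_cpu must wait for the copy stream. Missing
    # either sync point anywhere in the application reintroduces the race.
    copy_stream.synchronize()
    print(x_cpu.sum())

Every call site that moves or reads a transferred tensor would need one of these sync points, which is the maintenance burden described above.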

Much safer is to fully revert non-blocking - which is what this change does.

Test script demonstrating CUDA -> CPU race condition

This script induces the race condition. The tensor's value differs when read immediately after the device transfer versus after waiting a couple of seconds for torch to sync. For me, it reliably produces an inconsistency on the second GPU -> CPU transfer with non_blocking=True.

import time

import torch

if __name__ == "__main__":
    seed = 0
    torch.cuda.empty_cache()
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    for (from_device, to_device) in (("cpu", "cuda"), ("cuda", "cpu")):
        x = torch.zeros(32, 256, 256, 256).to(from_device)
        y = torch.ones(32, 256, 256, 256).to(from_device)

        for non_blocking in (True, False):
            print(f"\033[94mfrom {from_device} to {to_device} non_blocking={non_blocking}\033[0m")
            for i in range(3):
                print(f"\033[96m{i}\033[0m")
                # Reduce the large tensors to a scalar and transfer it, optionally non-blocking.
                t = (x.max() + y.max()).to(torch.device(to_device), non_blocking=non_blocking)
                print("before waiting 2s:", t)
                # Snapshot the value now, then give torch time to finish any pending copy.
                old_t = t.clone()
                time.sleep(2.0)
                print("after waiting 2s:", t)
                if old_t != t:
                    print("\033[91minconsistency!\033[0m")

Related Issues / Discussions

Closes #6613 ([bug]: VRAM not being released)

QA Instructions

I have tried the combinations of models below and had no issues. I don't expect this change to alter behaviour; it's a straight revert of #6490 and #6549.

SDXL

  • ControlNet
  • LoRA
  • TI
  • IP Adapter

SD1.5

  • ControlNet
  • LoRA
  • TI
  • IP Adapter

Merge Plan

We'll do a bugfix release with this once merged.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)

@github-actions bot added the python (PRs that change python files) and backend (PRs that change backend files) labels on Jul 15, 2024
@RyanJDick (Collaborator) left a comment

Thank you for the thorough investigation.

The changes look good to me. I double-checked that there are no remaining references to non_blocking, and ran a quick smoke test.

@psychedelicious merged commit 3834391 into main Jul 15, 2024
14 checks passed
@psychedelicious deleted the psyche/fix/backend/revert-non-blocking branch July 15, 2024 22:59