
fix(backend): revert non-blocking device transfer #6624

Merged 1 commit into main on Jul 15, 2024

Conversation

@psychedelicious (Collaborator) commented Jul 15, 2024

Summary

In #6490 we enabled non-blocking torch device transfers throughout the model manager's memory management code. When this feature is used, torch is supposed to wait until the tensor transfer has completed before allowing any access to the tensor. In theory, that should make it safe to use.
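For context, a non-blocking transfer looks something like the sketch below. This is a generic illustration of the torch API, not the model manager's actual code; the tensor shapes and names are made up.

import torch

# Generic illustration only - not the model manager's actual code.
# Pinned (page-locked) host memory is what allows the copy to overlap
# with other host-side work instead of blocking on it.
if torch.cuda.is_available():
    cpu_tensor = torch.randn(1024, 1024, pin_memory=True)

    # Returns immediately; the copy is queued on the current CUDA stream.
    gpu_tensor = cpu_tensor.to("cuda", non_blocking=True)

    # Work submitted on the same stream is ordered after the copy, so this
    # sum sees the completed data - the case torch handles automatically.
    print(gpu_tensor.sum())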

This provides a small performance improvement, but it causes race conditions in some situations. Specific platforms/systems are affected, and complicated data dependencies can make this unsafe:

- Intermittent black images on MPS devices - reported on Discord and #6545, fixed with special handling in #6549.
- Intermittent OOMs and black images on a P4000 GPU on Windows - reported in #6613, fixed in this commit.

On my system, I haven't experienced any issues with generation, but targeted testing of non-blocking ops did expose a race condition when moving tensors from CUDA to CPU.

One workaround is to use torch streams with manual sync points. Our application logic is complicated enough that this would be a lot of work and feels ripe for edge cases and missed spots.
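For reference, the stream-based workaround would look roughly like the sketch below. This is a hypothetical illustration of the approach, not code from this PR:

import torch

# Hypothetical sketch of the stream-based workaround - not code from this PR.
# Transfers run on a dedicated stream, and every producer and consumer has to
# be synchronized against that stream by hand.
if torch.cuda.is_available():
    copy_stream = torch.cuda.Stream()
    x = torch.randn(1024, 1024, device="cuda")

    # The copy stream must wait for whatever produced x on the default stream...
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        x_cpu = x.to("cpu", non_blocking=True)

    # ...and anything that reads x_cpu must wait for the copy stream. Missing
    # either sync point anywhere in the application reintroduces the race.
    copy_stream.synchronize()
    print(x_cpu.sum())

Every call site that moves or reads a transferred tensor would need one of these sync points, which is the maintenance burden described above.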

Much safer is to fully revert non-blocking - which is what this change does.

Test script demonstrating CUDA -> CPU race condition

This script induces the race condition. The tensor's value differs when read immediately after the device transfer versus after waiting a couple of seconds for torch to sync. For me, it reliably produces an inconsistency on the second GPU -> CPU transfer with non_blocking=True.

import time

import torch

if __name__ == "__main__":
    seed = 0
    torch.cuda.empty_cache()
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    for (from_device, to_device) in (("cpu", "cuda"), ("cuda", "cpu")):
        x = torch.zeros(32, 256, 256, 256).to(from_device)
        y = torch.ones(32, 256, 256, 256).to(from_device)

        for non_blocking in (True, False):
            print(f"\033[94mfrom {from_device} to {to_device} non_blocking={non_blocking}\033[0m")
            for i in range(3):
                print(f"\033[96m{i}\033[0m")
                # Reduce the large tensors to a scalar and transfer it, optionally non-blocking.
                t = (x.max() + y.max()).to(torch.device(to_device), non_blocking=non_blocking)
                print("before waiting 2s:", t)
                # Snapshot the value now, then give torch time to finish any pending copy.
                old_t = t.clone()
                time.sleep(2.0)
                print("after waiting 2s:", t)
                if old_t != t:
                    print("\033[91minconsistency!\033[0m")

Related Issues / Discussions

Closes #6613 ([bug]: VRAM not being released)

QA Instructions

I have tried the combinations of models below and had no issues. I don't expect this change to alter behaviour; it's a straight revert of #6490 and #6549.

SDXL

  • ControlNet
  • LoRA
  • TI
  • IP Adapter

SD1.5

  • ControlNet
  • LoRA
  • TI
  • IP Adapter

Merge Plan

We'll do a bugfix release with this once merged.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)

@github-actions bot added the python (PRs that change python files) and backend (PRs that change backend files) labels on Jul 15, 2024
@RyanJDick (Collaborator) left a comment

Thank you for the thorough investigation.

The changes look good to me. I double-checked that there are no remaining references to non_blocking, and ran a quick smoke test.

@psychedelicious merged commit 3834391 into main Jul 15, 2024
14 checks passed
@psychedelicious deleted the psyche/fix/backend/revert-non-blocking branch July 15, 2024 22:59