Redo custom attention processor to support other attention types #6550

Open

StAlKeR7779 wants to merge 27 commits into main

Conversation

@StAlKeR7779 (Contributor) commented Jun 27, 2024

Summary

The current custom attention processor implements only the torch-sdp attention type, so whenever an IP-Adapter or a regional prompt is used, we override the model to run torch-sdp attention.
The new attention processor combines the four attention processors (normal, sliced, xformers, torch-sdp) by moving the parts that differ between them (mask preparation and the attention computation itself) into a separate function call, where the required implementation is executed.
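
As an illustration of the approach (not the PR's actual code), here is a minimal sketch in which the implementation-specific step is factored into a single method. Class name, argument names, and tensor layout are assumptions:

```python
import torch
import torch.nn.functional as F


class CombinedAttnProcessor:
    """Minimal sketch: one processor, with only the step that differs between
    attention types factored into a single method. Tensors are assumed to have
    shape (batch, heads, seq_len, head_dim)."""

    def __init__(self, attention_type: str = "torch-sdp", slice_size: int = 4):
        self.attention_type = attention_type
        self.slice_size = slice_size  # number of heads processed per slice

    def run_attention(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        attn_mask: torch.Tensor | None = None,
    ) -> torch.Tensor:
        if self.attention_type == "torch-sdp":
            return F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask)

        if self.attention_type == "sliced":
            # Process a few heads at a time to bound the size of the attention matrix.
            scale = query.shape[-1] ** -0.5
            out = torch.empty_like(query)
            for start in range(0, query.shape[1], self.slice_size):
                end = min(start + self.slice_size, query.shape[1])
                scores = query[:, start:end] @ key[:, start:end].transpose(-1, -2) * scale
                if attn_mask is not None:
                    scores = scores + attn_mask  # mask assumed broadcastable over heads
                out[:, start:end] = scores.softmax(dim=-1) @ value[:, start:end]
            return out

        raise ValueError(f"Unknown attention type: {self.attention_type}")
```

The real processor in this PR also handles IP-Adapter and regional-prompt masking; the point of the sketch is only that the per-backend logic lives behind one call.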

Related Issues / Discussions

None

QA Instructions

Change attention_type in invokeai.yaml, then run a generation with an IP-Adapter or a regional prompt.
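
For example, a change like the following in invokeai.yaml (these are the existing attention settings; the exact set of allowed values depends on the final state of this PR):

```yaml
# invokeai.yaml
attention_type: sliced          # or torch-sdp
attention_slice_size: balanced  # only used by sliced attention
```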

Merge Plan

None?

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)

@dunkeroni @RyanJDick

@github-actions bot added the python and backend labels Jun 27, 2024
@StAlKeR7779 marked this pull request as ready for review June 27, 2024 14:02
@RyanJDick (Collaborator)

I haven't looked at the code yet, but do you know if there are still use cases for attention processors other than Torch 2.0 SDP? Based on the benchmarking that diffusers has done, it seems like the all-around best choice. But maybe there are still reasons to use other implementations, e.g. on very-low-VRAM systems?

@StAlKeR7779 (Contributor, Author)

I thought roughly the same:
normal - generally no need for it
xformers - if torch-sdp is on par or even faster, then it too can be removed
sliced - yes, it's suitable for low-memory situations, and I think it's the main attention type for MPS

@psychedelicious (Collaborator)

On CUDA, torch's SDP was faster than xformers for me when I last checked a month or so back. IIRC it was just a couple % faster.

@RyanJDick (Collaborator)

I thought about this some more, and I'm hesitant to proceed with trying to merge this until we have more clarity around which attention implementations we actually want to support.

Right now, we have _adjust_memory_efficient_attention, which tries to configure attention based on the config and the system properties. The logic in this function is outdated, and I think there has been hesitation to change it out of fear of causing a regression on some systems. Let's get to the bottom of this, before deciding how to proceed with this PR.

My current guess is that supporting just torch SDP and sliced attention would cover all use cases, but we need to do some testing to determine whether that's accurate.

A few data points to consider:

@StAlKeR7779 do you want to look into this?

@github-actions bot added the invocations label Jul 27, 2024
@StAlKeR7779 (Contributor, Author) commented Jul 28, 2024

@RyanJDick OK, I removed the normal and xformers attention types.
But some parts are related to the frontend, so I hope @psychedelicious will look at them.
The other question is what to do with the config. I think old configs can be migrated to convert the normal/xformers values, but I'm not familiar with this part and will look at it more closely later. Maybe @lstein can suggest how to do it.
Update: I've already added a config migration, but I welcome feedback if I've done something wrong in it.
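
For reference, a minimal sketch of the kind of migration described here; the field name and mapping are assumptions, not the actual InvokeAI migration code:

```python
def migrate_attention_type(config: dict) -> dict:
    """Map attention types removed by this PR onto torch-sdp."""
    migrated = dict(config)
    if migrated.get("attention_type") in ("normal", "xformers"):
        migrated["attention_type"] = "torch-sdp"
    return migrated
```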

@github-actions bot added the services label Jul 28, 2024
@StAlKeR7779 requested a review from ebr as a code owner July 29, 2024 11:02
@github-actions bot added the docker, Root, installer, and python-deps labels Jul 29, 2024
@psychedelicious (Collaborator) left a comment

Did some smoke tests covering various permutations of attention type (normal vs torch-sdp) and slice size configs (none, max, balanced, 1, 3, 6) - all good.

@psychedelicious (Collaborator)

Please run scripts/update_config_docstring.py to update the config docstrings.

@RyanJDick (Collaborator) left a comment

Looks good to me. I just did a basic smoke test - looks like others have done more rigorous testing.

A few minor things:

  • Delete _ignore_xformers_triton_message_on_windows
  • Delete logging.getLogger("xformers").addFilter(lambda record: "A matching Triton is not available" not in record.getMessage())
  • Remove xformers instructions from 020_INSTALL_MANUAL.md

@github-actions bot added the docs label Aug 6, 2024
@ebr (Member) left a comment

Only requesting changes here to ensure we don't merge this before testing on a couple of older GPUs. Will test it ASAP.

@ebr (Member) commented Aug 7, 2024

How critical is it to remove xformers completely vs leaving it as an option?

torch-sdp is 4.8x slower on older-generation Pascal GPUs, like the P40, P100, or the Nvidia 10xx cards. For SDXL, this means 8.8 vs 1.8 seconds/iter.

Other than that, torch-sdp is slightly faster on Ampere and Turing. Likely also on Ada, but I haven't tested that one.

@StAlKeR7779 (Contributor, Author) commented Aug 7, 2024

As I said, it costs nothing to keep supporting it in the code. But it looked like most people thought it should be removed, so I removed it.
I can easily bring it back.
Should we select the attention type based on CUDA compute capability, so that from 7.5 onwards the default is torch-sdp and for older GPUs the default is xformers (if available)?

@hipsterusername (Member)

That seems reasonable, with a configurable override if the user wants to force one.

It does look like we should add it back.

@ebr (Member) commented Aug 7, 2024

The PyTorch blog says Flash Attention is supported from sm80 compute capability onwards: https://pytorch.org/blog/accelerated-pytorch-2/, so perhaps we should default to xformers for anything lower than that. (As @hipsterusername said, with a configurable override).
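
A minimal sketch of the selection logic being discussed (the threshold and the config values are assumptions, and a user-facing override would take precedence):

```python
import importlib.util

import torch


def default_attention_type(device_index: int = 0) -> str:
    """Pick a default attention backend from the GPU's compute capability."""
    if not torch.cuda.is_available():
        return "torch-sdp"
    major, minor = torch.cuda.get_device_capability(device_index)
    # Per the PyTorch 2.0 blog post, flash attention requires sm80+.
    if (major, minor) >= (8, 0):
        return "torch-sdp"
    # Older GPUs (e.g. Pascal): prefer xformers if it is installed.
    if importlib.util.find_spec("xformers") is not None:
        return "xformers"
    return "torch-sdp"
```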

@StAlKeR7779 requested a review from ebr August 7, 2024 18:24
@RyanJDick (Collaborator)

Not for this PR, but I did some performance testing and we'll probably want to address this at some point:

SDXL:

>>> Time taken to prepare attention processors: 0.10069823265075684s
>>> Time taken to prepare attention processors: 0.07877492904663086s
>>> Time taken to set attention processors: 0.1278061866760254s
>>> Time taken to reset attention processors: 0.13225793838500977s

Code used to measure:

    # Excerpted method: in context this needs `import time` and
    # `from contextlib import contextmanager` at module level.
    @contextmanager
    def apply_custom_attention(self, unet: UNet2DConditionModel):
        """A context manager that patches `unet` with CustomAttnProcessor2_0 attention layers."""
        start = time.time()
        attn_procs = self._prepare_attention_processors(unet)
        time_1 = time.time()
        print(f">>> Time taken to prepare attention processors: {time_1 - start}s")
        orig_attn_processors = unet.attn_processors
        time_2 = time.time()
        print(f">>> Time taken to prepare attention processors: {time_2 - time_1}s")

        try:
            # Note to future devs: set_attn_processor(...) does something slightly unexpected - it pops elements from
            # the passed dict. So, if you wanted to keep the dict for future use, you'd have to make a
            # moderately-shallow copy of it. E.g. `attn_procs_copy = {k: v for k, v in attn_procs.items()}`.
            unet.set_attn_processor(attn_procs)
            time_3 = time.time()
            print(f">>> Time taken to set attention processors: {time_3 - time_2}s")
            yield None
        finally:
            time_4 = time.time()
            unet.set_attn_processor(orig_attn_processors)
            time_5 = time.time()
            print(f">>> Time taken to reset attention processors: {time_5 - time_4}s")

@ebr (Member) left a comment

Tested after changes, seeing expected performance increases on Ampere and no performance degradation on Pascal. LGTM!!

@RyanJDick (Collaborator) left a comment

I think this is just about good-to-go. There are a couple minor requests for docs, but otherwise the code looks good to me. (I haven't tested all cases myself, but it sounds like others have.)
