
Prompt to prompt editing #7

Open
andreaferretti opened this issue Jan 25, 2023 · 3 comments

Comments

@andreaferretti

I am trying to understand to what extent your method replaces prompt-to-prompt. It seems to me that EDICT is a clever way to invert DDIM diffusion; if so, once we have the latents, we should be able to apply prompt-to-prompt editing techniques on top. Instead, what you propose is simply to run DDIM denoising conditioned on the target prompt to obtain the edited image.

It has been observed that (on generated images) prompt-to-prompt obtains more realistic and semantically meaningful edits. I would expect the technique to be readily applicable to a latent obtained by EDICT inversion (and the code seems to support it), but the paper does not mention this combination, and in fact setting use_p2p=True gives me inferior results.
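
For concreteness, this is roughly the comparison I have in mind (the helper names here are placeholders for illustration, not the actual entry points of this repo):

```python
def edit_without_p2p(image, source_prompt, target_prompt, invert, denoise):
    """Route A: EDICT inversion, then plain DDIM denoising on the target prompt."""
    # invert: image + source prompt -> coupled noise latents (x_T, y_T)
    x_T, y_T = invert(image, source_prompt)
    # denoise the same latents, but conditioned on the target prompt
    return denoise(x_T, y_T, prompt=target_prompt)


def edit_with_p2p(image, source_prompt, target_prompt, invert, denoise_p2p):
    """Route B: same inversion, but the reverse pass also constrains the
    cross-attention maps using a parallel source-prompt trajectory (P2P-style)."""
    x_T, y_T = invert(image, source_prompt)
    return denoise_p2p(x_T, y_T, source_prompt=source_prompt,
                       target_prompt=target_prompt)
```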

Do you have an explanation why using prompt-to-prompt is not beneficial?

@bram-w
Contributor

bram-w commented Mar 15, 2023

Hi, thanks for the question!

This surprised us too. I love P2P and thought it'd boost our results. We don't have a full explanation (we haven't focused on this a ton, but I've dug through the code to double-check that things are wired correctly), but what I typically see with EDICT+P2P is some combination of:

  1. The image remains overly faithful to the original

  2. The image becomes unrealistic

Point 2 is a bit easier to explain, imo. As we show in Figure 4 of the paper, the generative process can be sensitive to perturbations; that's why we need the averaging layers. It's fairly intuitive that putting another constraint on the process could mess things up.

The puzzling thing about point 1 is that P2P clearly works in something like null-text inversion, so it must be something EDICT-specific. One hypothesis is that combining the averaging layers with noise predictions that operate on the counterpart sequence (e.g. predicting x's update from y) dampens how much change can be made once the attention maps are constrained. It definitely makes the concept of self-attention more awkward.
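
For concreteness, the reverse step we're talking about looks roughly like this; it's a simplified sketch of the coupled update plus the averaging layers, not the exact code in this repo, and the names (eps_model, alphas_cumprod, cond) are stand-ins. p is the mixing coefficient (0.93 in the paper, if I recall correctly):

```python
import torch

@torch.no_grad()
def edict_reverse_step(x_t, y_t, t, t_prev, eps_model, cond, alphas_cumprod, p=0.93):
    """One EDICT denoising step (simplified sketch).

    eps_model(latent, t, cond) -> predicted noise (stand-in for the UNet call);
    alphas_cumprod is the DDIM alpha-bar schedule; p is the mixing coefficient.
    """
    a_t = (alphas_cumprod[t_prev] / alphas_cumprod[t]).sqrt()
    b_t = (1 - alphas_cumprod[t_prev]).sqrt() - a_t * (1 - alphas_cumprod[t]).sqrt()

    # Each sequence is denoised with the noise predicted from its counterpart,
    # which is what makes the whole chain exactly invertible.
    x_inter = a_t * x_t + b_t * eps_model(y_t, t, cond)
    y_inter = a_t * y_t + b_t * eps_model(x_inter, t, cond)

    # Averaging (mixing) layers keep the two sequences from drifting apart.
    x_prev = p * x_inter + (1 - p) * y_inter
    y_prev = p * y_inter + (1 - p) * x_prev
    return x_prev, y_prev
```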

It's possible that softening the hard locking of attention maps into a re-weighting, or being more selective about where they're applied (or customizing them to EDICT), could work; a rough sketch of what I mean is below. This is definitely an area we want to keep thinking about, so I'm curious if you have any further insight (experimental or otherwise). Happy to have follow-up discussions!
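
The controller interface here is made up for illustration (it's not the official prompt-to-prompt one): instead of hard-swapping the target's cross-attention maps with the source's, blend them, and only during an early fraction of the reverse process:

```python
import torch

class SoftAttentionControl:
    """Blend source and target cross-attention maps instead of replacing one
    with the other, and only during the first part of the reverse process.
    Hypothetical interface for illustration, not the official P2P controller."""

    def __init__(self, blend=0.5, inject_frac=0.6, num_steps=50):
        self.blend = blend                       # 1.0 would be P2P's hard lock
        self.stop_step = int(inject_frac * num_steps)
        self.step = 0

    def __call__(self, target_attn: torch.Tensor, source_attn: torch.Tensor) -> torch.Tensor:
        if self.step >= self.stop_step:
            # Past the injection window: let the target prompt's attention run free.
            return target_attn
        # Soft re-weighting instead of an outright replacement.
        return self.blend * source_attn + (1 - self.blend) * target_attn

    def next_step(self):
        self.step += 1
```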

@andreaferretti
Author

After some more experiments, I'm starting to find the P2P interface too restrictive for general use, so I'm not sure I would use it with EDICT even if it were available. Being able to supply just any target prompt is so much more convenient.

Anyway, I don't have a good explanation. I actually rewrote the P2P part to use the official Prompt-to-Prompt implementation, but I never got any good results with that.

@andreaferretti
Author

I am sorry, I can't; it is part of a proprietary codebase.
