[Bug] Adapter cannot be applied directly as described on paper #25

Open
huchenlei opened this issue Feb 28, 2024 · 12 comments

@huchenlei
huchenlei commented Feb 28, 2024

In the paper, the authors show that X-Adapter can be applied directly to an SDXL checkpoint to denoise pure latent noise while applying pluggable control conditions (ControlNet, LoRA, etc.).
image

However, in my testing this does not hold. If I remove the SD1.5 pass, the SDXL model with an SD1.5 ControlNet applied via the adapter cannot produce an image that follows the control constraint.

Steps to reproduce

image

Is the adapter really making a difference?

I suspect that X-Adapter is not actually transferring the control of the SD1.5 ControlNet model to SDXL. If we start at T0 with the SD1.5 ControlNet already applied for some steps, then as long as the SDXL model does not make too many extra modifications, the result should not deviate much from the T0 result.

Here I compare the results of two scenarios:

No Adapter

Add up_block_additional_residual = None after https://github.com/kijai/ComfyUI-Diffusers-X-Adapter/blob/81b77e441c2dd5566acf5025fa9cadfb8a574490/pipeline/pipeline_sd_xl_adapter_controlnet.py#L1166

Essentially, this generates with ControlNet using the SD1.5 model and then refines the result with the SDXL model.
image
image
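To make the "No Adapter" ablation concrete, here is a minimal toy sketch of what setting `up_block_additional_residual = None` is taken to mean. It assumes the adapter output is added to the SDXL up-block hidden states; that additive form, the function name `apply_up_block_residual`, and the numbers are all illustrative assumptions, not the pipeline's actual code.

```python
def apply_up_block_residual(hidden, additional_residual=None, scale=1.0):
    """Toy stand-in for one SDXL up block receiving an adapter residual.

    Hypothetical helper: the additive combination is an assumption.
    """
    if additional_residual is None:
        # Ablation case: no adapter guidance reaches SDXL,
        # so the hidden states pass through unchanged.
        return list(hidden)
    return [h + scale * r for h, r in zip(hidden, additional_residual)]


sdxl_hidden = [0.2, -0.1, 0.4]     # made-up SDXL features
adapter_out = [0.05, 0.05, -0.10]  # made-up adapter residual

without_adapter = apply_up_block_residual(sdxl_hidden, None)
with_adapter = apply_up_block_residual(sdxl_hidden, adapter_out)
```

Under this assumption, the "No Adapter" run differs from the "With Adapter" run only by the residual term, which is why the two sets of images can be compared directly.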

With Adapter

image
image

I do not observe a clear improvement from X-Adapter.

@Semolce9
Collaborator

Hi. I tested the canny ControlNet using the following bash script:

python inference.py --plugin_type "controlnet" \
--prompt "A cute cat, high quality, extremely detailed" \
--condition_type "canny" \
--input_image_path "./assets/CuteCat.jpeg" \
--controlnet_condition_scale_list 1.5 \
--adapter_guidance_start_list 1.00 \
--adapter_condition_scale_list 1.00

I set adapter_guidance_start to 1.00, which means that only SDXL denoises pure noise under X-Adapter's guidance. Given the condition:
A cute cat_canny_condition
I get the result:
A cute cat_0_ccs_1 50_ags_1 00_acs_1 00
This result is consistent with our ablation study (Fig. 10). We also show a suboptimal result in Fig. 10, which indicates that we need both the initial latents from SD1.5 and guidance from X-Adapter.

@huchenlei
Author

In the code, you set the number of SD1.5 steps to the total number of inference steps here:
https://github.com/kijai/ComfyUI-Diffusers-X-Adapter/blob/81b77e441c2dd5566acf5025fa9cadfb8a574490/pipeline/pipeline_sd_xl_adapter_controlnet.py#L897-L901

So no matter what adapter_guidance_start value you set, the SD1.5 pass always runs the full number of inference steps.

Setting adapter_guidance_start = 1.0
image

Setting adapter_guidance_start = 0.5
image
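The step-count observation can be illustrated with a toy planner. This is a hypothetical sketch, not the pipeline's code: `plan_two_stage` and its rounding rule are assumptions, chosen only to show an SD1.5 pass whose length does not depend on `adapter_guidance_start`, while the SDXL pass is assumed to scale with it.

```python
def plan_two_stage(num_inference_steps, adapter_guidance_start):
    """Return assumed step counts for the SD1.5 pass and the SDXL pass."""
    if not 0.0 <= adapter_guidance_start <= 1.0:
        raise ValueError("adapter_guidance_start must be in [0, 1]")
    # As reported above: SD1.5 always runs the full schedule.
    sd15_steps = num_inference_steps
    # Assumption: SDXL runs the last fraction of the schedule given by
    # adapter_guidance_start (1.0 means SDXL starts from the noisiest step).
    sdxl_steps = round(num_inference_steps * adapter_guidance_start)
    return {"sd15_steps": sd15_steps, "sdxl_steps": sdxl_steps}


full = plan_two_stage(50, 1.0)
half = plan_two_stage(50, 0.5)
```

In this sketch both settings spend the same 50 steps in SD1.5, so neither of them removes the SD1.5 pass.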

@huchenlei
Author

Also, I would like to see an explanation of why the result's quality does not change much with or without the adapter.

@Semolce9
Collaborator

Semolce9 commented Feb 28, 2024

> In the code you are setting sd15 steps to be total num of inference steps here: https://github.com/kijai/ComfyUI-Diffusers-X-Adapter/blob/81b77e441c2dd5566acf5025fa9cadfb8a574490/pipeline/pipeline_sd_xl_adapter_controlnet.py#L897-L901
>
> So no matter what adapter_guidance_start value you set, sd15 pass will always run full inference steps.
>
> Setting adapter_guidance_start = 1.0 image
>
> Setting adapter_guidance_start = 0.5 image

Yes, and I add noise to SD1.5's output up to timestep 50, which is pure noise. In other words, I denoise for 50 steps and then noise the result back 50 steps.
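For reference, the "noise the result back" step corresponds to the standard forward-diffusion formula x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The sketch below is only an illustration under stated assumptions: it uses the scaled-linear beta schedule commonly used with Stable Diffusion (beta_start=0.00085, beta_end=0.012, 1000 training steps), and whether the pipeline discussed here uses exactly that schedule is an assumption. The helper names are hypothetical.

```python
import math


def alphas_cumprod(num_train_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative product of (1 - beta_t) for a scaled-linear beta schedule."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    out, running = [], 1.0
    for t in range(num_train_steps):
        beta = (lo + (hi - lo) * t / (num_train_steps - 1)) ** 2
        running *= 1.0 - beta
        out.append(running)
    return out


def add_noise(x0, eps, alpha_bar):
    """Forward diffusion: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [s * a + n * b for a, b in zip(x0, eps)]


acp = alphas_cumprod()
signal_coeff = math.sqrt(acp[-1])       # weight on the clean SD1.5 latent
noise_coeff = math.sqrt(1.0 - acp[-1])  # weight on the fresh noise
print(f"signal={signal_coeff:.4f} noise={noise_coeff:.4f}")
```

Under this assumed schedule the signal weight at the last timestep is small but not exactly zero, so a latent noised back to the final step is nearly, rather than strictly, pure noise.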

@huchenlei
Author

Can you confirm the result if you completely remove the SD1.5 pass code and use the initial latent directly?

Comment out the following code: https://github.com/kijai/ComfyUI-Diffusers-X-Adapter/blob/81b77e441c2dd5566acf5025fa9cadfb8a574490/pipeline/pipeline_sd_xl_adapter_controlnet.py#L993-L1078

@Semolce9
Collaborator

It is improper to remove this code directly: you would run prepare_xl_latents_from_sd_1_5 and sd1_5_add_noise without generating latents_sd1_5_prior. During training, the SD1.5 and SDXL latents are naturally aligned, so X-Adapter only works when the two latents are aligned, even if the alignment is weak (adapter_guidance_start=1.00).

I set up_block_additional_residual = None and adapter_guidance_start = 1.00. Given the condition:
A cute cat_canny_condition
result:
A cute cat_0_ccs_1 50_ags_1 00_acs_1 00
Using X-Adapter with controlnet_condition_scale=1.75, adapter_guidance_start=1.00, and adapter_condition_scale=1.00, I get the following result:
A cute cat_0_ccs_1 75_ags_1 00_acs_1 00

If your concern is "why do the authors mention that it can be used directly?", the answer is that we wanted to compare the two methods in our paper. The result shows that we do not recommend applying X-Adapter directly to SDXL, since this is not what X-Adapter was trained on.

@huchenlei
Author

I think that if the full SD1.5 pass is mandatory, the figure used in the paper is misleading. The figure shows the adapter being applied directly to random noise.

@Semolce9
Collaborator

I agree. Thank you for the suggestion. I will clarify it and upload a new version to arXiv.

@huchenlei
Author

I have dumped intermediate results in ComfyUI. It seems that either the ComfyUI result is broken or the implementation of the ControlNet pipeline is broken? kijai#15

adapter_guidance_start = 1.0
image

adapter_guidance_start = 0.5
image

Please confirm that these results can be reproduced. Also, are there any requirements on which checkpoints to use (SD1.5/SDXL)?

@Semolce9
Collaborator

If the images, from left to right and top to bottom, are SD1.5's output, SDXL's output, SD1.5's input, and SDXL's input, then the result is as expected. We see a stronger constraint with a lower adapter_guidance_start. The SD1.5 and SDXL inputs look like noise if you decode them to pixel space. If you want better results, please try different values of controlnet_condition_scale and adapter_condition_scale.

I cannot check all checkpoints since there are so many. I ran experiments using the official SD1.5 and SDXL checkpoints from Stability AI. I have seen some good cases using other checkpoints, such as https://twitter.com/ZHOZHO672070/status/1759516726125908361.
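Trying different scale values, as suggested above, can be scripted. The sketch below only builds command lines; it reuses the flags and example values from the bash script earlier in this thread and assumes each flag takes a single value per run. `build_commands` is a hypothetical helper, not part of the repository.

```python
import itertools
import shlex


def build_commands(controlnet_scales, adapter_scales, guidance_starts):
    """Build one inference.py command line per parameter combination."""
    commands = []
    for ccs, acs, ags in itertools.product(
        controlnet_scales, adapter_scales, guidance_starts
    ):
        argv = [
            "python", "inference.py",
            "--plugin_type", "controlnet",
            "--prompt", "A cute cat, high quality, extremely detailed",
            "--condition_type", "canny",
            "--input_image_path", "./assets/CuteCat.jpeg",
            "--controlnet_condition_scale_list", f"{ccs:.2f}",
            "--adapter_guidance_start_list", f"{ags:.2f}",
            "--adapter_condition_scale_list", f"{acs:.2f}",
        ]
        # shlex.join quotes arguments so the line is safe to paste in a shell.
        commands.append(shlex.join(argv))
    return commands


cmds = build_commands([1.5, 1.75], [1.0], [0.5, 1.0])
for cmd in cmds:
    print(cmd)
```

Printing the commands instead of running them keeps the sweep easy to inspect before spending GPU time on it.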

@light-and-ray

I think it's enough to know that it requires a full SD1.5 checkpoint. If so, it cannot be used as just an "adapter", like LoRAs, ControlNets, IP-Adapters, LCM LoRAs, etc. :(

@LastTargaryen

@light-and-ray, I have a strong feeling that this adapter is...unusable. It's a failed experiment. Should not have been released the way it has been. It's very misleading and gives the impression of being beneficial to the ML community and the SD-user community--but it's not. Not even for advanced users.
