ComfyUI reference implementation for IPAdapter models.
IPAdapter implementation that follows the ComfyUI way of doing things. The code is memory efficient, fast, and shouldn't break with Comfy updates.
2023/12/22: Added support for FaceID models. Read the documentation for details.
2023/12/05: Added `batch embeds` node. This lets you encode images in batches and merge them together into an `IPAdapter Apply Encoded` node. Useful mostly for animations because the clip vision encoder takes a lot of VRAM. My suggestion is to split the animation in batches of about 120 frames.
2023/11/29: Added `unfold_batch` option to send the reference images sequentially to a latent batch. Useful for animations.
2023/11/26: Added timestepping. You may need to delete the old nodes and recreate them. Important: For this to work you need to update ComfyUI to the latest version.
2023/11/24: Support for multiple attention masks.
2023/11/23: Small but important update: the new default location for the IPAdapter models is `ComfyUI/models/ipadapter`. No panic: the legacy `ComfyUI/custom_nodes/ComfyUI_IPAdapter_plus/models` location still works and nothing will break.
2023/11/08: Added attention masking.
2023/11/07: Added three ways to apply the weight. See below for more info. This might break things! Please let me know if you are having issues. When loading an old workflow, try reloading the page a couple of times, or delete the `IPAdapter Apply` node and insert a new one.
2023/11/02: Added compatibility with the new models in safetensors format (available on huggingface).
(previous updates removed for better readability)
The IPAdapter models are very powerful tools for image-to-image conditioning. Given a reference image you can generate variations augmented by text prompts, ControlNets and masks. Think of it as a 1-image lora.
The example directory has many workflows that cover all IPAdapter functionalities.
Basic usage video
Download or git clone this repository inside the `ComfyUI/custom_nodes/` directory, or use the Manager. Beware that the Manager's automatic update sometimes doesn't work and you may need to upgrade manually.
The pre-trained models are available on huggingface. Download them and place them in the `ComfyUI/models/ipadapter` directory (create it if not present). You can also use any custom location by setting an `ipadapter` entry in the `extra_model_paths.yaml` file.
IPAdapter also needs the image encoders. You need the CLIP-ViT-H-14-laion2B-s32B-b79K and CLIP-ViT-bigG-14-laion2B-39B-b160k image encoders; you may already have them. If you don't, download them, but be careful because the file name is the same! Rename them to something easy to remember and place them in the `ComfyUI/models/clip_vision/` directory.
The following table shows the combination of checkpoint and image encoder to use for each IPAdapter model. Any tensor size error you get is likely caused by a wrong combination.
SD v. | IPAdapter | Img encoder | Notes |
---|---|---|---|
v1.5 | ip-adapter_sd15 | ViT-H | Basic model, average strength |
v1.5 | ip-adapter_sd15_light | ViT-H | Light model, very light impact |
v1.5 | ip-adapter-plus_sd15 | ViT-H | Plus model, very strong |
v1.5 | ip-adapter-plus-face_sd15 | ViT-H | Face model, use only for faces |
v1.5 | ip-adapter-full-face_sd15 | ViT-H | Stronger face model, not necessarily better |
v1.5 | ip-adapter_sd15_vit-G | ViT-bigG | Base model trained with a bigG encoder |
SDXL | ip-adapter_sdxl | ViT-bigG | Base SDXL model, mostly deprecated |
SDXL | ip-adapter_sdxl_vit-h | ViT-H | New base SDXL model |
SDXL | ip-adapter-plus_sdxl_vit-h | ViT-H | SDXL plus model, stronger |
SDXL | ip-adapter-plus-face_sdxl_vit-h | ViT-H | SDXL face model |
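
The table above can also be written as a small lookup, which may help when checking a workflow before chasing a tensor size error. This is only a sketch: the dictionary is copied from the table, while `COMPATIBILITY` and `check_combination` are hypothetical names that are not part of this extension.

```python
# IPAdapter model -> (Stable Diffusion version, image encoder), copied from the table.
COMPATIBILITY = {
    "ip-adapter_sd15": ("v1.5", "ViT-H"),
    "ip-adapter_sd15_light": ("v1.5", "ViT-H"),
    "ip-adapter-plus_sd15": ("v1.5", "ViT-H"),
    "ip-adapter-plus-face_sd15": ("v1.5", "ViT-H"),
    "ip-adapter-full-face_sd15": ("v1.5", "ViT-H"),
    "ip-adapter_sd15_vit-G": ("v1.5", "ViT-bigG"),
    "ip-adapter_sdxl": ("SDXL", "ViT-bigG"),
    "ip-adapter_sdxl_vit-h": ("SDXL", "ViT-H"),
    "ip-adapter-plus_sdxl_vit-h": ("SDXL", "ViT-H"),
    "ip-adapter-plus-face_sdxl_vit-h": ("SDXL", "ViT-H"),
}


def check_combination(ipadapter, sd_version, encoder):
    """Return a list of mismatch messages; an empty list means the table is satisfied."""
    if ipadapter not in COMPATIBILITY:
        return [f"unknown IPAdapter model: {ipadapter}"]
    expected_sd, expected_encoder = COMPATIBILITY[ipadapter]
    problems = []
    if sd_version != expected_sd:
        problems.append(f"{ipadapter} expects an {expected_sd} checkpoint, got {sd_version}")
    if encoder != expected_encoder:
        problems.append(f"{ipadapter} expects the {expected_encoder} encoder, got {encoder}")
    return problems


if __name__ == "__main__":
    # The older SDXL base model needs ViT-bigG, so this reports one problem.
    for message in check_combination("ip-adapter_sdxl", "SDXL", "ViT-H"):
        print(message)
```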
FaceID requires `insightface` and `onnxruntime`. You need to install them in your ComfyUI environment with `pip`; it's also a good idea to try to upgrade them with `pip install --upgrade ...`.
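A quick way to confirm that both dependencies are visible is to ask Python whether the modules can be found. This is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and it must be run with the same Python interpreter that ComfyUI uses, otherwise the answer is meaningless.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["insightface", "onnxruntime"])
    if missing:
        print("Missing in this environment:", ", ".join(missing))
    else:
        print("insightface and onnxruntime are both available.")
```

`find_spec` only locates the module without importing it, so the check stays fast and does not fail on a broken CUDA setup.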
When the dependencies are satisfied you need:
- The main SD1.5 model to be placed into the ipadapter models directory.
- The Lora to be placed into the `ComfyUI/models/loras/` directory.
There is no SDXL model at the moment.
There's a basic workflow included in this repo and a few examples in the examples directory. Usually it's a good idea to lower the `weight` to `0.8` or less.
The `noise` parameter is an experimental exploitation of the IPAdapter models. You can set it as low as `0.01` for an arguably better result.
More info about the noise option
Basically the IPAdapter sends two pictures for the conditioning: one is the reference, the other (which you don't see) is an empty image that can be thought of as a negative conditioning.
What I'm doing is sending a very noisy image instead of an empty one. The `noise` parameter determines the amount of noise that is added. A value of `0.01` adds a lot of noise (more noise == less impact because the model doesn't get it); a value of `1.0` removes most of the noise so the generated image gets conditioned more.
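The inverse relation between the `noise` value and the amount of noise can be illustrated with a toy example. This is not the code used by the extension: the linear mapping, the `max_sigma` and `min_sigma` defaults and both function names are assumptions made up for this sketch. It only shows the direction of the effect, where a low value means a very noisy negative image.

```python
import random


def noise_strength(noise, max_sigma=0.3, min_sigma=0.05):
    """Map the `noise` option to a Gaussian standard deviation.

    Hypothetical linear mapping: noise=0.0 gives max_sigma, noise=1.0 gives min_sigma.
    """
    if not 0.0 <= noise <= 1.0:
        raise ValueError("noise must be between 0 and 1")
    return min_sigma + (max_sigma - min_sigma) * (1.0 - noise)


def noisy_negative(width, height, noise, seed=0):
    """Start from an empty (all-zero) grayscale image and add Gaussian noise per pixel."""
    rng = random.Random(seed)
    sigma = noise_strength(noise)
    return [[rng.gauss(0.0, sigma) for _ in range(width)] for _ in range(height)]


if __name__ == "__main__":
    for value in (0.01, 0.5, 1.0):
        print(f"noise={value}: sigma={noise_strength(value):.3f}")
```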
The reference image needs to be encoded by the CLIP vision model. The encoder resizes the image to 224×224 and crops it to the center! It's not an IPAdapter thing, it's how the clip vision works. This means that if you use a portrait or landscape image and the main subject (eg: the face of a character) is not in the middle, you'll likely get undesired results. Use square pictures as reference for more predictable results.
I've added a `PrepImageForClipVision` node that does all the required operations for you. You just have to select the crop position (top/left/center/etc...) and a sharpening amount if you want.
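To see what a center crop discards, and how choosing a crop position changes it, the crop box can be computed by hand. This is a minimal sketch of the idea, not the node's implementation: `square_crop_box` and its position names are hypothetical, and the result would still need to be resized to 224×224 afterwards.

```python
def square_crop_box(width, height, position="center"):
    """Return (left, top, right, bottom) of the largest square crop at `position`."""
    allowed = ("center", "top", "bottom", "left", "right")
    if position not in allowed:
        raise ValueError(f"position must be one of {allowed}")
    side = min(width, height)
    # Fraction of the spare pixels to skip on each axis (0.5 keeps the crop centered).
    horizontal = {"left": 0.0, "right": 1.0}.get(position, 0.5)
    vertical = {"top": 0.0, "bottom": 1.0}.get(position, 0.5)
    left = int((width - side) * horizontal)
    top = int((height - side) * vertical)
    return (left, top, left + side, top + side)


if __name__ == "__main__":
    # A 512x768 portrait: the center crop drops 128 rows at the top and at the bottom,
    # which can cut off a face placed in the upper part of the picture.
    print(square_crop_box(512, 768, "center"))
    print(square_crop_box(512, 768, "top"))
```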
In the image below you can see the difference between prepped and not prepped images.
The IPAdapter generally requires a few more `steps` than usual. If the result is underwhelming, try adding 10+ steps. The model tends to burn the images a little; if needed, lower the CFG scale.
The `noise` option generally grants better results, experiment with it.
The model is very effective when paired with a ControlNet. In the example below I experimented with Canny. The workflow is in the examples directory.
IPAdapter offers an interesting model for a kind of "face swap" effect. The workflow is provided. Set a close up face as reference image and then input your text prompt as always. The generated character should have the face of the reference. It also works with img2img given a high denoise.
Note: there's a new `full-face` model available that's arguably better.
The most effective way to apply the IPAdapter to a region is with an inpainting workflow. Remember to use a specific checkpoint for inpainting, otherwise it won't work. Even if you are inpainting a face, I find that the IPAdapter-Plus (not the face one) works best.
It is possible to pass multiple images for the conditioning with the `Batch Images` node. An example workflow is provided; in the picture below you can see the result of one and two images conditioning.
It seems to be effective with 2-3 images, beyond that it tends to blur the information too much.
When sending multiple images you can increase/decrease the weight of each image by using the `IPAdapterEncoder` node. The workflow (included in the examples) looks like this:
The node accepts 4 images, but remember that you can send batches of images to each slot.
You can choose how the IPAdapter weight is applied to the image embeds. Options are:
- original: The weight is applied to the aggregated tensors. The weight works predictably for values greater and lower than 1.
- linear: The weight is applied to the individual tensors before aggregating them. Compared to `original` the influence is weaker when weight is <1 and stronger when >1. Note: at weight 1 the two methods are equivalent.
- channel penalty: This method is a modified version of Lvmin Zhang's (Fooocus). Results are sometimes sharper. It also works very well when weight is >1. Still experimental, may change in the future.
The image below shows the difference (zoom in).
In the examples directory you can find a workflow that lets you easily compare the three methods.
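The difference between `original` and `linear` can be sketched with a toy attention step. This is a simplified illustration, not the extension's code: it assumes that "aggregating" refers to the attention step that combines the image tokens, and `attention`, `apply_original` and `apply_linear` are hypothetical names. The `channel penalty` method is not covered.

```python
import math


def attention(query, keys, values):
    """Single-query scaled dot-product attention over plain Python lists."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [sum(p * value[i] for p, value in zip(probs, values))
            for i in range(len(values[0]))]


def apply_original(query, keys, values, weight):
    """Scale the aggregated (post-attention) result by the weight."""
    return [weight * x for x in attention(query, keys, values)]


def apply_linear(query, keys, values, weight):
    """Scale the individual key/value tensors first, then aggregate them."""
    scaled_keys = [[weight * k for k in key] for key in keys]
    scaled_values = [[weight * v for v in value] for value in values]
    return attention(query, scaled_keys, scaled_values)


if __name__ == "__main__":
    query = [1.0, 0.5]
    keys = [[1.0, 0.0], [0.0, 1.0]]
    values = [[1.0, 2.0], [3.0, 4.0]]
    for weight in (0.5, 1.0, 1.5):
        print(weight,
              apply_original(query, keys, values, weight),
              apply_linear(query, keys, values, weight))
```

In this toy version the two functions return the same values at weight 1 and diverge as the weight moves away from it, because scaling the keys also changes how the attention is distributed across the tokens.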
Note: I'm still not sure whether all methods will stay. `Linear` seems the most sensible but I wanted to keep the `original` for backward compatibility. `channel penalty` has a weird non-commercial clause but it's still part of a GNU GPLv3 software (ie: there's a licensing clash) so I'm trying to understand how to deal with that.
It's possible to add a mask to define the area where the IPAdapter will be applied. Everything outside the mask will ignore the reference images and will only listen to the text prompt.
It is suggested to use a mask of the same size as the final generated image.
In the picture below I use two reference images masked one on the left and the other on the right. The image is generated only with IPAdapter and one ksampler (without in/outpainting or area conditioning).
It is also possible to send a batch of masks that will be applied to a batch of latents, one per frame. The size should be the same but if needed some normalization will be performed to avoid errors. This feature also supports (experimentally) AnimateDiff including context sliding.
In the examples directory you'll find a couple of masking workflows: simple and two masks.
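Conceptually, each mask gates where a reference image is allowed to contribute. The sketch below shows this with two masks, one per half of the picture, on tiny 2D grids. It is an illustration under simplifying assumptions, not the extension's code: `half_mask` and `combine` are hypothetical helpers, and the real masking happens on tensors inside the model.

```python
def half_mask(width, height, side):
    """Build a mask that is 1.0 on the chosen half ("left" or "right") and 0.0 elsewhere."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    middle = width // 2
    row = [1.0 if (x < middle) == (side == "left") else 0.0 for x in range(width)]
    return [row[:] for _ in range(height)]


def combine(base, contributions):
    """Add each masked contribution to the text-only result.

    base: 2D grid; contributions: list of (mask, delta) pairs of the same size.
    Where every mask is 0.0 the base (text prompt only) is left untouched.
    """
    result = [row[:] for row in base]
    for mask, delta in contributions:
        for y, row in enumerate(result):
            for x in range(len(row)):
                row[x] += mask[y][x] * delta[y][x]
    return result


if __name__ == "__main__":
    width, height = 4, 2
    base = [[0.0] * width for _ in range(height)]
    reference_a = [[10.0] * width for _ in range(height)]
    reference_b = [[20.0] * width for _ in range(height)]
    mixed = combine(base, [
        (half_mask(width, height, "left"), reference_a),
        (half_mask(width, height, "right"), reference_b),
    ])
    for row in mixed:
        print(row)
```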
In the `Apply IPAdapter` node you can set a start and an end point. The IPAdapter will be applied exclusively in that timeframe of the generation. This is a very powerful tool to modulate the intensity of IPAdapter models.
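The effect of a start and an end point can be pictured as a window over the sampling steps. This is only a sketch: it assumes both points are fractions of the whole generation between 0 and 1, and `ipadapter_active`, `start_at` and `end_at` are names chosen for this example. The node may map the points to the sampler schedule differently.

```python
def ipadapter_active(step, total_steps, start_at=0.0, end_at=1.0):
    """Return True when the 0-based `step` falls inside the [start_at, end_at) window."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 <= start_at <= end_at <= 1.0:
        raise ValueError("expected 0 <= start_at <= end_at <= 1")
    progress = step / total_steps
    return start_at <= progress < end_at


if __name__ == "__main__":
    total = 20
    # Apply the reference only during the first half of the generation.
    active = [step for step in range(total) if ipadapter_active(step, total, 0.0, 0.5)]
    print(active)
```

Limiting the window to the early steps lets the reference shape the composition while the text prompt refines the details afterwards; a late window does the opposite.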
FaceID is a new IPAdapter model that takes the embeddings from InsightFace. As such you need to install `insightface` in your ComfyUI python environment. You may also need `onnxruntime` and `onnxruntime-gpu`. Note that your CUDA version might not be compatible with onnxruntime; in that case you can select the "CPU" provider from the `Load InsightFace model` node.
The first time you use InsightFace the model will be downloaded automatically; check the console to see the progress. If you get an error you need to download the buffalo_l model manually inside the `ComfyUI/models/insightface/models` directory. Also, every time you run the workflow for the first time, InsightFace will take quite a few seconds to load.
The FaceID model is used in conjunction with its Lora! Check the installation instructions for the links to all models.
The reference image needs to be prepared differently compared to the other IPAdapter face models. While standard face models expect the face to take basically the whole frame, FaceID prefers the subject to be a little further away. Don't cut the face too close and leave hair, beard, ears, neck in the picture.
InsightFace will often fail to detect the face and it will throw an error. Try with a different picture possibly cut to half-bust. FaceID generally works with drawings/illustrations too and the result is often very nice.
I just implemented the FaceID code so I don't have best practices yet and more testing is needed. It's important to understand that FaceID can (and should) be used as a first pass for an additional IPAdapter Face model.
In the examples directory you'll find a few workflows to get you started with FaceID.
The following would be a basic workflow that includes FaceID enhanced by a Plus Face model.
Please check the troubleshooting before posting a new issue.
If you are interested I've also implemented the same features for Huggingface Diffusers.
Let me know if you spot the IPAdapter in the wild or tag @latentvision in the video description!
- For German speakers you can find interesting YouTube tutorials on A Latent Place.
- In Chinese Introversify
- Scott Detweiler covered this extension.