This is a PyTorch-based reimplementation of CrossFlow, as proposed in
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
Qihao Liu | Xi Yin | Alan Yuille | Andrew Brown | Mannat Singh
[project page] | [paper] | [arxiv]
This repository provides a PyTorch-based reimplementation of CrossFlow for the text-to-image generation task, with the following differences compared to the original paper:
- Model Architecture: The original paper utilizes DiMR as the model architecture. In contrast, this codebase supports training and inference with both DiT (ICCV 2023, a widely adopted architecture) and DiMR (NeurIPS 2024, a state-of-the-art architecture).
- Dataset: The original model was trained on a proprietary dataset of 350M image-text pairs. In this implementation, the models are trained on open-source datasets, including LAION-400M and JourneyDB (4M).
- LLMs: The original 1B model only supports CLIP as the language model, whereas this implementation includes 1B models with CLIP and T5-XXL.
## To-do List

- Release inference code and the 512px CLIP DiMR-based model.
- Release training code and a detailed training tutorial (ETA: Dec 20).
- Release inference code for linear interpolation and arithmetic.
- Release all pretrained checkpoints (ETA: Dec 23).
- Update pretrained checkpoints (ETA: Dec 28).
- Provide a demo via Hugging Face Space and Colab.
## Requirements
The code has been tested with PyTorch 2.1.2 and CUDA 12.1. An example set of installation commands is provided below:
```bash
git clone git@github.com:qihao067/CrossFlow.git
cd CrossFlow

pip3 install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip3 install -U --pre triton
pip3 install -r requirements.txt
```
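Optionally, you can verify that the intended PyTorch build and CUDA runtime are active before downloading any checkpoints (a minimal sanity check, not part of the repository):

```python
import torch

print(torch.__version__)          # tested with 2.1.2
print(torch.version.cuda)         # tested with 12.1
print(torch.cuda.is_available())  # should be True for GPU sampling/training
```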
## Pretrained Models
To train or test the model, you will also need to download the VAE model from Stable Diffusion, and the reference statistics for zero-shot FID on the MSCOCO validation set. For your convenience, you can directly download all the models from here.
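If you prefer to fetch and test the Stable Diffusion VAE programmatically, the sketch below uses the `diffusers` library; the model id and API here are assumptions for illustration, and this repository may expect the VAE weights in its own format and loader:

```python
import torch
from diffusers import AutoencoderKL

# Assumption: the commonly used Stable Diffusion VAE from the Hugging Face Hub.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

with torch.no_grad():
    dummy = torch.randn(1, 3, 512, 512)              # image tensor scaled to [-1, 1]
    latent = vae.encode(dummy).latent_dist.sample()  # (1, 4, 64, 64) latent
    recon = vae.decode(latent).sample                # back to (1, 3, 512, 512)
```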
| Architecture | Resolution | LM | Download | Details |
|---|---|---|---|---|
| DiMR | 256x256 | CLIP | [t2i_256px_clip_dimr.pth] | Trained from scratch on LAION-400M for 1 epoch, then fine-tuned on JourneyDB for 10 epochs. |
| DiMR | 256x256 | T5-XXL | [t2i_256px_t5_dimr.pth] | Initialized with [t2i_256px_clip_dimr.pth] and fine-tuned on JourneyDB for 10 epochs. |
| DiMR | 512x512 | CLIP | [t2i_512px_clip_dimr.pth] | Initialized with [t2i_256px_clip_dimr.pth] and fine-tuned on JourneyDB for 10 epochs. (Model with the best text-image alignment*) |
| DiMR | 512x512 | T5-XXL | [t2i_512px_t5_dimr.pth] | Initialized with [t2i_512px_clip_dimr.pth] and fine-tuned on JourneyDB for 10 epochs. |
| DiT | 512x512 | T5-XXL | [t2i_512px_t5_dit.pth] | Initialized with [t2i_512px_clip_dimr.pth] and fine-tuned on JourneyDB for 10 epochs. |
*To save training time, all T5-XXL-based models are initialized with a CLIP-based model and fine-tuned on JourneyDB (4M) for ten epochs. As a result, these models may occasionally exhibit very minor text-image misalignments, which are not observed in the original paper's T5 models since they are trained from scratch.
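After downloading, you can quickly inspect a checkpoint before pointing the configs at it (a generic check; the key layout inside the `.pth` file is not guaranteed):

```python
import torch

ckpt = torch.load("path/to/t2i_512px_clip_dimr.pth", map_location="cpu")
# Depending on how the file was saved, it may be a raw state_dict or a wrapper dict.
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state_dict)} entries")
print(list(state_dict)[:5])  # peek at the first few parameter names
```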
## Text-to-Image Generation
You can sample from the pre-trained CrossFlow model with the `demo_t2i.py` script. Before running the script, download the appropriate checkpoint and configure hyperparameters such as the classifier-free guidance scale, random seed, and mini-batch size in the corresponding configuration files.

To accelerate sampling, the script supports multi-GPU inference. For example, to sample from the 512px CLIP DiMR-based CrossFlow model with `N` GPUs, you can use the following command, which generates `N x mini-batch size` images per run:

```bash
# to sample with a single GPU instead:
# accelerate launch --num_processes 1 --mixed_precision bf16 demo_t2i.py \
accelerate launch --multi_gpu --num_processes N --mixed_precision bf16 demo_t2i.py \
    --config=configs/t2i_512px_clip_dimr.py \
    --nnet_path=path/to/t2i_512px_clip_dimr.pth \
    --img_save_path=temp_saved_images \
    --prompt='your prompt'
```
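For reference, the classifier-free guidance scale set in the config typically blends conditional and unconditional predictions in the standard way shown below; this is the generic formulation, not necessarily the exact code path in this repository:

```python
import torch

def classifier_free_guidance(pred_uncond, pred_cond, guidance_scale):
    """Standard CFG mix: scale > 1 pushes samples toward the text condition."""
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

# Toy example with random stand-ins for the model's predictions.
p_u, p_c = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
guided = classifier_free_guidance(p_u, p_c, guidance_scale=7.0)
```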
## Linear Interpolation
Our model provides visually smooth interpolations in the latent space. Using the `demo_t2i_arith.py` script, images can be generated through linear interpolation between two input prompts with the following command:

```bash
accelerate launch --num_processes 1 --mixed_precision bf16 demo_t2i_arith.py \
    --config=configs/t2i_512px_clip_dimr.py \
    --nnet_path=path/to/t2i_512px_clip_dimr.pth \
    --img_save_path=temp_saved_images \
    --test_type=interpolation \
    --prompt_1='A dog cooking dinner in the kitchen' \
    --prompt_2='An orange cat wearing sunglasses on a ship' \
    --num_of_interpolation=40 \
    --save_gpu_memory
```
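Conceptually, each intermediate image comes from a convex combination of the two encoded prompts in the text latent space. The sketch below illustrates that mixing step only; the tensor shape and helper name are illustrative and not part of the repository's API:

```python
import torch

def interpolate_latents(z1, z2, num_of_interpolation):
    """Linearly interpolate between two text latents (illustrative helper)."""
    weights = torch.linspace(0.0, 1.0, num_of_interpolation)
    return [(1.0 - w) * z1 + w * z2 for w in weights]

# Toy latents standing in for the two encoded prompts (shape is an assumption).
z1, z2 = torch.randn(1, 77, 768), torch.randn(1, 77, 768)
frames = interpolate_latents(z1, z2, num_of_interpolation=40)  # 40 latents to decode
```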
The `demo_t2i_arith.py` script supports sampling on a single GPU only. For linear interpolation, you need to adjust the `num_of_interpolation` parameter, which controls the number of interpolated images generated. The script requires a minimum of `5` images, but we recommend setting it to `40` for smoother interpolations. Additionally, you can enable the `save_gpu_memory` option to reduce GPU VRAM usage, though this requires extra time.

Finally, the command generates `num_of_interpolation` images in the specified `img_save_path`. Using the provided random seed (`1234`), the resulting images will appear as follows:
## Arithmetic Operations

Our model supports arithmetic operations in the text latent space. Using the Text Variational Encoder, we first encode the input text into the latent space. Arithmetic operations are then applied within this latent space, and the resulting latent representation is used to generate the corresponding image. The following command demonstrates an example:
```bash
accelerate launch --num_processes 1 --mixed_precision bf16 demo_t2i_arith.py \
    --config=configs/t2i_512px_clip_dimr.py \
    --nnet_path=path/to/t2i_512px_clip_dimr.pth \
    --img_save_path=temp_saved_images \
    --test_type=arithmetic \
    --prompt_ori='A corgi wearing a red hat in the park' \
    --prompt_a='book' \
    --prompt_s='hat'
```
The images generated in `img_save_path` include images of the input prompts, followed by the resulting image after the arithmetic operation (`prompt_ori + prompt_a - prompt_s`).

We also support single arithmetic operations: you can perform addition by providing only `prompt_ori` and `prompt_a`, or subtraction by providing only `prompt_ori` and `prompt_s`.
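In the same spirit as the interpolation sketch above, the arithmetic test boils down to element-wise addition and subtraction of the encoded prompts before decoding; the snippet below is purely illustrative, with an assumed latent shape rather than the repository's actual encoder outputs:

```python
import torch

# Toy stand-ins for latents from the Text Variational Encoder (shape is an assumption).
z_ori = torch.randn(1, 77, 768)  # 'A corgi wearing a red hat in the park'
z_a   = torch.randn(1, 77, 768)  # 'book'
z_s   = torch.randn(1, 77, 768)  # 'hat'

# prompt_ori + prompt_a - prompt_s, applied directly in the latent space,
# then decoded into an image by the flow model.
z_out = z_ori + z_a - z_s
```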
## Preparing the Training Data
To train the CrossFlow model, you need a dataset of image-text pairs. We provide a demo dataset (download here) containing 100 images sourced from JourneyDB. The dataset includes an image folder and a `.jsonl` file that specifies the image paths and their corresponding captions.

To accelerate training, you can cache the image latents (from a VAE) and text embeddings (from a language model such as CLIP or T5-XXL) beforehand. We offer preprocessing scripts to simplify this step. Specifically, you can use the `scripts/extract_train_feature.py` script to extract and save these features. Before running the script, ensure that you update the dataset paths (`json_path` and `root_path`) and set an appropriate batch size (`bz`). Once the features are generated, move them to the dataset directory, which should then have the following structure. Additionally, remember to update the training dataset path in the configuration file.

```
training_dataset
├── img_text_pair.jsonl
├── imgs
│   ├── 00a44b26-9bb4-415a-980e-a879afcb7e18.jpg
│   └── ...
└── features    # features generated by `extract_train_feature.py`
    ├── 00a44b26-9bb4-415a-980e-a879afcb7e18.npy
    └── ...
```
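For intuition, the kind of computation such a preprocessing script performs looks roughly like the sketch below, which uses the `diffusers` and `transformers` libraries for the SD VAE and CLIP text encoder. The model ids, the 0.18215 latent scale, and the saved layout are assumptions for illustration; the actual `extract_train_feature.py` may differ:

```python
import numpy as np
import torch
from diffusers import AutoencoderKL
from transformers import CLIPTextModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to(device).eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()

@torch.no_grad()
def cache_pair(image, caption, out_path):
    """Cache one image latent and text embedding as a single .npy file (illustrative layout)."""
    # image: (1, 3, H, W) tensor scaled to [-1, 1]
    latent = vae.encode(image.to(device)).latent_dist.sample() * 0.18215
    tokens = tokenizer(caption, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt").to(device)
    text_emb = text_encoder(**tokens).last_hidden_state
    np.save(out_path, {"img_latent": latent.cpu().numpy(),
                       "text_emb": text_emb.cpu().numpy()})

cache_pair(torch.randn(1, 3, 256, 256), "a corgi in the park", "example.npy")
```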
Similarly, we cache the latents and embeddings for the test set (e.g., MSCOCO) and for the prompts used for visualization during the validation step of the training process. This can be achieved by running the scripts `scripts/extract_mscoco_feature.py`, `scripts/extract_empty_feature.py`, and `scripts/extract_test_prompt_feature.py`. Before running these scripts, ensure you have downloaded the MSCOCO validation set. Then, update the dataset paths and specify the language model in each script. Once the features are generated, organize them into the following file structure in the validation dataset directory. Additionally, make sure to update the validation path in the configuration file.

```
val_dataset
├── empty_context.npy
├── run_vis
│   ├── 0.npy
│   └── ...
└── val
    ├── 0.npy
    └── ...
```
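Before launching training, it can be worth spot-checking a cached feature file; the exact contents depend on what the extraction scripts write, so treat the load below as a generic inspection rather than a documented format:

```python
import numpy as np

feat = np.load("val_dataset/val/0.npy", allow_pickle=True)
print(type(feat), getattr(feat, "shape", None))
```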
## Training
We provide a training script for text-to-image (T2I) generation in `train_t2i.py`. Additionally, a demo configuration file is available at `configs/t2i_training_demo.py`. Before starting training, adjust the settings in the configuration file as indicated by the comments. Once configured, you can launch training with `N` GPUs on a single node:

```bash
accelerate launch --multi_gpu --num_processes N --num_machines 1 --mixed_precision bf16 train_t2i_discrete.py \
    --config=configs/t2i_training_demo.py
```
The project is created for research purposes.
This codebase is built upon the following repository:
Much appreciation for their outstanding efforts.
If you use our work in your research, please use the following BibTeX entry.
```
@article{liu2024flowing,
  title={Flowing from Words to Pixels: A Framework for Cross-Modality Evolution},
  author={Liu, Qihao and Yin, Xi and Yuille, Alan and Brown, Andrew and Singh, Mannat},
  journal={arXiv preprint arXiv:2412.15213},
  year={2024}
}
```