Hi, I'm new to diffusion models.

According to the DALL·E 2 paper, the prior model predicts CLIP image embeddings from CLIP text embeddings. I think it is designed this way to minimize the modality gap.

I just don't understand why a diffusion model is needed for the prior. I know it works, but why not use a simpler network (say, an MLP) to implement the mapping from text embeddings to image embeddings? A diffusion model is quite expensive in terms of time and compute.
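For concreteness, here is a minimal sketch (in NumPy, with placeholder random weights and a hypothetical 768-dimensional embedding, as in CLIP ViT-L/14) of the kind of simple MLP prior the question proposes — a direct deterministic map from a text embedding to a predicted image embedding:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; CLIP ViT-L/14 uses 768-d embeddings.
EMB_DIM = 768
HIDDEN = 1024

# Placeholder random weights; a real model would be trained on paired
# (text embedding, image embedding) data with e.g. an MSE or cosine loss.
W1 = rng.standard_normal((EMB_DIM, HIDDEN)) * 0.02
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, EMB_DIM)) * 0.02
b2 = np.zeros(EMB_DIM)

def mlp_prior(text_emb: np.ndarray) -> np.ndarray:
    """Deterministically map a text embedding to one image embedding."""
    h = np.maximum(text_emb @ W1 + b1, 0.0)   # ReLU hidden layer
    out = h @ W2 + b2
    return out / np.linalg.norm(out)          # CLIP embeddings are unit-norm

text_emb = rng.standard_normal(EMB_DIM)
img_emb = mlp_prior(text_emb)
print(img_emb.shape)  # (768,)
```

One relevant design difference: a deterministic map like this produces exactly one image embedding per caption, whereas text-to-image is one-to-many — a diffusion prior instead samples from a distribution over plausible image embeddings for the same text.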