Implemented Coca architecture #2371

Open · wants to merge 13 commits into master
Conversation

@VarunS1997 (Collaborator) commented on Mar 4, 2024

What does this PR do?

Implements the architecture from "CoCa: Contrastive Captioners are Image-Text Foundation Models" (https://arxiv.org/pdf/2205.01917.pdf).

This PR requires:

  • Model Implementation
  • Documentation
  • Testing

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you write any new necessary tests?
  • If this adds a new model, can you run a few training steps on TPU in Colab to ensure that no XLA-incompatible ops are used?

@VarunS1997 added the "wip" (work in progress from KerasCV team) label on Mar 4, 2024
@divyashreepathihalli added the "kokoro:force-run" (Runs Tests on GPU) label on Mar 5, 2024
@kokoro-team removed the "kokoro:force-run" (Runs Tests on GPU) label on Mar 5, 2024
@divyashreepathihalli (Collaborator) left a comment:

This is great @VarunS1997!! Left a few comments. Also, please add a Colab to verify the output.

(Resolved review threads on keras_cv/models/feature_extractor/CoCa/coca_model.py and keras_cv/layers/attention_pooling.py, now outdated.)
@divyashreepathihalli (Collaborator) commented:

One additional piece of overhead work is needed:

Please add keras_cv/models/feature_extractor/coca to this file
https://github.com/keras-team/keras-cv/blob/master/.kokoro/github/ubuntu/gpu/build.sh
at lines 72 and 86.

PS: we will fix this overhead soon, but in the meantime this is what we need to do to make sure the large GPU tests run.

@mattdangerw (Member) left a comment:

Thanks! Left a few comments.

keras_cv/layers/attention_pooling.py (outdated):
import numpy as np
from keras import Sequential
from keras_cv.api_export import keras_cv_export
from keras_nlp.layers import RotaryEmbedding, TransformerDecoder
A Member commented:

I think we were doing a conditional import of keras_nlp, so keras_cv was installable without keras-nlp installed if you are using unrelated features. But that was when the only use was for one tokenizer.

if keras_nlp is None:
    raise ValueError(
        "ClipTokenizer requires keras-nlp. Please install "
        "using pip `pip install -U keras-nlp && pip install -U keras`"
    )

We could reconsider whether keras-cv should hard-depend on keras-nlp if we want more stuff like this. No strong feelings. @divyashreepathihalli FYI
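For context, here is a rough sketch (not from this PR; the helper name and error message are hypothetical) of how the CoCa module could guard the keras-nlp import in the same conditional style:

    try:
        from keras_nlp.layers import RotaryEmbedding, TransformerDecoder
    except ImportError:
        RotaryEmbedding = None
        TransformerDecoder = None


    def assert_keras_nlp_installed(symbol, layer_name):
        # Raise a clear error only when the CoCa code path is actually used.
        if symbol is None:
            raise ImportError(
                f"{layer_name} requires keras-nlp. Please install it with "
                "`pip install -U keras-nlp`."
            )


    # Inside CoCa.__init__ one would then call, for example:
    # assert_keras_nlp_installed(TransformerDecoder, "CoCa")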

A Collaborator commented:

I think as we support more multi-modal models, we should depend on KerasNLP. If the tf-text install issue is resolved, we should add it.

A Member commented:

SGTM! Though if we switch to a hard dependency here, we should probably add keras-nlp as a dependency in setup.py (which comes with a transitive dependency on tensorflow-text and tensorflow, just FYI).

@VarunS1997 (Collaborator, Author) commented:

Do we want to include that in this PR? There are already some imports of Keras-NLP in other places in KerasCV.

If we make it a separate PR, it'll be easier to roll back if we need to. Considering it's a new dependency, it might be worth separating.

keras_cv/models/feature_extractor/coca/coca_model.py (outdated):

# [batch_size, sequence_length+1, text_dim]
text_tokens = np.concatenate(texts, self.cls_token)
mask = np.concatenate((np.ones_like(texts), np.zeros_like(self.cls_token)))
A Member commented:

I can't tell if this mask is the right shape or not. Usually you want something like (batch_size, seq_length, seq_length) (or seq_length + 1 if that is the effective sequence length). What is text_dim here?

@VarunS1997 (Collaborator, Author) replied:

text_dim is the dimensionality of the text embeddings
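
For reference, a minimal NumPy sketch (shapes and names hypothetical, not the PR's code) of appending a cls token and building a pairwise mask with the (batch_size, seq_length + 1, seq_length + 1) shape discussed above:

    import numpy as np

    batch_size, seq_length, text_dim = 2, 8, 64

    # Token embeddings plus one learnable cls embedding broadcast per sample.
    texts = np.random.rand(batch_size, seq_length, text_dim).astype("float32")
    cls_token = np.random.rand(1, 1, text_dim).astype("float32")
    cls_tokens = np.repeat(cls_token, batch_size, axis=0)

    # [batch_size, seq_length + 1, text_dim]
    text_tokens = np.concatenate([texts, cls_tokens], axis=1)

    # Per-token validity expanded to a pairwise attention mask:
    # [batch_size, seq_length + 1, seq_length + 1]
    token_valid = np.ones((batch_size, seq_length + 1), dtype=bool)
    mask = token_valid[:, :, None] & token_valid[:, None, :]

    print(text_tokens.shape, mask.shape)  # (2, 9, 64) (2, 9, 9)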

@divyashreepathihalli added the "kokoro:force-run" (Runs Tests on GPU) label on Mar 11, 2024
@kokoro-team removed the "kokoro:force-run" (Runs Tests on GPU) label on Mar 11, 2024
@divyashreepathihalli (Collaborator) left a comment:


Thanks! I left a few comments.
Also, please do add a Colab demo to show that the model is working as expected.

super().build(input_shape)

self.multi_head_attn.build(input_shape)
self.layer_norm.build(input_shape)
A Collaborator commented:

Is the input shape for layer_norm correct?

@VarunS1997 (Collaborator, Author) replied:

Fixed it, let me know if it still doesn't look right!
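
As a sanity check on the shapes being discussed, a small standalone Keras 3 sketch (not the PR's layer; shapes are made up) showing that the attention output, and therefore the layer norm's build shape, follows the query rather than the value:

    from keras import layers, ops

    # The layer norm should be built on the attention *output* shape, which
    # follows the query (here a single pooling token), not the value.
    mha = layers.MultiHeadAttention(num_heads=4, key_dim=32)
    layer_norm = layers.LayerNormalization()

    query = ops.ones((2, 1, 64))    # e.g. one learned pooling query per sample
    value = ops.ones((2, 196, 64))  # e.g. image patch features

    attn_out = mha(query, value)
    layer_norm.build(attn_out.shape)  # query-shaped: (2, 1, 64)

    print(attn_out.shape, layer_norm(attn_out).shape)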

self.image_patching = PatchingAndEmbedding(
    self.encoder_width, self.img_patch_size
)
self.image_encoder = Sequential(
A Collaborator commented:

Sequential might not work; the model will not build properly.

)

self.text_embedding = RotaryEmbedding()
self.unimodal_text_decoder = Sequential(
A Collaborator commented:

Again, Sequential might not work well. Please double-check.

        for _ in range(self.unimodal_decoder_depth)
    ]
)
self.multimodal_text_decoder = Sequential(
A Collaborator commented:

Same comment about Sequential.
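
Regarding the Sequential concerns above, one common alternative (a hypothetical sketch, not the PR's code) is to keep the transformer blocks in a plain Python list and iterate over them, so each sub-layer can be built with an explicit shape:

    from keras import layers
    from keras_nlp.layers import TransformerEncoder


    class TransformerStack(layers.Layer):
        """Hypothetical stand-in for the Sequential encoder/decoder stacks."""

        def __init__(self, depth, num_heads, intermediate_dim, **kwargs):
            super().__init__(**kwargs)
            self.blocks = [
                TransformerEncoder(
                    intermediate_dim=intermediate_dim, num_heads=num_heads
                )
                for _ in range(depth)
            ]

        def build(self, input_shape):
            super().build(input_shape)
            for block in self.blocks:
                block.build(input_shape)  # e.g. (None, num_patches, width)

        def call(self, x):
            for block in self.blocks:
                x = block(x)
            return x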

num_patches = (images_shape[1] // self.img_patch_size) * (
    images_shape[2] // self.img_patch_size
) + 1
self.image_encoder.build((batch_size, self.encoder_width, num_patches))
A Collaborator commented:

You could keep batch_size as None, for example:

    self.image_encoder.build((None, self.encoder_width, num_patches))

@VarunS1997 (Collaborator, Author) replied:

Just for my understanding, is there a specific reason to do that over setting the batch_size?
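
To illustrate the suggestion (a standalone sketch with made-up shapes, not the PR's encoder): building with a None batch dimension keeps the layer usable for any batch size at call time.

    import keras
    from keras import layers, ops

    encoder = keras.Sequential([layers.Dense(64), layers.Dense(64)])
    encoder.build((None, 50, 128))  # batch dimension left unspecified

    print(encoder(ops.ones((2, 50, 128))).shape)   # (2, 50, 64)
    print(encoder(ops.ones((16, 50, 128))).shape)  # (16, 50, 64)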

contrastive loss from the ViT and the uni-modal Text Decoder is combined with a captioning loss from the multi-modal
Decoder in order to produce the combined total loss.

Basic Usage:
A Collaborator commented:

This has to be changed to "Example:", since we only follow "Example:" or "Examples:" as the standard format.
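
As background for the docstring excerpt above (not the PR's implementation; every name, weight, and shape below is illustrative), a rough sketch of combining a contrastive loss over pooled image/text embeddings with a token-level captioning loss:

    import keras
    from keras import ops


    def coca_style_loss(
        image_embed,      # (batch, embed_dim) pooled image embeddings
        text_embed,       # (batch, embed_dim) pooled (cls) text embeddings
        caption_logits,   # (batch, seq_len, vocab_size) multimodal decoder output
        caption_labels,   # (batch, seq_len) target token ids
        contrastive_weight=1.0,
        captioning_weight=2.0,
        temperature=0.07,
    ):
        # Contrastive part: symmetric cross-entropy over the similarity matrix.
        image_embed = image_embed / ops.sqrt(
            ops.sum(image_embed**2, axis=-1, keepdims=True)
        )
        text_embed = text_embed / ops.sqrt(
            ops.sum(text_embed**2, axis=-1, keepdims=True)
        )
        logits = ops.matmul(image_embed, ops.transpose(text_embed)) / temperature
        targets = ops.arange(ops.shape(logits)[0])
        contrastive = 0.5 * (
            keras.losses.sparse_categorical_crossentropy(
                targets, logits, from_logits=True
            )
            + keras.losses.sparse_categorical_crossentropy(
                targets, ops.transpose(logits), from_logits=True
            )
        )

        # Captioning part: next-token cross-entropy on the multimodal decoder.
        captioning = keras.losses.sparse_categorical_crossentropy(
            caption_labels, caption_logits, from_logits=True
        )

        return contrastive_weight * ops.mean(contrastive) + captioning_weight * ops.mean(
            captioning
        )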

Labels: wip (work in progress from KerasCV team)
Projects: none yet
6 participants