
CoCa model implementation #517

@seungkyuK

Description


🚀 The feature, motivation and pitch

Thank you for your awesome work!
I have some questions about the CoCa model implementation.

In */multimodal/torchmultimodal/models/coca/coca_model.py, it seems we can choose between using CascadedAttentionPooler and a single AttentionPooler.
However, when using CascadedAttentionPooler, the dimensions do not match at the second pooler.

For example, the vision feature extracted by the VisionEncoder has shape (B, h*w, dim).
It then passes through the vision pooler layers (pooled_outputs = self.vision_pooler(image_embeddings)), and when using CascadedAttentionPooler, self.vision_pooler consists of 2 sequential AttentionPooler layers.
After passing through the 1st AttentionPooler layer, the feature has shape (B, 256, q_dim), which does not match the LayerNorm in the second pooler, since that LayerNorm expects 'dim' rather than 'q_dim'.
Is it okay if I arbitrarily modify the input dimension of the second AttentionPooler layer?
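To make the mismatch concrete, here is a minimal toy sketch. This is not the actual torchmultimodal code; `ToyAttentionPooler` and all dimension values are made up for illustration, assuming the pooler LayerNorms its input and cross-attends learnable queries against it:

```python
# Toy sketch (not the torchmultimodal implementation) of cascading two
# attention poolers. The second pooler must be built with input_dim=q_dim,
# otherwise its LayerNorm rejects the (B, 256, q_dim) output of the first.
import torch
import torch.nn as nn


class ToyAttentionPooler(nn.Module):
    def __init__(self, input_dim: int, query_dim: int, n_queries: int, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, query_dim))
        # kdim/vdim allow keys/values in a different dim than the queries
        self.attn = nn.MultiheadAttention(
            query_dim, n_heads, kdim=input_dim, vdim=input_dim, batch_first=True
        )
        self.ln = nn.LayerNorm(input_dim)  # normalizes the *input* features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.ln(x)  # raises if x's last dim != input_dim
        q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
        out, _ = self.attn(q, x, x)
        return out  # (B, n_queries, query_dim)


B, hw, dim, q_dim = 2, 1024, 768, 512
x = torch.randn(B, hw, dim)

pooler1 = ToyAttentionPooler(input_dim=dim, query_dim=q_dim, n_queries=256)
pooled = pooler1(x)  # (B, 256, q_dim)

broken2 = ToyAttentionPooler(input_dim=dim, query_dim=q_dim, n_queries=1)
# broken2(pooled)    # RuntimeError: LayerNorm expects last dim 768, got 512

fixed2 = ToyAttentionPooler(input_dim=q_dim, query_dim=q_dim, n_queries=1)
out = fixed2(pooled)  # works: (B, 1, q_dim)
print(pooled.shape, out.shape)
```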

Similarly, when using 'vision_cls_token' with CascadedAttentionPooler, the vision feature has shape (B, h*w + 1(cls), dim) (e.g., B, 1025, 768).
The vision pooler then returns the learnable tokens after cross-attention with the vision feature, each with shape (B, 256, q_dim) for captioning_image_embeddings and contrastive_image_embeddings, respectively.
If the intention is not to use the visual features directly, is it necessary to add the 'cls_token' at the initial stage?
That is, what is the purpose of prepending the 'cls_token' to the visual features if it is never used directly?
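Continuing the toy sketch above (still purely illustrative, not the library's code): prepending a cls token only changes the keys/values the pooler attends over; the output is still just the learnable query tokens, so the cls token never appears in the output directly.

```python
# Prepend a hypothetical cls token to the patch features (B, 1024, 768) -> (B, 1025, 768).
cls_token = torch.randn(1, 1, dim)
x_with_cls = torch.cat([cls_token.expand(B, -1, -1), x], dim=1)

# Output shape is unchanged: the cls token is attended to, not returned.
pooled = pooler1(x_with_cls)  # still (B, 256, q_dim)
print(pooled.shape)
```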

Thank you again!

Alternatives

No response

Additional context

No response
