Description
🚀 The feature, motivation and pitch
Thank you for your awesome work!
I have some questions about the CoCa model implementation.
In */multimodal/torchmultimodal/models/coca/coca_model.py, it seems we can choose between a `CascadedAttentionPooler` and a single `AttentionPooler`.
However, when using `CascadedAttentionPooler`, the dimensions do not match at the second pooler.
For example, the vision feature extracted from the `VisionEncoder` has shape (B, h*w, dim).
It then passes through the vision pooler (`pooled_outputs = self.vision_pooler(image_embeddings)`), and when `CascadedAttentionPooler` is used, `self.vision_pooler` contains two sequential `AttentionPooler` layers.
After the first `AttentionPooler` layer, the feature has shape (B, 256, q_dim), which does not match the `LayerNorm` of the second pooler, which expects `dim`, not `q_dim`.
Is it okay if I modify the input dimension of the second `AttentionPooler` layer accordingly?
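To make the mismatch concrete, here is a minimal sketch with a hypothetical, simplified pooler (the real torchmultimodal `AttentionPooler` differs in detail; `SimpleAttentionPooler`, its parameter names, and the dimensions below are illustrative assumptions, not the library's API):

```python
import torch
from torch import nn

# Hypothetical, simplified attention pooler: LayerNorm on the input,
# then cross-attention from learnable queries over the input tokens.
class SimpleAttentionPooler(nn.Module):
    def __init__(self, input_embed_dim: int, output_embed_dim: int, n_queries: int = 256):
        super().__init__()
        self.ln = nn.LayerNorm(input_embed_dim)   # expects last dim == input_embed_dim
        self.query = nn.Parameter(torch.randn(n_queries, output_embed_dim))
        self.kv_proj = nn.Linear(input_embed_dim, output_embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.ln(x)                            # fails if x.shape[-1] != input_embed_dim
        kv = self.kv_proj(x)                      # (B, L, output_embed_dim)
        scores = self.query @ kv.transpose(1, 2)  # (B, n_queries, L)
        attn = torch.softmax(scores / kv.shape[-1] ** 0.5, dim=-1)
        return attn @ kv                          # (B, n_queries, output_embed_dim)

dim, q_dim = 768, 512
x = torch.randn(2, 1024, dim)                     # (B, h*w, dim)
first = SimpleAttentionPooler(dim, q_dim)
pooled = first(x)                                 # (B, 256, q_dim)

# A second pooler built with input_embed_dim=dim breaks, since its
# LayerNorm now sees q_dim-sized features:
second_mismatched = SimpleAttentionPooler(dim, q_dim)
# second_mismatched(pooled)  -> RuntimeError from LayerNorm

# Setting input_embed_dim=q_dim for the second pooler resolves it:
second = SimpleAttentionPooler(q_dim, q_dim)
out = second(pooled)                              # (B, 256, q_dim)
```

If this matches the intended design, the second cascaded pooler would presumably need its input dimension set to the first pooler's output dimension.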
Similarly, when using `vision_cls_token` with `CascadedAttentionPooler`, the vision feature has shape (B, h*w + 1(cls), dim) (e.g., (B, 1025, 768)).
The vision pooler then returns learnable tokens after cross-attention with the vision feature, with shape (B, 256, q_dim) for `captioning_image_embeddings` and `contrastive_image_embeddings`, respectively.
If the visual features are not meant to be used directly, is it necessary to add the `cls_token` at the initial stage?
In other words, what is the purpose of prepending the `cls_token` to the visual features if it is never used directly?
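A small shape sketch of what I mean (the dimensions and the plain linear projection here are illustrative assumptions, not the actual model code):

```python
import torch

B, hw, dim = 2, 1024, 768
cls_token = torch.zeros(B, 1, dim)                  # learnable in the real model
features = torch.cat([cls_token, torch.randn(B, hw, dim)], dim=1)
# features.shape == (2, 1025, 768), i.e., (B, h*w + 1, dim)

# Cross-attention pooling with n_queries learnable queries: the cls token
# only contributes one extra key/value position; the output length is
# determined by the queries, not by the input sequence length.
n_queries, q_dim = 256, 512
queries = torch.randn(n_queries, q_dim)
kv = torch.nn.Linear(dim, q_dim)(features)          # (B, 1025, q_dim)
attn = torch.softmax(queries @ kv.transpose(1, 2) / q_dim ** 0.5, dim=-1)
pooled = attn @ kv                                  # (2, 256, 512)
```

So the cls token ends up as just one more attended position, which is why I am asking whether prepending it serves any purpose here.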
Thank you again!
Alternatives
No response
Additional context
No response