Confusion about image->caption example #239

mtran14 · 2024-02-09T08:34:10Z

Hello,

Thank you for creating a great repository. I'm new to x-transformers and I'm a bit confused about the provided sample usage for image captioning:

import torch
from x_transformers import ViTransformerWrapper, TransformerWrapper, Encoder, Decoder

encoder = ViTransformerWrapper(
    image_size = 256,
    patch_size = 32,
    attn_layers = Encoder(
        dim = 512,
        depth = 6,
        heads = 8
    )
)

decoder = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        cross_attend = True
    )
)

img = torch.randn(1, 3, 256, 256)
caption = torch.randint(0, 20000, (1, 1024))

encoded = encoder(img, return_embeddings = True)
decoder(caption, context = encoded) # (1, 1024, 20000)

I suppose the code is for model training, where pairs of [img, caption] are available.

Why do we feed caption (our target predictions) into the decoder? Shouldn't the decoder take only encoded as input and produce predictions for caption?
How should I use the trained model for inference, when only img is available (and caption is unknown/hidden)?
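
For context on the two questions above: a common pattern in autoregressive decoders is teacher forcing. During training the decoder receives the caption shifted right by one position and predicts the next token at each step. At inference it starts from a start token alone and feeds its own predictions back in. The pure-Python sketch below only illustrates that pattern and may not match what the example intends. `BOS`, `EOS`, `teacher_forcing_pair`, `greedy_decode` and `toy_next_token` are hypothetical names for illustration, not part of x-transformers.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical special token ids; the example above does not define any.
BOS, EOS = 1, 2


def teacher_forcing_pair(caption: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Build decoder input and target from one caption, offset by one token."""
    tokens = [BOS, *caption, EOS]
    decoder_input = tokens[:-1]  # what the decoder sees during training
    target = tokens[1:]          # what it should predict at each position
    return decoder_input, target


def greedy_decode(next_token: Callable[[List[int]], int], max_len: int = 16) -> List[int]:
    """Generate a caption token by token when no caption is available."""
    out = [BOS]
    for _ in range(max_len):
        tok = next_token(out)  # stand-in for picking the top logit at the last position
        if tok == EOS:
            break
        out.append(tok)
    return out[1:]  # drop the BOS token


def toy_next_token(prefix: List[int]) -> int:
    """Toy stand-in for a trained decoder: replays a fixed caption, then EOS."""
    script = [7, 8, 9, EOS]
    return script[len(prefix) - 1]


if __name__ == "__main__":
    print(teacher_forcing_pair([7, 8, 9]))
    print(greedy_decode(toy_next_token))
```

In a real model, `next_token` would presumably call the decoder on the tokens generated so far, together with the encoded image as context, and take the most likely token at the final position.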

Thanks in advance!

mk-runner · 2024-04-18T14:52:26Z

I also have the same question, hoping to clarify it. Thank you!
