Hello,

Thank you for creating a great repository. I'm new to `x-transformers` and I'm a bit confused about the provided sample usage for image captioning:
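For reference, the sample usage in question, approximately as it appears in the x-transformers README (reproduced here, so details may differ slightly from the current version):

```python
import torch
from x_transformers import ViTransformerWrapper, TransformerWrapper, Encoder, Decoder

encoder = ViTransformerWrapper(
    image_size = 256,
    patch_size = 32,
    attn_layers = Encoder(
        dim = 512,
        depth = 6,
        heads = 8
    )
)

decoder = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        cross_attend = True
    )
)

img = torch.randn(1, 3, 256, 256)
caption = torch.randint(0, 20000, (1, 1024))

encoded = encoder(img, return_embeddings = True)
decoder(caption, context = encoded)  # (1, 1024, 20000)
```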
I suppose the code is for model training, where pairs of `[img, caption]` are available.

Why do we feed `caption` (our target predictions) into the decoder? Shouldn't the decoder only take `encoded` as input and produce predictions for `caption`?

How should I use the trained model for inference, when only `img` is available (and `caption` is unknown/hidden)?

Thanks in advance!
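On the first question: feeding `caption` into the decoder during training is standard teacher forcing. The decoder's causal mask ensures each position only attends to earlier tokens, so the model learns to predict every next token from the ground-truth prefix, while `encoded` enters through cross-attention. At inference there is no ground truth, so you decode autoregressively, feeding the tokens generated so far back into the decoder. Below is a minimal greedy-decoding sketch, assuming the trained `encoder`/`decoder` from the snippet above and hypothetical `BOS_ID`/`EOS_ID` special-token ids (not part of the README example):

```python
import torch

# Hypothetical special-token ids and caption-length cap; adjust to your tokenizer.
BOS_ID, EOS_ID, MAX_LEN = 1, 2, 64

@torch.no_grad()
def caption_image(img):
    encoded = encoder(img, return_embeddings = True)  # encode the image once
    # start every sequence with the <bos> token
    tokens = torch.full((img.shape[0], 1), BOS_ID, dtype = torch.long, device = img.device)
    for _ in range(MAX_LEN):
        logits = decoder(tokens, context = encoded)   # (batch, seq, vocab)
        next_token = logits[:, -1].argmax(dim = -1, keepdim = True)  # greedy: most likely next token
        tokens = torch.cat((tokens, next_token), dim = -1)
        if (next_token == EOS_ID).all():              # stop once every sequence has emitted <eos>
            break
    return tokens[:, 1:]                              # drop the <bos> prefix
```

Note that `x-transformers` also provides an `AutoregressiveWrapper` that packages this kind of sampling loop (with temperature, filtering, etc.), which may be the more idiomatic route; the manual loop above is just meant to make the mechanics explicit.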