-
Hi, great observation and nice example! This is one of the interesting emergent properties we see in LLaVA, even though it has not been explicitly trained or instructed to recognize text in images (OCR), and such data is scarce in our training set. One possible explanation is that this ability was learned during CLIP pretraining (CLIP is our vision encoder), and that some of it transfers to our model during the feature alignment process. We are exploring this and looking for ways to improve these capabilities to make LLaVA even better. Looking forward to more discussions :)
-
I was wondering if it's capable of reading text in images. It seems to do OK!