Inference on my own data? #1
Comments
@j-min thanks for the kind reply. When you have time, would it be possible for you to provide a Jupyter (or Colab) example of how to run inference on a given image-question pair? I think it would greatly help people like me, for example something like this detectron2 tutorial. Once again, thank you so much for your work!
@j-min Also, what information does 'vis_feats' actually hold? The input_ids must be the ids of the tokenized question, and the boxes must be the bounding box coordinates of the objects detected in the image, right? I am unsure of what information vis_feats contains. I am not very familiar with Faster R-CNN or detectron2, so I hope you'll understand even if my question is very basic :(
As you can see in the
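For other readers landing here: a rough sketch of what the three inputs typically look like. The 36-region, 2048-dimensional layout is an assumption based on the standard LXMERT-style Faster R-CNN features mentioned below, and the dummy values are only there to show the shapes.

```python
import torch

# Dummy batch; shapes assume LXMERT-style Faster R-CNN features (36 regions, 2048-dim).
batch_size, n_boxes, feat_dim = 1, 36, 2048

input_ids = torch.randint(0, 32100, (batch_size, 20))   # tokenized question, e.g. "vqa: what is ...?"
vis_feats = torch.randn(batch_size, n_boxes, feat_dim)  # RoI-pooled feature vector of each detected region
boxes     = torch.rand(batch_size, n_boxes, 4)          # region coordinates, typically normalized (x1, y1, x2, y2)
```

In other words, vis_feats holds one feature vector per detected region (the detector's pooled representation of that image patch), while boxes holds where those regions are.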
@j-min This is the Faster R-CNN provided by the Hugging Face repo for LXMERT.
I am using detectron2 for feature extraction for generative VQA, but I am getting empty outputs. Can you let me know if I am passing in the data correctly?
The output is: generated answer = ['']
I'm afraid you might not have loaded the pretrained checkpoint properly. Please check out
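A minimal sanity check for that, assuming a standard PyTorch checkpoint; the path below is hypothetical, so substitute whatever you actually downloaded.

```python
import torch

# Hypothetical path; replace it with the pretrained checkpoint you downloaded.
ckpt_path = 'snap/pretrain/VLT5/Epoch30.pth'

state_dict = torch.load(ckpt_path, map_location='cpu')
print(type(state_dict))
if isinstance(state_dict, dict):
    # The keys should look like T5/VL-T5 parameter names.
    for key in list(state_dict.keys())[:10]:
        print(key)
```

It is also worth inspecting the value returned by model.load_state_dict(state_dict, strict=False): PyTorch lists the missing and unexpected keys there, and long lists usually mean the checkpoint did not match the model.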
I created a Google Colab for custom image processing. Hope this helps.
@j-min Would creating my own batches of data be enough to fine-tune the model?
@j-min
Yes, the py-bottom-up-attention repo is compatible with the Hugging Face Transformers LXMERT demo. VCR questions (https://visualcommonsense.com/explore/?im=2519) have a different format than VQA, for example person grounding and multiple choice, so I don't think such fine-tuning is trivial. You can search for or create a new dataset that has a format similar to VQA but with longer answers. Once you have a custom VQA dataset, you can then fine-tune the VLT5 or VLT5VQA model with it. You can start by modifying the Dataset class in
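To make the Dataset suggestion a bit more concrete, here is a hypothetical sketch of a custom VQA-style dataset with free-form answers. The field names and HDF5 keys are assumptions modelled on the discussion above, not the repo's exact interface, so adapt them to the actual Dataset class you are modifying.

```python
import json
import h5py
import torch
from torch.utils.data import Dataset
from transformers import T5Tokenizer


class CustomVQADataset(Dataset):
    """Hypothetical custom VQA-style dataset: (image features, question, long answer)."""

    def __init__(self, ann_path, feat_path, max_length=40):
        # ann_path: JSON list of {"img_id": ..., "question": ..., "answer": ...}
        self.data = json.load(open(ann_path))
        self.feats = h5py.File(feat_path, 'r')  # HDF5 with per-image 'features' and 'boxes'
        self.tokenizer = T5Tokenizer.from_pretrained('t5-base')
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]

        # Text side: task prefix + question as input, free-form answer as target.
        input_ids = self.tokenizer.encode(
            'vqa: ' + item['question'], max_length=self.max_length, truncation=True)
        target_ids = self.tokenizer.encode(
            item['answer'], max_length=self.max_length, truncation=True)

        # Vision side: precomputed Faster R-CNN features and (normalized) boxes.
        grp = self.feats[item['img_id']]
        vis_feats = torch.from_numpy(grp['features'][()])  # (n_boxes, 2048)
        boxes = torch.from_numpy(grp['boxes'][()])         # (n_boxes, 4)

        return {
            'input_ids': torch.LongTensor(input_ids),
            'target_ids': torch.LongTensor(target_ids),
            'vis_feats': vis_feats,
            'boxes': boxes,
        }
```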
Looking at the VL-T5 paper, it seems like the decoder generates text in an autoregressive manner, i.e., it predicts the probability of future text tokens (over the vocabulary it already knows) conditioned on the encoder outputs and the previously generated tokens. Is this understanding of mine correct? If so, and if I want to fine-tune with additional training examples (with longer answers), do I need to modify the vocab size of the output space (so that it is tailored to the additional fine-tuning data that I have)?
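The autoregressive behaviour can be seen with a plain T5 from Hugging Face Transformers (t5-base here, not VL-T5 itself, purely to illustrate the decoding): generate emits one subword token at a time, each conditioned on the encoder output and the tokens produced so far.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

# Greedy autoregressive decoding: each new token is predicted from the fixed
# subword vocabulary, conditioned on the encoder states and previous tokens.
inputs = tokenizer('vqa: what color is the cat?', return_tensors='pt')
output_ids = model.generate(**inputs, max_length=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Because the decoder picks from T5's fixed SentencePiece subword vocabulary rather than from a closed list of answers, longer free-form answers do not by themselves require changing the output vocabulary size, at least for a plain T5.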
@j-min If I want to train, then do I simply have to create a VLT5VQA model instance like
and then simply do
in order to train?
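Not the repo's exact API, but as a rough illustration of what a seq2seq training step looks like, here is a sketch with a plain T5ForConditionalGeneration (a VL-T5 model would additionally consume the visual features and boxes):

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One hypothetical training step: question in, long free-form answer as the label.
inputs = tokenizer('vqa: what is the man holding?', return_tensors='pt')
labels = tokenizer('he is holding a red umbrella', return_tensors='pt').input_ids

loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```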
@j-min what information does
I am not so familiar with this concept, so I'm sorry if the question is too basic. Are the scores the intersection over union between the reference boxes and the boxes detected by Faster R-CNN? If so, how can I obtain the reference boxes? I am using the COCO dataset for images (train2014.zip: http://images.cocodataset.org/zips/train2014.zip); I just have longer questions and answers.
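For reference on the IoU part of the question, intersection over union between two boxes in (x1, y1, x2, y2) format is just the following generic computation (not the repo's code):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)


print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```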
@j-min So I accidentally fine-tuned the model using data that does not have the 'vqa:' tag for the questions, but after training it for 10 epochs and testing it, it seems to be working fine.
You don't have to use
@j-min
Yes, you can check out VCR for such a setting. You also might want to check Visual7W and how models tackle these datasets.
About the GitHub links for the VL-T5 model: the VCR task dataset's image feature file "train_boxes36.h5" can't be downloaded. Could you give me another link?
Hello! First of all, thank you so much for your work. I have read your paper and I want to carry out some open-ended VQA / answer-generation experiments with the model you proposed (VL-T5). However, I am unsure where to start with the provided code. Would it be possible for you to provide example code for extracting image features and text features for a custom dataset (not data in VQA v2.0)? I want to test whether it can generate answers based on images and questions that I have prepared.
Thank you so much, and I am so sorry for troubling you.