Hello! I have a multimodal classification task. In my dataset, each record consists of a text and a variable number of associated images.
I'd probably use transformer encoders to represent both the text and the images, and then fine-tune them. In your experience, what is the best way to represent a variable number of images associated with the same record? (I also thought of taking the average of the different image embeddings, or similar pooling approaches; I might do that if I froze the encoder, but I want to fine-tune it. A rough sketch of the idea is below.)
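To make the averaging idea concrete, here is a minimal PyTorch sketch; the encoder checkpoints, the zero-padding/masking scheme, and all names are placeholder assumptions on my part:

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class MultiImageTextClassifier(nn.Module):
    """Masked mean-pooling over a variable number of images per record.

    Checkpoint names, the classification head, and the padding scheme are
    placeholder assumptions, not tied to any specific dataset.
    """

    def __init__(self, num_classes: int):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.image_encoder = AutoModel.from_pretrained(
            "google/vit-base-patch16-224-in21k"
        )
        fused_dim = (
            self.text_encoder.config.hidden_size
            + self.image_encoder.config.hidden_size
        )
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, input_ids, attention_mask, pixel_values, image_mask):
        # pixel_values: (batch, max_images, 3, 224, 224), zero-padded
        # image_mask:   (batch, max_images), 1 for real images, 0 for padding
        batch, max_images = pixel_values.shape[:2]
        # Encode every image with the same encoder. The mean is just a
        # differentiable pooling op, so gradients still reach the encoder
        # and averaging stays compatible with fine-tuning.
        image_emb = self.image_encoder(
            pixel_values=pixel_values.flatten(0, 1)
        ).pooler_output.view(batch, max_images, -1)
        mask = image_mask.unsqueeze(-1).float()
        image_emb = (image_emb * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        text_emb = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).pooler_output
        return self.classifier(torch.cat([text_emb, image_emb], dim=-1))
```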
Hi @anakin87, your task fits right in and we're glad you stumbled across Ludwig! Averaging the image embeddings is a good idea. Here are a few more:
Option 1: represent each record's images as a single image feature. I'd go with this option, as it will be cheaper in terms of resources to train and iterate on. It creates a single encoder for the image feature and another one for the text feature, each of which you can fine-tune or freeze as you like; a config sketch is below.
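A rough sketch of what that could look like via Ludwig's Python API. The column names are made up, and the exact config keys (the nested `encoder: {type: ...}` dict, the `trainable` flag) vary between Ludwig releases, so check them against the docs for your version:

```python
from ludwig.api import LudwigModel

# Option 1 sketch: one image feature, one text feature, one output.
# Column names ("image", "text", "label") are hypothetical.
config = {
    "input_features": [
        {"name": "image", "type": "image", "encoder": {"type": "stacked_cnn"}},
        {
            "name": "text",
            "type": "text",
            # trainable=True fine-tunes the pretrained encoder;
            # set it to False to freeze the weights instead.
            "encoder": {"type": "bert", "trainable": True},
        },
    ],
    "output_features": [{"name": "label", "type": "category"}],
}

model = LudwigModel(config)
# model.train(dataset="data.csv")  # hypothetical dataset path
```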
Option 2: create a separate image input feature for each image slot (image_1, image_2, ...) and share one encoder across them. With this, you'll have one shared encoder for the image features and one encoder for the text feature. Since records have different numbers of images, some of your image features will contain null values; the default missing-value handling (the feature's `missing_value_strategy` in preprocessing) will fill these in. A config sketch is below.
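As a sketch, again with made-up column names, and assuming the `tied` and `missing_value_strategy` parameters as spelled in recent Ludwig versions (verify both against your version's docs):

```python
from ludwig.api import LudwigModel

MAX_IMAGES = 4  # assumption: fixed cap on images per record

# Option 2 sketch: one input column per image slot; slots beyond a record's
# actual image count are null and get filled by the missing-value strategy.
image_features = []
for i in range(1, MAX_IMAGES + 1):
    feature = {
        "name": f"image_{i}",
        "type": "image",
        "encoder": {"type": "stacked_cnn"},
        "preprocessing": {"missing_value_strategy": "bfill"},
    }
    if i > 1:
        # Tie weights to the first image feature so one encoder is shared.
        feature["tied"] = "image_1"
    image_features.append(feature)

config = {
    "input_features": image_features
    + [{"name": "text", "type": "text", "encoder": {"type": "bert", "trainable": True}}],
    "output_features": [{"name": "label", "type": "category"}],
}

model = LudwigModel(config)
```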
This will be the fastest in terms of training, but the effect of the null values on model performance will vary depending on how many there are. If you have a way to augment the image features, this could be a good option.

Independent of which option you choose, you can explore the image augmentation support that we just added to Ludwig.

We haven't worked on this exact problem, but we did work on a similar one, where we had different (but close) labels for the same data record, and we found that the general idea of option 1 worked well.

I'm curious what you end up choosing and how it turns out! Feel free to keep me updated and we can discuss more.