Hello! I have a multimodal classification task. In my dataset, each record consists of a text and a variable number of associated images.
I'd probably use transformer encoders to represent both the text and the images, and then fine-tune them. In your experience, what is the best way to represent a variable number of images associated with the same record? (I also thought of taking the average of the different image embeddings, or similar pooling approaches; I might do that if I froze the encoder, but I want to fine-tune it. A rough sketch of the idea is below.)
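To make the averaging idea concrete, here is a minimal PyTorch sketch; the encoder checkpoints, the zero-padding/masking scheme, and all names are placeholder assumptions on my part:

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class MultiImageTextClassifier(nn.Module):
    """Masked mean-pooling over a variable number of images per record.

    Checkpoint names, the classification head, and the padding scheme are
    placeholder assumptions, not tied to any specific dataset.
    """

    def __init__(self, num_classes: int):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.image_encoder = AutoModel.from_pretrained(
            "google/vit-base-patch16-224-in21k"
        )
        fused_dim = (
            self.text_encoder.config.hidden_size
            + self.image_encoder.config.hidden_size
        )
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, input_ids, attention_mask, pixel_values, image_mask):
        # pixel_values: (batch, max_images, 3, 224, 224), zero-padded
        # image_mask:   (batch, max_images), 1 for real images, 0 for padding
        batch, max_images = pixel_values.shape[:2]
        # Encode every image with the same encoder. The mean is just a
        # differentiable pooling op, so gradients still reach the encoder
        # and averaging stays compatible with fine-tuning.
        image_emb = self.image_encoder(
            pixel_values=pixel_values.flatten(0, 1)
        ).pooler_output.view(batch, max_images, -1)
        mask = image_mask.unsqueeze(-1).float()
        image_emb = (image_emb * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        text_emb = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).pooler_output
        return self.classifier(torch.cat([text_emb, image_emb], dim=-1))
```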
Hi @anakin87, your task fits right in and we're glad you stumbled across Ludwig! Averaging the image embeddings is a good idea. Here are a few more:
Option 1: represent each record's images as a single image feature. I'd go with this option, as it will be cheaper in terms of resources to train and iterate on. It creates a single encoder for the image feature and another one for the text feature, each of which you can fine-tune or freeze as you like; a config sketch is below.
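A rough sketch of what that could look like via Ludwig's Python API. The column names are made up, and the exact config keys (the nested `encoder: {type: ...}` dict, the `trainable` flag) vary between Ludwig releases, so check them against the docs for your version:

```python
from ludwig.api import LudwigModel

# Option 1 sketch: one image feature, one text feature, one output.
# Column names ("image", "text", "label") are hypothetical.
config = {
    "input_features": [
        {"name": "image", "type": "image", "encoder": {"type": "stacked_cnn"}},
        {
            "name": "text",
            "type": "text",
            # trainable=True fine-tunes the pretrained encoder;
            # set it to False to freeze the weights instead.
            "encoder": {"type": "bert", "trainable": True},
        },
    ],
    "output_features": [{"name": "label", "type": "category"}],
}

model = LudwigModel(config)
# model.train(dataset="data.csv")  # hypothetical dataset path
```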
Option 2: create a separate image input feature for each image slot (image_1, image_2, ...) and share one encoder across them. With this, you'll have one shared encoder for the image features and one encoder for the text feature. Since records have different numbers of images, some of your image features will contain null values; the default missing-value handling (the feature's `missing_value_strategy` in preprocessing) will fill these in. A config sketch is below.
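As a sketch, again with made-up column names, and assuming the `tied` and `missing_value_strategy` parameters as spelled in recent Ludwig versions (verify both against your version's docs):

```python
from ludwig.api import LudwigModel

MAX_IMAGES = 4  # assumption: fixed cap on images per record

# Option 2 sketch: one input column per image slot; slots beyond a record's
# actual image count are null and get filled by the missing-value strategy.
image_features = []
for i in range(1, MAX_IMAGES + 1):
    feature = {
        "name": f"image_{i}",
        "type": "image",
        "encoder": {"type": "stacked_cnn"},
        "preprocessing": {"missing_value_strategy": "bfill"},
    }
    if i > 1:
        # Tie weights to the first image feature so one encoder is shared.
        feature["tied"] = "image_1"
    image_features.append(feature)

config = {
    "input_features": image_features
    + [{"name": "text", "type": "text", "encoder": {"type": "bert", "trainable": True}}],
    "output_features": [{"name": "label", "type": "category"}],
}

model = LudwigModel(config)
```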
This will be the fastest in terms of training, but the effect of the null values on model performance will vary depending on how many there are. If you have a way to augment the image features, this could be a good option.

Independent of which option you choose, you can explore the image augmentation support that we just added to Ludwig.

We haven't worked on this exact problem, but we did work on a similar one, where we had different (but close) labels for the same data record, and we found that the general idea of option 1 worked well.

I'm curious what you end up choosing and how it turns out! Feel free to keep me updated and we can discuss more.