A deep learning project that automatically generates natural language descriptions for images using various neural network architectures. This project is the implementation of the research paper "Optimizing Encoder-Decoder Architectures for Image Captioning Tasks: An Analysis of Transfer Learning and Attention Mechanisms."
This project implements and compares multiple image captioning models, systematically analyzing the impact of transfer learning and attention mechanisms on encoder-decoder architectures:
- Baseline CNN-LSTM Model: A traditional encoder-decoder architecture using a pre-trained ResNet-50 CNN for image feature extraction and an LSTM for caption generation (a minimal sketch of this architecture follows the list).
- Two-Stage Object Detection Encoder with LSTM Decoder: An enhanced model that incorporates Mask R-CNN (with a ResNet-50-FPN backbone) for detailed feature extraction through object detection, while retaining the LSTM decoder.
- Two-Stage Object Detection Encoder with Attentive Transformer Decoder: A further improvement that keeps the Mask R-CNN encoder but replaces the LSTM with a Transformer-based decoder equipped with cross-attention, enabling more context-aware caption generation.
- Vision Transformer (ViT) Encoder with GPT-2 Decoder: A modern approach using a Vision Transformer (ViT) encoder and a GPT-2 decoder, leveraging large-scale transformer architectures to capture comprehensive visual and contextual information.
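For orientation, here is a minimal, illustrative sketch of how the baseline CNN-LSTM encoder-decoder can be wired up in PyTorch. Class names, embedding sizes, and other details are placeholders and may differ from the actual code in `models/model_1_baseline_cnn_lstm/`.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """ResNet-50 encoder: frozen backbone, trainable linear projection."""

    def __init__(self, embed_size: int):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the classification head, keep everything up to global average pooling.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images).flatten(1)   # (B, 2048)
        return self.fc(feats)                      # (B, embed_size)


class DecoderLSTM(nn.Module):
    """LSTM decoder conditioned on the image embedding as the first input step."""

    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, img_emb: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        tok_emb = self.embed(captions)                            # (B, T, E)
        inputs = torch.cat([img_emb.unsqueeze(1), tok_emb], dim=1)  # prepend image
        out, _ = self.lstm(inputs)
        return self.fc(out)                                       # (B, T+1, vocab)
```

The attention-based and transformer variants replace either the encoder features (region-level Mask R-CNN features) or the decoder (cross-attention Transformer, GPT-2) while keeping this overall encoder-decoder shape.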
Our key findings include:
- Progressive performance improvements from the baseline CNN-LSTM to the ViT-GPT-2 architecture
- The significant impact of attention mechanisms in both encoder and decoder components
- Challenges in transferring knowledge from large, general pre-training datasets to smaller, specialized datasets
- Training instabilities encountered with large transformer-based models
The models are evaluated using standard metrics for image captioning:
- BLEU (Bilingual Evaluation Understudy)
- METEOR (Metric for Evaluation of Translation with Explicit ORdering)
- CIDEr (Consensus-based Image Description Evaluation)
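As a quick illustration, sentence-level BLEU can be computed with NLTK as shown below. This snippet is only a sketch; the project's own implementation lives in `metrics.py` and may differ in tokenization and smoothing choices.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# One hypothesis caption and its reference captions, pre-tokenized.
references = [
    "a dog runs across the grassy field".split(),
    "a brown dog is running on the grass".split(),
]
hypothesis = "a dog is running on the grass".split()

# BLEU-4 with uniform n-gram weights and smoothing for short captions.
score = sentence_bleu(
    references,
    hypothesis,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {score:.3f}")
```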
The project supports two datasets:
- Flickr8k: A smaller dataset with 8,000 images and five human-written captions per image
- Flickr30k: A larger dataset with roughly 30,000 images, also with five captions per image
The datasets are split into training, validation, and testing sets. Images are preprocessed with resizing, cropping, and data augmentation techniques including random horizontal flipping, rotation, and color jittering.
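The exact preprocessing pipeline lives in `data/`; as a rough illustration, the augmentations described above typically correspond to a `torchvision` transform like the one below (the parameter values here are placeholders, not the project's exact settings).

```python
from torchvision import transforms

# Training-time preprocessing: resize, crop, augment, then normalize.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    # ImageNet statistics, matching the pre-trained encoders.
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```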
Virtual environment setup:

```bash
conda env create -f conda.yml
conda activate image-captioning-project
```

If you already have the conda environment, update it with:

```bash
conda env update --file conda.yml --prune
```

Install the pre-commit hook:

```bash
pre-commit install
```

To download the Flickr8k dataset:

```bash
sh download_flickr8k.sh
```

To download the Flickr30k dataset:

```bash
sh download_flickr30k.sh
```
The project includes a web-based demo application that allows you to:
- View captions generated by different models for sample images
- Upload your own images and generate captions
- Compare the performance of different models side by side
- Visualize the attention mechanisms in action
Run the following command:

```bash
uvicorn app.main:app --reload
```

Then access the web interface at http://127.0.0.1:8000.
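Under the hood, the demo is a FastAPI application served by `uvicorn`. The sketch below shows what a caption endpoint can look like; the route name and the `generate_caption` helper are hypothetical stand-ins, not the actual code in `app/main.py`.

```python
import io

from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()


def generate_caption(image: Image.Image) -> str:
    # Placeholder: the real demo would run one of the trained captioning models here.
    return "a placeholder caption"


@app.post("/caption")
async def caption(image: UploadFile = File(...)):
    # Decode the uploaded file into a PIL image and run the captioning model.
    pil_image = Image.open(io.BytesIO(await image.read())).convert("RGB")
    return {"caption": generate_caption(pil_image)}
```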
Project structure:

- `app/`: Web application for demonstrating the models
- `models/`: Implementation of the different image captioning architectures
  - `model_1_baseline_cnn_lstm/`: Baseline ResNet-50 CNN with LSTM decoder
  - `model_2_image_segmentation_lstm/`: Mask R-CNN encoder with LSTM decoder
  - `model_3_image_segmentation_attention_decoder/`: Mask R-CNN encoder with attentive transformer decoder
  - `model_4_vision_transformer/`: ViT encoder with GPT-2 decoder
- `data/`: Data handling and preprocessing utilities
- `metrics.py`: Evaluation metrics implementation (BLEU, METEOR, CIDEr)
- `conda.yml`: Environment configuration
If you use this code for your research, please cite our paper:
```bibtex
@article{lee2023optimizing,
  title={Optimizing Encoder-Decoder Architectures for Image Captioning Tasks: An Analysis of Transfer Learning and Attention Mechanisms},
  author={Lee, Jed Woon Kiat and Koh, Quan Wei Ivan},
  journal={},
  year={2023}
}
```