
Image Captioning Project

A deep learning project that automatically generates natural language descriptions for images using various neural network architectures. This project is the implementation of the research paper "Optimizing Encoder-Decoder Architectures for Image Captioning Tasks: An Analysis of Transfer Learning and Attention Mechanisms."

Project Overview

This project implements and compares multiple image captioning models, systematically analyzing the impact of transfer learning and attention mechanisms on encoder-decoder architectures:

  1. Baseline CNN-LSTM Model: A traditional encoder-decoder architecture using a pre-trained ResNet-50 CNN for image feature extraction and an LSTM for caption generation (a minimal sketch follows this list).

  2. Two-Stage Object Detection Encoder with LSTM Decoder: Enhanced model that incorporates Mask R-CNN (with ResNet-50-FPN backbone) for detailed feature extraction through object detection, while maintaining the LSTM decoder.

  3. Two-Stage Object Detection Encoder with Attentive Transformer Decoder: Further improvement that maintains the Mask R-CNN encoder but replaces the LSTM with a Transformer-based decoder equipped with cross-attention mechanisms, enabling more context-aware caption generation.

  4. Vision Transformer (ViT) Encoder with GPT-2 Decoder: Modern approach using Vision Transformers (ViT) for the encoder and GPT-2 for the decoder, leveraging large-scale transformer architectures to capture comprehensive visual and contextual information.
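
The following is a minimal sketch of the baseline encoder-decoder (model 1), assuming a PyTorch implementation; the layer sizes, class names, and frozen-backbone choice are illustrative and not taken from the repository's code.

import torch
import torch.nn as nn
from torchvision import models

class EncoderCNN(nn.Module):
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the classification head and keep the pooled 2048-d feature vector.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.backbone.parameters():
            p.requires_grad = False  # transfer learning: keep the CNN frozen
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        feats = self.backbone(images).flatten(1)  # (B, 2048)
        return self.fc(feats)                     # (B, embed_size)

class DecoderLSTM(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, image_feats, captions):
        # Prepend the image feature as the first step of the input sequence.
        embeddings = self.embed(captions)                          # (B, T, E)
        inputs = torch.cat([image_feats.unsqueeze(1), embeddings], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                                     # (B, T+1, V)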

Research Findings

Our key findings include:

  • Progressive performance improvements from the baseline CNN-LSTM to the ViT-GPT-2 architecture
  • The significant impact of attention mechanisms in both encoder and decoder components
  • Challenges in transferring knowledge from large, general pre-training datasets to smaller, specialized datasets
  • Training instabilities encountered with large transformer-based models

The models are evaluated using standard metrics for image captioning:

  • BLEU (Bilingual Evaluation Understudy)
  • METEOR (Metric for Evaluation of Translation with Explicit ORdering)
  • CIDEr (Consensus-based Image Description Evaluation)
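
As a rough illustration of how these overlap-based metrics are applied (the project's own implementation lives in metrics.py and may differ), corpus-level BLEU can be computed with NLTK over tokenized captions:

# Illustrative only: BLEU-4 over one image with two reference captions.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [
    [["a", "dog", "runs", "on", "the", "beach"],
     ["a", "dog", "is", "running", "along", "the", "shore"]],
]
hypotheses = [["a", "dog", "runs", "along", "the", "beach"]]

smooth = SmoothingFunction().method1
bleu4 = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=smooth)
print(f"BLEU-4: {bleu4:.3f}")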

Datasets

The project supports two datasets:

  • Flickr8k: A smaller dataset with 8,000 images and five human-written captions per image
  • Flickr30k: A larger dataset with roughly 30,000 images, also with five reference captions per image

The datasets are split into training, validation, and testing sets. Images are preprocessed with resizing, cropping, and data augmentation techniques including random horizontal flipping, rotation, and color jittering.
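
A sketch of a training-time preprocessing pipeline matching the augmentations listed above is shown below; the crop size, rotation angle, and jitter strengths are assumptions, not values taken from the repository.

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(256),                        # resize the shorter side
    transforms.RandomCrop(224),                    # crop to the model input size
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),  # ImageNet statistics
])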

Setup

Virtual environment setup

conda env create -f conda.yml
conda activate image-captioning-project

If you already have the conda environment, update it with:

conda env update --file conda.yml --prune

Install the pre-commit hooks

pre-commit install

To download the Flickr8k dataset

sh download_flickr8k.sh

To download the Flickr30k dataset

sh download_flickr30k.sh

Demo Application

The project includes a web-based demo application that allows you to:

  • View captions generated by different models for sample images
  • Upload your own images and generate captions
  • Compare the performance of different models side by side
  • Visualize the attention mechanisms in action

Running the Demo

Run the following command:

uvicorn app.main:app --reload

Access the web interface at: http://127.0.0.1:8000

Project Structure

  • app/: Web application for demonstrating the models
  • models/: Implementation of different image captioning architectures
    • model_1_baseline_cnn_lstm/: Baseline ResNet-50 CNN with LSTM decoder
    • model_2_image_segmentation_lstm/: Mask R-CNN encoder with LSTM decoder
    • model_3_image_segmentation_attention_decoder/: Mask R-CNN encoder with attentive transformer decoder
    • model_4_vision_transformer/: ViT encoder with GPT-2 decoder
  • data/: Data handling and preprocessing utilities
  • metrics.py: Evaluation metrics implementation (BLEU, METEOR, CIDEr)
  • conda.yml: Environment configuration

Citation

If you use this code for your research, please cite our paper:

@article{lee2023optimizing,
  title={Optimizing Encoder-Decoder Architectures for Image Captioning Tasks: An Analysis of Transfer Learning and Attention Mechanisms},
  author={Lee, Jed Woon Kiat and Koh, Quan Wei Ivan},
  journal={},
  year={2023}
}
