A deep learning project that automatically generates natural language descriptions for images using various neural network architectures. This project is the implementation of the research paper "Optimizing Encoder-Decoder Architectures for Image Captioning Tasks: An Analysis of Transfer Learning and Attention Mechanisms."
This project implements and compares multiple image captioning models, systematically analyzing the impact of transfer learning and attention mechanisms on encoder-decoder architectures:
- Baseline CNN-LSTM Model: A traditional encoder-decoder architecture using a pre-trained ResNet-50 CNN for image feature extraction and an LSTM for caption generation (a minimal sketch of this architecture follows the list).
- Two-Stage Object Detection Encoder with LSTM Decoder: An enhanced model that incorporates Mask R-CNN (with a ResNet-50-FPN backbone) for detailed feature extraction through object detection, while retaining the LSTM decoder.
- Two-Stage Object Detection Encoder with Attentive Transformer Decoder: A further improvement that keeps the Mask R-CNN encoder but replaces the LSTM with a Transformer-based decoder equipped with cross-attention, enabling more context-aware caption generation.
- Vision Transformer (ViT) Encoder with GPT-2 Decoder: A modern approach using a Vision Transformer (ViT) encoder and a GPT-2 decoder, leveraging large-scale transformer architectures to capture comprehensive visual and contextual information.
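For orientation, here is a minimal, illustrative sketch of how the baseline CNN-LSTM encoder-decoder can be wired up in PyTorch. Class names, embedding sizes, and other details are placeholders and may differ from the actual code in `models/model_1_baseline_cnn_lstm/`.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """ResNet-50 encoder: frozen backbone, trainable linear projection."""

    def __init__(self, embed_size: int):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the classification head, keep everything up to global average pooling.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images).flatten(1)   # (B, 2048)
        return self.fc(feats)                      # (B, embed_size)


class DecoderLSTM(nn.Module):
    """LSTM decoder conditioned on the image embedding as the first input step."""

    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, img_emb: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        tok_emb = self.embed(captions)                            # (B, T, E)
        inputs = torch.cat([img_emb.unsqueeze(1), tok_emb], dim=1)  # prepend image
        out, _ = self.lstm(inputs)
        return self.fc(out)                                       # (B, T+1, vocab)
```

The attention-based and transformer variants replace either the encoder features (region-level Mask R-CNN features) or the decoder (cross-attention Transformer, GPT-2) while keeping this overall encoder-decoder shape.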
Our key findings include:
- Progressive performance improvements from the baseline CNN-LSTM to the ViT-GPT-2 architecture
- The significant impact of attention mechanisms in both encoder and decoder components
- Challenges in transferring knowledge from large, general pre-training datasets to smaller, specialized datasets
- Training instabilities encountered with large transformer-based models
The models are evaluated using standard metrics for image captioning:
- BLEU (Bilingual Evaluation Understudy)
- METEOR (Metric for Evaluation of Translation with Explicit ORdering)
- CIDEr (Consensus-based Image Description Evaluation)
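As a quick illustration, sentence-level BLEU can be computed with NLTK as shown below. This snippet is only a sketch; the project's own implementation lives in `metrics.py` and may differ in tokenization and smoothing choices.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# One hypothesis caption and its reference captions, pre-tokenized.
references = [
    "a dog runs across the grassy field".split(),
    "a brown dog is running on the grass".split(),
]
hypothesis = "a dog is running on the grass".split()

# BLEU-4 with uniform n-gram weights and smoothing for short captions.
score = sentence_bleu(
    references,
    hypothesis,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {score:.3f}")
```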
The project supports two datasets:
- Flickr8k: A smaller dataset with 8,000 images and five human-written captions per image
- Flickr30k: A larger dataset with roughly 30,000 images, also with five captions per image
The datasets are split into training, validation, and testing sets. Images are preprocessed with resizing, cropping, and data augmentation techniques including random horizontal flipping, rotation, and color jittering.
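The exact preprocessing pipeline lives in `data/`; as a rough illustration, the augmentations described above typically correspond to a `torchvision` transform like the one below (the parameter values here are placeholders, not the project's exact settings).

```python
from torchvision import transforms

# Training-time preprocessing: resize, crop, augment, then normalize.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    # ImageNet statistics, matching the pre-trained encoders.
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```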
Virtual environment setup:

```bash
conda env create -f conda.yml
conda activate image-captioning-project
```

If you already have the conda environment, update it with:

```bash
conda env update --file conda.yml --prune
```

Install the pre-commit hook:

```bash
pre-commit install
```

To download the Flickr8k dataset:

```bash
sh download_flickr8k.sh
```

To download the Flickr30k dataset:

```bash
sh download_flickr30k.sh
```
The project includes a web-based demo application that allows you to:
- View captions generated by different models for sample images
- Upload your own images and generate captions
- Compare the performance of different models side by side
- Visualize the attention mechanisms in action
Run the following command:

```bash
uvicorn app.main:app --reload
```

Then access the web interface at http://127.0.0.1:8000.
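Under the hood, the demo is a FastAPI application served by `uvicorn`. The sketch below shows what a caption endpoint can look like; the route name and the `generate_caption` helper are hypothetical stand-ins, not the actual code in `app/main.py`.

```python
import io

from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()


def generate_caption(image: Image.Image) -> str:
    # Placeholder: the real demo would run one of the trained captioning models here.
    return "a placeholder caption"


@app.post("/caption")
async def caption(image: UploadFile = File(...)):
    # Decode the uploaded file into a PIL image and run the captioning model.
    pil_image = Image.open(io.BytesIO(await image.read())).convert("RGB")
    return {"caption": generate_caption(pil_image)}
```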
Project structure:

- `app/`: Web application for demonstrating the models
- `models/`: Implementation of the different image captioning architectures
  - `model_1_baseline_cnn_lstm/`: Baseline ResNet-50 CNN with LSTM decoder
  - `model_2_image_segmentation_lstm/`: Mask R-CNN encoder with LSTM decoder
  - `model_3_image_segmentation_attention_decoder/`: Mask R-CNN encoder with attentive transformer decoder
  - `model_4_vision_transformer/`: ViT encoder with GPT-2 decoder
- `data/`: Data handling and preprocessing utilities
- `metrics.py`: Evaluation metrics implementation (BLEU, METEOR, CIDEr)
- `conda.yml`: Environment configuration
If you use this code for your research, please cite our paper:
```bibtex
@article{lee2023optimizing,
  title={Optimizing Encoder-Decoder Architectures for Image Captioning Tasks: An Analysis of Transfer Learning and Attention Mechanisms},
  author={Lee, Jed Woon Kiat and Koh, Quan Wei Ivan},
  journal={},
  year={2023}
}
```