
A B.Tech mini project that helps visually impaired people understand what is happening in an image by describing the image as audio.


harshwalia36/Audio-Description-of-Image-for-visually-impaired-person


Audio-Description-of-Image-for-visually-impaired-person

  • Version-1 - Open In Colab
  • Version-2 - Open In Colab

The idea behind the project is to convert image -> caption -> audio.

I implemented the image captioning with two approaches:

V1) A CNN-RNN (LSTM) architecture that converts the image into a caption. It achieved a loss of 2.689 on the Flickr 8K dataset.
V2) A CNN-RNN architecture with an attention mechanism, aimed at better accuracy. It used the larger MSCOCO dataset of 327437 sample images and achieved a loss of 1.625.

Simple CNN-RNN Architecture

image


How the second approach, using an attention mechanism, improved the model


image

MATHS Behind the Attention Mechanism

Local Attention: Because global attention focuses on all source-side words for every target word, it is computationally expensive and impractical for long sentences. To overcome this, local attention focuses on only a small subset of the encoder's hidden states per target word.
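
To make the windowing idea concrete, here is a minimal sketch of local attention weights in plain Python. It assumes a window half-width `D`, a centre position `p_t`, and Gaussian damping with sigma = D / 2, as in the formulation by Luong et al. The function name and the toy scores are illustrative and are not taken from this repository's notebooks.

```python
import math

def local_attention_weights(scores, p_t, D):
    """Softmax over the window [p_t - D, p_t + D], damped by a Gaussian centred at p_t.

    scores: one alignment score per encoder position (hypothetical values).
    Positions outside the window get weight 0. As in the original formulation,
    the weights are not renormalised after the Gaussian damping.
    """
    lo = max(0, int(p_t) - D)
    hi = min(len(scores) - 1, int(p_t) + D)
    window = range(lo, hi + 1)

    # Numerically stable softmax restricted to the window.
    m = max(scores[s] for s in window)
    exps = {s: math.exp(scores[s] - m) for s in window}
    total = sum(exps.values())

    sigma = D / 2
    weights = [0.0] * len(scores)
    for s in window:
        gauss = math.exp(-((s - p_t) ** 2) / (2 * sigma ** 2))
        weights[s] = exps[s] / total * gauss
    return weights

scores = [0.1, 0.4, 2.0, 1.5, 0.3, 0.2, 0.9, 0.0]
weights = local_attention_weights(scores, p_t=3.0, D=2)
attended = [s for s, w in enumerate(weights) if w > 0]
print(attended)  # only positions inside the window receive weight
```

Only 2D + 1 positions are scored per target word, which is what makes local attention cheaper than attending over every encoder state.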

Every location in a convolution layer's output corresponds to some location in the image, as shown below.

image

Taking an example

For example, the output of the 5th convolution layer of Inception is a 14 * 14 * 512 feature map. This layer has 14 * 14 = 196 spatial locations, each of which corresponds to a certain portion of the image. We can therefore treat the feature map as 196 locations, each with a 512-dimensional representation.
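
The flattening step can be sketched in plain Python as follows. The zero-filled feature map is only a placeholder for a real encoder output, and `grid_position` is a hypothetical helper added here to show how a flat location index maps back to the grid.

```python
H, W, C = 14, 14, 512  # shape quoted in the text above

# Stand-in feature map, indexed as fmap[row][col][channel].
# A real one would come from the CNN encoder; zeros are placeholders.
fmap = [[[0.0] * C for _ in range(W)] for _ in range(H)]

# Flatten the spatial grid row by row: 14 x 14 x 512 -> 196 x 512.
locations = [fmap[r][c] for r in range(H) for c in range(W)]

def grid_position(k, width=W):
    """Map a flat location index back to (row, col) in the feature map."""
    return divmod(k, width)

print(len(locations), len(locations[0]))
print(grid_position(0), grid_position(15), grid_position(195))
```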

The model then learns an attention distribution over these locations (which in turn correspond to actual locations in the image).
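
As a rough illustration, attention over location features can be sketched like this in plain Python. It assumes the "general" score named in this README (score = hidden^T W_a feature). The tiny 2-dimensional features, the identity `W_a`, and the function names are made up for the example; in a trained model `W_a` is learned and the features are the 196 location vectors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, hidden, W_a):
    """Soft attention with the 'general' score: score_i = hidden^T W_a feature_i.

    features: list of N location vectors, each of length d_f
    hidden:   decoder state of length d_h
    W_a:      d_h x d_f matrix (learned in a real model, fixed here)
    Returns (weights, context).
    """
    d_f = len(features[0])
    # Project the decoder state once: q = hidden^T W_a.
    q = [sum(hidden[i] * W_a[i][j] for i in range(len(hidden))) for j in range(d_f)]
    scores = [sum(qj * fj for qj, fj in zip(q, f)) for f in features]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the location features.
    context = [sum(w * f[j] for w, f in zip(weights, features)) for j in range(d_f)]
    return weights, context

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy "locations"
hidden = [2.0, 0.0]
W_a = [[1.0, 0.0], [0.0, 1.0]]  # identity, for illustration only
weights, context = attend(features, hidden, W_a)
print(weights)
```

The weights sum to 1, and locations whose features align with the decoder state receive more of the mass. The context vector is what the decoder consumes when predicting the next word.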

image

Let's look at the equations for local attention and global attention with the general score:

image

image
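
For reference, the standard formulation by Luong et al., where the terms "global attention", "local attention" and "general score" originate, can be written as below. The symbols follow that formulation: $h_t$ is the decoder state, $\bar{h}_s$ is the encoder state (here, an image location feature) at position $s$, $W_a$, $W_p$ and $v_p$ are learned parameters, $S$ is the number of source positions and $D$ is the window half-width. Whether the notebooks implement exactly this form is an assumption.

```latex
% General score and global attention
\mathrm{score}(h_t, \bar{h}_s) = h_t^{\top} W_a \bar{h}_s
\qquad
a_t(s) = \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}
              {\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)}
\qquad
c_t = \sum_{s} a_t(s)\, \bar{h}_s

% Local attention: predicted centre p_t, window [p_t - D, p_t + D]
p_t = S \cdot \mathrm{sigmoid}\big(v_p^{\top} \tanh(W_p h_t)\big)
\qquad
a_t(s) = \mathrm{align}(h_t, \bar{h}_s)\,
         \exp\!\left(-\frac{(s - p_t)^2}{2\sigma^2}\right),
\quad \sigma = \frac{D}{2}
```

Here $\mathrm{align}$ is the same softmax as in the global case, restricted to the positions inside the window.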

Some model predictions on the test dataset

These descriptions are converted to audio in the code.
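
The overall image -> caption -> audio flow can be sketched as below. This is not the notebooks' code: `describe_image`, `caption_fn` and `speak_fn` are hypothetical names, and the captioning model and the text-to-speech engine are passed in as plain callables so the sketch runs without either.

```python
import os

def describe_image(image_path, caption_fn, speak_fn):
    """Run the image -> caption -> audio pipeline.

    caption_fn: callable taking an image path and returning a caption string
                (stands in for the trained captioning model).
    speak_fn:   callable taking text and an output path and writing audio there
                (stands in for whichever text-to-speech library is used).
    Returns (caption, audio_path).
    """
    caption = caption_fn(image_path).strip()
    if not caption:
        caption = "no description available"
    audio_path = os.path.splitext(image_path)[0] + ".mp3"
    speak_fn(caption, audio_path)
    return caption, audio_path

# Placeholder implementations so the sketch runs without a model or TTS engine.
def fake_caption(path):
    return "display case filled with lots of different kinds of donuts"

spoken = []
def fake_speak(text, out_path):
    spoken.append((text, out_path))

caption, audio_path = describe_image("donuts.jpg", fake_caption, fake_speak)
print(caption, "->", audio_path)
```

In practice `speak_fn` could be a thin wrapper around a text-to-speech library, for example gTTS's `gTTS(text).save(path)`. Which library the notebooks actually use is not stated here.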

image pizza with pier and paper on it

image plane bear with sized zebras on it

image zebra standing next to car in batter

image woman in lamb group is holding skis

image display case filled with lots of different kinds of donuts


image wall mens truck is parked in grass


image black and white cat standing in his of patch phones
