SSD Implementation using PyTorch

Introduction

This repository contains an implementation of SSD using PyTorch. The VGG-16D backbone follows the paper Very Deep Convolutional Networks for Large-Scale Image Recognition by Karen Simonyan and Andrew Zisserman. Other references used to understand the original VGG-16D architecture: GFG VGG-16 Model, VGG and ImageNet.

What is SSD?

SSD (Single Shot MultiBox Detector) is a deep learning object detection framework that uses a single neural network to predict bounding boxes and class scores for multiple objects in one forward pass over an image. It is widely used in computer vision applications such as self-driving cars, face detection, and object tracking.

Implementation

VGG-16 is used as the backbone network for feature extraction. The resulting feature maps are passed through additional convolutional layers to produce the final output: a set of bounding boxes and class scores for each object in the image.
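A minimal sketch of this pipeline, assuming torchvision's vgg16 is used for the backbone; the input size, number of classes, and boxes per location below are illustrative assumptions, not the exact configuration of this repository:

```python
# Sketch: a VGG-16 backbone extracts feature maps, and a small convolutional
# head predicts box offsets and class scores at every feature-map location.
# num_classes and boxes_per_location are illustrative values only.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class SimpleSSDHead(nn.Module):
    def __init__(self, in_channels=512, num_classes=21, boxes_per_location=4):
        super().__init__()
        # One conv predicts 4 box offsets per default box, another predicts class scores.
        self.loc = nn.Conv2d(in_channels, boxes_per_location * 4, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(in_channels, boxes_per_location * num_classes, kernel_size=3, padding=1)

    def forward(self, feature_map):
        return self.loc(feature_map), self.cls(feature_map)

backbone = vgg16(weights=None).features      # VGG-16 convolutional layers only
head = SimpleSSDHead()

x = torch.randn(1, 3, 300, 300)              # SSD commonly uses 300x300 inputs
features = backbone(x)                       # -> (1, 512, 9, 9) feature map
loc_preds, cls_preds = head(features)
print(features.shape, loc_preds.shape, cls_preds.shape)
```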

The following has been used as a reference for the implementation of the VGG-16D model:

(Table: VGG-16D layer configuration from the original paper)

Here, Conv2D rather than Conv3D is used to extract features from the input image. The image is passed through the stack of convolutional layers to produce feature maps, which are then used to predict the bounding boxes and class scores described above.

For training on independent images the difference is small, since no temporal context is needed between separate images. For video training, however, Conv3D is better suited, since it can extract features from an input clip with temporal context across frames.

Conv2D vs. Conv3D

Reference : https://stats.stackexchange.com/questions/296679/what-does-kernel-size-mean

(Figures: illustrations of a Conv2D kernel and a Conv3D kernel)
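
The shape difference can be seen directly in PyTorch; a small sketch (the tensor sizes are arbitrary examples, not values used in this repository):

```python
# nn.Conv2d slides a kernel over (H, W); nn.Conv3d adds a depth/temporal
# dimension, which is what lets a video model share features across frames.
import torch
import torch.nn as nn

conv2d = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

image = torch.randn(1, 3, 224, 224)       # (batch, channels, H, W)
clip  = torch.randn(1, 3, 16, 224, 224)   # (batch, channels, frames, H, W)

print(conv2d(image).shape)  # torch.Size([1, 8, 224, 224])
print(conv3d(clip).shape)   # torch.Size([1, 8, 16, 224, 224])
```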
