This repository contains a PyTorch implementation of SSD (Single Shot MultiBox Detector). The VGG-16 backbone follows the paper Very Deep Convolutional Networks for Large-Scale Image Recognition by Karen Simonyan and Andrew Zisserman. Other references used to understand the original VGG-16D architecture: GFG VGG-16 Model, VGG and ImageNet.
SSD (Single Shot MultiBox Detector) is a deep learning object detection framework that uses a single neural network to predict bounding boxes and class scores for multiple objects in one forward pass, rather than relying on a separate region-proposal stage. It is widely used in computer vision applications such as self-driving cars, face detection, and object tracking.
The VGG-16 backbone network is used for feature extraction. Its feature maps are passed through additional convolutional layers, and prediction heads turn the resulting maps into the final output: a set of bounding boxes and class scores for each detected object.
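As a rough sketch of that pipeline (the extra-layer sizes, number of classes, and default boxes below are illustrative assumptions, not this repo's actual configuration):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

# Illustrative values only, not this repo's configuration.
num_classes = 21        # e.g. 20 VOC classes + background (assumption)
num_default_boxes = 4   # default boxes per feature-map cell (assumption)

# VGG-16 conv layers as the backbone (torchvision >= 0.13 API).
backbone = vgg16(weights=None).features

# Extra conv layers appended after the backbone (sizes are illustrative).
extras = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),
    nn.ReLU(inplace=True),
)

# Per-location predictions: 4 box offsets and num_classes scores per box.
loc_head = nn.Conv2d(512, num_default_boxes * 4, kernel_size=3, padding=1)
cls_head = nn.Conv2d(512, num_default_boxes * num_classes, kernel_size=3, padding=1)

x = torch.randn(1, 3, 300, 300)          # SSD300-style input
feats = extras(backbone(x))
loc, cls = loc_head(feats), cls_head(feats)
print(loc.shape, cls.shape)              # box offsets and class scores
```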
The following notes and reference were used when implementing the VGG-16D model.

Here, Conv2D rather than Conv3D is used to extract features from the input image: the image passes through a stack of 2D convolutional layers that produce the feature maps consumed by the detection head described above.
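For example, a single VGG-style Conv2D block (channel sizes mirror the first VGG-16 stage; a sketch, not the repo's exact module):

```python
import torch
import torch.nn as nn

# One VGG-style stage: two 3x3 Conv2D layers followed by 2x2 max pooling.
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

image = torch.randn(1, 3, 224, 224)  # (N, C, H, W): one RGB image, no time axis
feature_map = block(image)
print(feature_map.shape)             # torch.Size([1, 64, 112, 112])
```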
For training on still images the choice makes little difference, since independent images carry no temporal context. For training on video, however, Conv3D can be more effective, because it extracts features across frames and so captures temporal context that per-frame Conv2D cannot.
Reference: https://stats.stackexchange.com/questions/296679/what-does-kernel-size-mean
| Conv2D | Conv3D |
| --- | --- |
| Kernel slides over two spatial dimensions (H, W) | Kernel slides over three dimensions (D, H, W) |
| Input shape `(N, C, H, W)`, e.g. single images | Input shape `(N, C, D, H, W)`, e.g. video clips |
| No temporal context between frames | Captures temporal context across frames |
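As a concrete illustration of the shape difference (channel counts are arbitrary; a sketch, not this repo's layers):

```python
import torch
import torch.nn as nn

# Conv2D slides its kernel over (H, W); Conv3D slides over (D, H, W),
# where D is the temporal (frame) axis for video.
conv2d = nn.Conv2d(3, 16, kernel_size=3, padding=1)
conv3d = nn.Conv3d(3, 16, kernel_size=3, padding=1)

image = torch.randn(1, 3, 224, 224)    # (N, C, H, W): one RGB image
clip = torch.randn(1, 3, 8, 224, 224)  # (N, C, D, H, W): an 8-frame clip

print(conv2d(image).shape)  # torch.Size([1, 16, 224, 224])
print(conv3d(clip).shape)   # torch.Size([1, 16, 8, 224, 224])
```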