Author: Erick Escobar Gallardo
Email: [email protected]
Date: 23/06/21
The project consists of the implementation of an image classifier using deep convolutional neural networks (CNNs).
Use the package manager pip to install all the requirements by executing the following command:
pip install -r requirements.txt
To change the training and validation constants or the directories, edit the utils.py file.
The execution of the project can be divided into the following phases:
- Modification of the directory paths in utils.py accordingly.
- Execution of create_dataset.py to create a smaller sample of the dataset and store it in ./data. This script splits the sample into training and testing parts, each with its own dataframe.
- Modification of the training settings in utils.py to set up the CNN architecture, the number of epochs, etc.
- Execution of training_job.sh using the command qsub training_job.sh to start the training of the CNN model. The training job creates a checkpoint inside the folder .\checkpoints.
- Execution of evaluation_job.sh using the command qsub evaluation_job.sh to run the evaluation pipeline.
mpiexec -n 2 python -m mpi4py main.py
checkpoints: C:\Users\erick.cache\torch\hub\checkpoints
We used several predefined PyTorch computer vision architectures, including resnet18, resnet34, alexnet, vgg, squeezenet, densenet, and inception.
PyTorch's own CPU thread parallelism is disabled using 'torch.set_num_threads(1)', so that parallelism comes only from the MPI processes. A structured training procedure is defined for this task. To reduce training time, set the constant DEBUG to True; training then uses only a sample of the original training dataset for the selected CNN architecture.
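The DEBUG switch described above could be sketched roughly as follows. This is only an illustration: the helper name maybe_subsample, the fraction of 10%, and the seed are assumptions, not values taken from utils.py.

```python
import random

DEBUG = True  # assumed to mirror the DEBUG constant in utils.py


def maybe_subsample(samples, debug, fraction=0.1, seed=0):
    """Return a random subset of `samples` when `debug` is True, else all of them.

    `fraction` and `seed` are hypothetical parameters chosen for this sketch.
    """
    samples = list(samples)
    if not debug or not samples:
        return samples
    rng = random.Random(seed)  # fixed seed keeps debug runs reproducible
    k = min(len(samples), max(1, int(len(samples) * fraction)))
    return rng.sample(samples, k)


if __name__ == "__main__":
    train_paths = [f"img_{i}.jpg" for i in range(100)]  # placeholder file names
    subset = maybe_subsample(train_paths, DEBUG)
    print(len(subset), "of", len(train_paths), "samples used for training")
```

Keeping the sampling in one helper means the rest of the training code does not need to know whether it is running on the full dataset or on a debug sample.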
To distribute the training process, we first scatter the dataset to all the nodes using MPI.Scatter, so that the dataset is split equally among all the processing nodes.
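A minimal sketch of this scatter step with mpi4py is shown below. It is not the project's actual code: the helper names split_evenly and scatter_dataset are hypothetical, and the sketch uses the lowercase comm.scatter (which pickles arbitrary Python objects) instead of the buffer-based Scatter, on the assumption that the dataset is a list of samples such as file paths.

```python
def split_evenly(items, n_parts):
    """Split `items` into `n_parts` contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def scatter_dataset(items, root=0):
    """Give every MPI process its own chunk of `items` (only the root needs the data)."""
    from mpi4py import MPI  # imported lazily so split_evenly works without MPI

    comm = MPI.COMM_WORLD
    chunks = None
    if comm.Get_rank() == root:
        chunks = split_evenly(items, comm.Get_size())
    # Each process receives exactly one element of `chunks`.
    return comm.scatter(chunks, root=root)


if __name__ == "__main__":
    print(split_evenly(list(range(10)), 3))
```

When launched with mpiexec, each process would call scatter_dataset and then train only on the chunk it receives.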
The distributed training is done using the MPI Allreduce method, which reduces (applies a SUM operation to) the gradients of all processes. Each process then averages the sum by dividing it by the total number of processes.
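The gradient averaging can be sketched as below. The first function is a pure-Python stand-in that shows the arithmetic; the second is one possible mpi4py and PyTorch version. Both function names are hypothetical, and the second assumes CPU tensors and that it is called after loss.backward() and before optimizer.step().

```python
def simulate_allreduce_mean(per_rank_grads):
    """Pure-Python illustration: element-wise SUM across ranks, then divide by world size."""
    world_size = len(per_rank_grads)
    summed = [sum(values) for values in zip(*per_rank_grads)]
    return [s / world_size for s in summed]


def average_gradients(model, comm):
    """Average the gradients of `model` across all processes in `comm` (CPU tensors assumed)."""
    # Heavy dependencies are imported lazily so the illustration above runs without them.
    import numpy as np
    import torch
    from mpi4py import MPI

    world_size = comm.Get_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        send = np.ascontiguousarray(param.grad.detach().cpu().numpy())
        recv = np.empty_like(send)
        comm.Allreduce(send, recv, op=MPI.SUM)  # every rank receives the summed gradient
        param.grad.copy_(torch.from_numpy(recv / world_size))


if __name__ == "__main__":
    # Two simulated processes, each holding a gradient vector of length 2.
    print(simulate_allreduce_mean([[1.0, 2.0], [3.0, 4.0]]))
```

Because every process ends up with the same averaged gradients, each local optimizer step keeps the model replicas identical, provided they started from the same weights.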
For the testing procedure we use a simple pipeline that reads an image, resizes it, preprocesses (normalizes) it, and feeds the image tensor to the model. The pipeline takes the total number of processes into account: the first 3 processes handle the first 3 tasks, and the remaining processes are in charge of the model prediction.
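The assignment of processes to pipeline stages could look roughly like this. The stage names and the function stage_for_rank are assumptions made for this sketch; only the split into three preprocessing ranks plus prediction ranks comes from the description above.

```python
# Hypothetical names for the three preprocessing stages, in pipeline order.
STAGES = ("read", "resize", "normalize")


def stage_for_rank(rank, world_size):
    """Map an MPI rank to its pipeline stage: ranks 0-2 preprocess, the rest predict."""
    if world_size <= len(STAGES):
        raise ValueError("the pipeline needs at least 4 processes")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in the range [0, world_size)")
    return STAGES[rank] if rank < len(STAGES) else "predict"


if __name__ == "__main__":
    world_size = 5
    for rank in range(world_size):
        print(rank, stage_for_rank(rank, world_size))
```

With this mapping, each rank would receive its input from the previous stage and pass its output on to the next one, and launching more processes only adds prediction workers.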
IMPORTANT: Remember to first start the training process for an architecture in order to create a checkpoint that will be used for the pipeline evaluation process.
For this task we used 2 different CNN architectures, each trained for 10 epochs.
Model name | Validation score | Testing Score |
---|---|---|
Resnet 34 | 0.626262 | 0.1123 |
Resnet 18 | 0.8862 | 0.1962 |
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
Link to GitHub: https://github.com/erick093/MPI_Pytorch