
ZJU Artificial Intelligence Safety HW01


Introduction

This repository contains homework 1 for the Artificial Intelligence Safety course at Zhejiang University.

Instructor: Jie Song

Requirements:

  • Creating your own GitHub account.
  • Implementing your own deep neural network (in PyTorch, PaddlePaddle, ...).
  • Training it on CIFAR-10.
  • Tuning a hyper-parameter and analyzing its effects on performance.
  • Writing a README.md to report your findings.

Dataset

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
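For reference, below is a minimal sketch of loading CIFAR-10 with torchvision; the root path, batch size, and plain ToTensor() transform are illustrative (the transforms actually used in the experiments are described later).

import torch
import torchvision
import torchvision.transforms as transforms

# Plain ToTensor() here; the experiments below add normalization and augmentation.
transform = transforms.ToTensor()

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)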

Experimental notes

Experiment environment

  • CPU: E5-2630L v3
  • GPU: RTX3060
  • RAM: 30GB
  • VRAM: 12GB

Requirements

  • python3
  • pytorch
  • numpy
  • matplotlib
  • tqdm

Model

I use the model from the PyTorch tutorial for CIFAR-10. It contains two convolutional layers, one max-pooling layer (applied after each convolution), and three fully-connected layers.

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(nn.functional.relu(self.conv1(x)))
        x = self.pool(nn.functional.relu(self.conv2(x)))
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = nn.functional.relu(self.fc1(x))
        x = nn.functional.relu(self.fc2(x))
        x = self.fc3(x)
        return x

Training result

Cross entropy is chosen as the loss function for model training.

First, I used SGD (stochastic gradient descent) as the optimizer, with the learning rate set to 0.01 and the batch size set to 128, as in the sketch below.
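A minimal sketch of this training setup follows; it is not the notebook's exact code, and names such as train_loader, device, and num_epochs are assumptions.

import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Net().to(device)

criterion = nn.CrossEntropyLoss()                   # cross entropy as the loss function
optimizer = optim.SGD(model.parameters(), lr=0.01)  # plain SGD, learning rate 0.01

for epoch in range(num_epochs):                     # num_epochs: assumed to be defined
    model.train()
    for inputs, labels in train_loader:             # mini-batches of size 128
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()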

The test set is referred to as the validation set here.

The acc@1 on the test set is 67.04%.

The learning curve is shown below.

As the number of epochs increases, the training loss decreases continuously, but the test loss first decreases and then increases. Performance is best at about epoch 70: before that the model is under-fitting, and after that it is over-fitting. As training goes on, the effective complexity of the fitted model increases, so over-fitting easily occurs when the model is overtrained.

The number of epochs is itself a hyper-parameter; too many epochs may lead to over-fitting. Methods such as early stopping and weight decay can be adopted to avoid this situation, as sketched below.
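As an illustration, early stopping can be sketched as follows; this is not the notebook's code, and train_one_epoch and evaluate are assumed helpers that run one training epoch and return the validation loss, respectively.

best_val_loss = float("inf")
patience, bad_epochs = 10, 0                                    # patience value is an assumption

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer, criterion)  # assumed helper
    val_loss = evaluate(model, test_loader, criterion)          # assumed helper

    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")         # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break   # stop once validation loss has not improved for `patience` epochs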

Tuning

1. Learning rate

The initial learning rate was set to a very small value (1e-5) and then increased exponentially at each step (by a factor of 1.05). I trained for 236 steps (1.05^236 × 1e-5 ≈ 1.002), and we can see that the training loss first decreases and then increases, forming a hook-shaped curve.
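A sketch of this learning-rate range test is shown below (one mini-batch per step; the exact loop in the notebook may differ).

lr = 1e-5
optimizer = optim.SGD(model.parameters(), lr=lr)
lrs, losses = [], []

model.train()
for step, (inputs, labels) in enumerate(train_loader):
    if step >= 236:                                  # stop once lr has grown to about 1.0
        break
    inputs, labels = inputs.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()

    lrs.append(lr)
    losses.append(loss.item())

    lr *= 1.05                                       # exponential increase at every step
    for group in optimizer.param_groups:
        group["lr"] = lr

Plotting losses against lrs on a logarithmic x-axis gives the hook-shaped curve described above.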

I chose a learning rate of 0.1 and started training again (it is best to choose the learning rate at which the loss decreases fastest).

The acc@1 on the test set is 61.62%. The learning curve is shown below.

Compared with the results at a learning rate of 0.01, the convergence process is much faster: the optimum is reached in about 10 steps. However, the accuracy is lower.

A small learning rate slows down convergence, while a learning rate that is too large may prevent the model from reaching the optimum.

2. Momentum and weight decay

Momentum is a modification of gradient descent: every update step takes the past gradients into account, not just the current one.

Momentum can help escape critical points (saddle points or local minima) and accelerate convergence.

Weight decay corresponds to L2 regularization, which reduces the problem of over-fitting.
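Conceptually, the update rule with momentum and weight decay looks roughly like the simplified sketch below (it mirrors the spirit of torch.optim.SGD, not its exact implementation).

def sgd_step(p, grad, v, lr=0.1, momentum=0.9, weight_decay=5e-4):
    """One simplified SGD update for a single parameter tensor."""
    g = grad + weight_decay * p   # weight decay folds L2 regularization into the gradient
    v = momentum * v + g          # velocity: an accumulated sum of past gradients
    p = p - lr * v                # move along the accumulated direction
    return p, v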

I updated the optimizer parameters (momentum and weight_decay here are also hyper-parameters) and started training again.

optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE, momentum=0.9, weight_decay=5e-4)

The results are as follows. The test acc@1 is 61.45%.

Compared to the previous results, the convergence speed becomes faster (because of momentum), and the problem of over-fitting is also alleviated (because of weight decay).

3. Change optimizer

Then I changed the optimizer to Adam, which combines RMSprop and Momentum.
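Switching the optimizer is a one-line change. The learning rate shown below is only Adam's common default (1e-3); the value actually used in the notebook is not recorded here.

# Adam: adaptive per-parameter learning rates (RMSprop) plus momentum-style moment estimates.
optimizer = optim.Adam(model.parameters(), lr=1e-3)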

The acc@1 on the test set is 65.60%.

4. Batch size

I also tried different batch sizes.
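Only the DataLoader changes between these runs; below is a sketch, assuming the training loop and train_set from the earlier sketches are reused for each batch size.

for batch_size in (32, 64, 128, 256):
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size,
                                               shuffle=True, num_workers=2)
    # ...retrain the model from scratch and record acc@1 and the elapsed time...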

| Batch size | Acc@1  | Elapsed time |
|-----------:|-------:|-------------:|
| 32         | 64.77% | 1389s        |
| 64         | 64.76% | 845s         |
| 128        | 66.37% | 552s         |
| 256        | 64.54% | 491s         |

In theory, a small batch size is easier to optimize with and tends to generalize better, while a larger batch size makes each epoch take less time, so training is faster.

In this case, the batch size has little effect on the test accuracy; a possible reason is that all of the runs converge to similarly good solutions.

5. Make model more complex

It can be seen that tuning the optimizer brings only limited improvement. Training loss and test loss are both relatively high, which points to model bias: the model itself is not expressive enough.

Next, I used some more complex models.

First, I implemented my own CNN, mainly by increasing the output channels of the convolutional layers and adding batch normalization.

class MyCNN(nn.Module):
    def __init__(self):
        super(MyCNN, self).__init__()
        # The arguments for commonly used modules:
        # torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
        # torch.nn.MaxPool2d(kernel_size, stride, padding)

        # input image size: [3, 32, 32]
        self.cnn_layers = nn.Sequential(
            nn.Conv2d(3, 64, 3, 1, 1),  # input: 32*32*3, output: 32*32*64
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2, 2, 0),  # input: 32*32*64, output: 16*16*64

            nn.Conv2d(64, 128, 3, 1, 1),  # input: 16*16*64, output: 16*16*128
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2, 2, 0),  # input: 16*16*128, output: 8*8*128
        )
        self.fc_layers = nn.Sequential(
            nn.Linear(128 * 8 * 8, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        # input (x): [batch_size, 3, 32, 32]
        # output: [batch_size, 10]

        # Extract features by convolutional layers.
        x = self.cnn_layers(x)

        # The extracted feature map must be flattened before going to the fully-connected layers.
        x = x.flatten(1)

        # The features are transformed by fully-connected layers to obtain the final logits.
        x = self.fc_layers(x)
        return x

The acc@1 on the test set is 76.37%, better than before.

I continued to increase the complexity of the model.

class MyCNN(nn.Module):
    def __init__(self):
        super(MyCNN, self).__init__()
        # The arguments for commonly used modules:
        # torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
        # torch.nn.MaxPool2d(kernel_size, stride, padding)

        # input image size: [3, 32, 32]
        self.cnn_layers = nn.Sequential(
            nn.Conv2d(3, 64, 3, 1, 1),  # input: 32*32*3, output: 32*32*64
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2, 2, 0),  # input: 32*32*64, output: 16*16*64

            nn.Conv2d(64, 128, 3, 1, 1),  # input: 16*16*64, output: 16*16*128
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2, 2, 0),  # input: 16*16*128, output: 8*8*128
            
            nn.Conv2d(128, 256, 3, 1, 1),  # input: 8*8*128, output: 8*8*256
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(2, 2, 0),  # output: 4*4*256
        )
        self.fc_layers = nn.Sequential(
            nn.Linear(256 * 4 * 4, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        # input (x): [batch_size, 3, 32, 32]
        # output: [batch_size, 10]

        # Extract features by convolutional layers.
        x = self.cnn_layers(x)

        # The extracted feature map must be flattened before going to the fully-connected layers.
        x = x.flatten(1)

        # The features are transformed by fully-connected layers to obtain the final logits.
        x = self.fc_layers(x)
        return x

The accuracy improved to 80.66%.

I also tried other CNN models.

  • ResNet-18

  • ResNet-50

  • ResNet-152

  • VGG-19 (failed)

  • VGG-19 with batch normalization

  • Densenet-121

  • Densenet-201

| Model          | Acc@1  | Time   |
|----------------|-------:|-------:|
| My CNN         | 80.66% | 679s   |
| ResNet-18      | 76.64% | 1766s  |
| ResNet-50      | 78.59% | 3859s  |
| ResNet-152     | 77.95% | 8140s  |
| VGG-19 with BN | 85.43% | 4990s  |
| Densenet-121   | 80.03% | 7273s  |
| Densenet-201   | 79.73% | 10902s |

I didn't change the optimizer or its hyper-parameters across these models, so the comparison is not strictly controlled.
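For reference, the off-the-shelf architectures can be instantiated from torchvision with the CIFAR-10 class count, as sketched below; the notebook may build them differently (e.g. adapting the first convolution for 32x32 inputs).

import torchvision.models as models

resnet18 = models.resnet18(num_classes=10)
resnet50 = models.resnet50(num_classes=10)
vgg19_bn = models.vgg19_bn(num_classes=10)
densenet121 = models.densenet121(num_classes=10)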

6. Data augmentation

The loss on training data is small but the loss on testing data is large, which probably indicates over-fitting.

I used data augmentation to alleviate this problem.

import torchvision.transforms as transforms

transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                         std=[0.2023, 0.1994, 0.2010]),
    ])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                         std=[0.2023, 0.1994, 0.2010]),
    ])
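These transforms are then passed to the training and test datasets separately, for example (a sketch, reusing the torchvision loaders from earlier):

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform_train)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=transform_test)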
With these transforms applied, I retrained the same set of models:

  • My CNN

  • ResNet-18

  • ResNet-50

  • ResNet-152

  • VGG-19 (failed)

  • VGG-19 with batch normalization

  • Densenet-121

  • Densenet-201

| Model          | Acc@1  | Time   |
|----------------|-------:|-------:|
| My CNN         | 86.81% | 692s   |
| ResNet-18      | 84.90% | 3597s  |
| ResNet-50      | 85.67% | 7803s  |
| ResNet-152     | 86.02% | 19065s |
| VGG-19 with BN | 89.49% | 10036s |
| Densenet-121   | 85.86% | 13772s |
| Densenet-201   | 86.44% | 22602s |

Again, I didn't change the optimizer or its hyper-parameters across these models, so the comparison is not strictly controlled.

We can see that the test acc@1 improves for every model, so data augmentation does alleviate the problem of over-fitting.

For more details, see the ZJU Artificial Intelligence Safety HW01.ipynb

