
Activation Problem #26

Open
BilgehanSel opened this issue Jul 31, 2018 · 14 comments

@BilgehanSel

Sorry to bother you if I'm wrong, but it seems like there is no activation between the convolution layers...

@tbennun
Owner

tbennun commented Aug 1, 2018

You seem to be right. If you add it, I'll be happy to accept a pull request. Otherwise, I currently don't have time to work on this, but maybe later this year.

@ghost

ghost commented Aug 22, 2018

But it does not really differ from the results I got before without these activations.
I put the activations directly after the pooling layers. Should they maybe be somewhere else?
What would be good alpha + beta values for them?

Btw, your example code is great!

@tbennun
Owner

tbennun commented Aug 22, 2018

Thanks, and thank you for trying to fix it. 👍 1.8% is much better than 8% error.

As far as I know, the activation has to be performed after the convolution, not the pooling layers. In the Tensorflow tutorial, they also change the learning rate to 0.001 and add a Dropout layer after the first dense (fully-connected) layer. Even with that, they achieve 97.3% test accuracy: https://www.tensorflow.org/tutorials/estimators/cnn
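
For reference, adding the activation there could look roughly like this with cuDNN (just a sketch; checkCUDNN and the descriptor/buffer names are placeholders, not necessarily the sample's actual variables):

cudnnActivationDescriptor_t activationDesc;
checkCUDNN(cudnnCreateActivationDescriptor(&activationDesc));
checkCUDNN(cudnnSetActivationDescriptor(activationDesc, CUDNN_ACTIVATION_RELU,
                                        CUDNN_PROPAGATE_NAN, 0.0));

// In-place ReLU on the convolution output (after the bias has been added, before pooling).
const float alpha = 1.0f, beta = 0.0f;
checkCUDNN(cudnnActivationForward(cudnnHandle, activationDesc,
                                  &alpha, conv1TensorDesc, d_conv1,
                                  &beta, conv1TensorDesc, d_conv1));

That also answers the alpha/beta question above: they are just cuDNN's blending factors for the call, so 1 and 0 are the usual choice.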

It should be possible to get 0.8% error with this network, but it may also require changing the optimizer from standard SGD to Momentum SGD or Adam.
Since I didn't want to make the example complicated, I left the optimizer as simple as possible. What do you think?

@ghost

ghost commented Aug 22, 2018

Thanks for the information and the link to the tutorial. Ok, so I will post some new code which handles the activation after the convolution bias layer. In the tutorial there is no bias used.

On this page http://cs231n.github.io/neural-networks-1/ I found the information that the bias is added first and then the activation function is applied. So I applied the activation directly before the pooling (instead of after the pooling), and also added a DropOut layer.
The 1.8% error came from running 10000 iterations instead of 1000 iterations, also without the activation function.

It's great to have a simple example, but some optional improvements (separated by #ifdef .. #endif) would be great too, so that one knows what to change to move from SGD to Nesterov's Accelerated Momentum (NAG).
I'm not sure whether I understood it right: the "UpdateWeights" function is the "optimizer", right? So for Nesterov I would have to replace all of the cublasSaxpy(..) calls with some math operation (using a CUDA kernel)?

NAG (found on http://cs231n.github.io/neural-networks-3/), with momentum mu = 0.9:

    v_prev = v                        # back this up
    v = mu * v - learning_rate * dx   # velocity update stays the same
    x += -mu * v_prev + (1 + mu) * v  # position update changes form

Momentum SGD:

    v[i] = mu*v[i] + learning_rate * gradient[i]
    weights[i] += v[i]

I already implemented Momentum SGD with mu = 0.9 and it gives similar results, but learning_rate must now be lower; in my implementation it gives a high error at a learning rate of 0.01.

@ghost

ghost commented Aug 22, 2018

EDIT: I added a pull request instead, so I removed the code here.

Training dataset size: 60000, Test dataset size: 10000
Batch size: 32, iterations: 20000
Classification result: 1.51% error (used 10000 images)

Training dataset size: 60000, Test dataset size: 10000
Batch size: 32, iterations: 100000
Classification result: 0.98% error (used 10000 images)

Training dataset size: 60000, Test dataset size: 10000
Batch size: 32, iterations: 200000
Classification result: 0.91% error (used 10000 images)

So faster convergence with Nesterov's Accelerated Momentum would really help to reduce the high number of iterations!

@ghost

ghost commented Aug 23, 2018

And I applied the DropOut Layer (directly after the FullyConnected1 Layer) using this code:
https://devtalk.nvidia.com/default/topic/1028240/cudnn/how-to-implement-a-dropout-layer-using-cudnn-/
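
Roughly, that setup boils down to something like the following (a sketch only; checkCUDNN/checkCudaErrors, cudnnHandle, fc1TensorDesc, d_fc1relu and d_fc1drop are placeholder names, and d_fc1drop is a separate preallocated output buffer of the same size as the input):

cudnnDropoutDescriptor_t dropoutDesc;
size_t dropoutStateSize, dropoutReserveSize;
void *d_dropoutStates, *d_dropoutReserve;

checkCUDNN(cudnnCreateDropoutDescriptor(&dropoutDesc));
checkCUDNN(cudnnDropoutGetStatesSize(cudnnHandle, &dropoutStateSize));
checkCUDNN(cudnnDropoutGetReserveSpaceSize(fc1TensorDesc, &dropoutReserveSize));
checkCudaErrors(cudaMalloc(&d_dropoutStates, dropoutStateSize));
checkCudaErrors(cudaMalloc(&d_dropoutReserve, dropoutReserveSize));
checkCUDNN(cudnnSetDropoutDescriptor(dropoutDesc, cudnnHandle, 0.4f /* drop rate */,
                                     d_dropoutStates, dropoutStateSize, /*seed=*/123456ULL));

// Training forward pass only (skip dropout, or use a rate of 0, at test time):
checkCUDNN(cudnnDropoutForward(cudnnHandle, dropoutDesc,
                               fc1TensorDesc, d_fc1relu,    // input
                               fc1TensorDesc, d_fc1drop,    // output
                               d_dropoutReserve, dropoutReserveSize));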

Training dataset size: 60000, Test dataset size: 10000 Batch size: 32 DropOut Rate = 0.4
iterations: 500000 Classification result: 0.84% error (used 10000 images)
iterations: 200000 Classification result: 0.86% error (used 10000 images)
iterations: 100000 Classification result: 0.93% error (used 10000 images)
iterations: 10000 Classification result: 1.72% error (used 10000 images)

Changing the learning rate to 0.001 did not work for me; the error was even increasing.
So next I will try to change the SGD to Nesterov's Accelerated Momentum.

@ghost

ghost commented Aug 23, 2018

I finally got NAG + SGD Momentum working.

NesterovMomentumWeightUpdate Momentum=0.9 Learning Rate: 0.001
Training dataset size: 60000, Test dataset size: 10000 Batch size: 32,
LEARNING_RATE_POLICY_GAMMA 0.0001
LEARNING_RATE_POLICY_POWER 0.75
iterations: 100000 Classification result: 0.80% error (used 10000 images)
iterations: 20000 Classification result: 1.32% error (used 10000 images)
iterations: 10000 Classification result: 1.86% error (used 10000 images)
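
(For reference, LEARNING_RATE_POLICY_GAMMA and LEARNING_RATE_POLICY_POWER suggest a Caffe-style "inv" learning-rate schedule; assuming that, the effective rate at iteration iter would be roughly:

    effective_lr = LEARNING_RATE * powf(1.0f + LEARNING_RATE_POLICY_GAMMA * iter,
                                        -LEARNING_RATE_POLICY_POWER);

so gamma and power only control how quickly the base rate decays over the run.)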

And I noticed this:
NesterovMomentumWeightUpdate Momentum=0.9
Training dataset size: 60000, Test dataset size: 10000 Batch size: 32,
LEARNING_RATE 0.005
LEARNING_RATE_POLICY_GAMMA 0.00001
LEARNING_RATE_POLICY_POWER 0.8
iterations: 1000 Classification result: 3.43% error (used 10000 images)
iterations: 10000 Classification result: 88.65% error (used 10000 images)
iterations: 20000 Classification result: 88.20% error (used 10000 images)

Any idea why this happens?

CUDA kernel:

__global__ void NesterovMomentumWeightUpdate(float *weights, float *gradients, float *v,
                                             float learning_rate, int size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= size)
        return;

    const float MomentumUpdate = 0.9f;

    // Nesterov's Accelerated Momentum.
    // Note: + learning_rate here (instead of -) because the gradients are already negated.
    float v_prev = v[idx];
    v[idx] = MomentumUpdate * v[idx] + learning_rate * gradients[idx];
    weights[idx] += -MomentumUpdate * v_prev + (1.0f + MomentumUpdate) * v[idx];

#if 0  // TEST ONLY: SGD with momentum (disable the Nesterov update above when enabling this)
    v[idx] = MomentumUpdate * v[idx] + learning_rate * gradients[idx];
    weights[idx] += v[idx];
#endif

#if 0  // TEST ONLY: pure SGD (same as calling cublasSaxpy(...))
    weights[idx] += learning_rate * gradients[idx];
#endif
}
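
For completeness, a kernel like this could replace the corresponding cublasSaxpy(...) call in UpdateWeights roughly as follows (a sketch only; the block size and the d_* pointer names are placeholders, and d_velocity is an extra zero-initialized device buffer of the same size as the weights):

const int BLOCK_SIZE = 256;
int gridSize = (size + BLOCK_SIZE - 1) / BLOCK_SIZE;
NesterovMomentumWeightUpdate<<<gridSize, BLOCK_SIZE>>>(d_weights, d_gradient, d_velocity,
                                                       learning_rate, size);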

@tbennun
Owner

tbennun commented Aug 23, 2018

I don't know why the gradients explode, and I can't go over your code in this form. It's hard to read without a proper diff and hard to test when I need to apply your changes. Please create a pull request.

To create a pull request, first fork the repository through GitHub, then commit your changes and push them to your fork. At that point, if you browse to your version of the repo, GitHub will ask whether you want to create a pull request. If not, you can still go to my version, click "Pull Requests", and create a new one from there.
Please refer to the official guide for more information: https://help.github.com/articles/creating-a-pull-request/

@ghost

ghost commented Aug 23, 2018

I opened a pull request. Only the lenet.cu commits (ReLU, Nesterov, DropOut, Adam) are the ones I wanted to submit, but all other commits seem to be in the pull request too, so please ignore the others. I do not use the git command line, so I cannot change the pull request.

@BilgehanSel
Author

I know that this repository is about showing the features of the CUDNN library, but still, an OOP style is needed...
Here is my version, which is easier to understand since the layers are divided into their own classes.
https://github.com/BilgehanSel/SelCNN

@ghost

ghost commented Sep 9, 2018

@BilgehanSel
However, in your SoftmaxLossBackprop() function you handle this differently than the original code. Why? Do you get better results than 0.80% error with that code on the MNIST dataset in fewer than 10000 iterations?
You apply softmax twice: once on the last FullyConnectedLayer and once on the output. The original code does it only on the last fully-connected layer.
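
For reference, if softmax is applied only once (on the last fully-connected layer's output), the backward pass of softmax + cross-entropy reduces to "probabilities minus one-hot labels". A minimal sketch of that gradient kernel (not either repository's exact code):

// Gradient of softmax + cross-entropy: dL/dz = p - y, where p is the softmax
// output and y is the one-hot label. `diff` already holds the probabilities p.
__global__ void SoftmaxCrossEntropyBackprop(const float *labels, int num_classes,
                                            int batch_size, float *diff)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= batch_size)
        return;

    // Subtract 1 from the probability of the true class; the other entries
    // of p - y are just the probabilities themselves.
    int label = static_cast<int>(labels[idx]);
    diff[idx * num_classes + label] -= 1.0f;
}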

@tbennun
Owner

tbennun commented Sep 9, 2018

@BilgehanSel @3DdeepAI While these are both excellent examples of how to train with CUDNN, in my opinion they're missing the point of this sample. SelCNN actually starts looking like Caffe in its early days, and this is what I wanted to avoid with this repository. I wanted to create a concise, clear example of how CUDNN can be used for training, in one file. Supporting all the bells and whistles that come along is what frameworks are for.

This is also why I have not yet merged the PR as is. I think the activation part should be added, but all the extra stuff is making the sample too heavy IMO. Unfortunately I'm too busy to do it right now, but when I have time, I'll take parts of that PR and integrate them, if that's OK.

@ghost

ghost commented Sep 9, 2018

@tbennun OK, you're right, a simple sample should not have all the other stuff. So I closed the PRs and created a new one with ReLU activations only.
The latest commit there contains all the necessary changes:
https://github.com/3DdeepAI/3DdeepAI/commit/d7764241ba357ca0ec581fa726fa75291e97017c#diff-8526a070794ac85f9da83e9dbf728cbf
Please simply ignore all the other commits.

UPDATE:
new PR: #30 with one commit

@tbennun
Owner

tbennun commented Sep 9, 2018

@3DdeepAI thank you for understanding, and thank you for taking the time to create another PR. 👍
I'll take a look at it soon and we can keep discussing it there.
