Loss breaks after reaching minima #28

Open
RSKothari opened this issue May 1, 2020 · 3 comments

@RSKothari

Would you happen to have any intuition on this?
I'm using a U-Net-style network (with skip connections) whose output has 3 channels. The centre of mass (in my case, the pupil centre) is regressed from channel 1.

I apply torch.sigmoid to channel 1 before feeding it into the weighted Hausdorff loss, and I use a sufficiently small learning rate (5e-5) with Adam.
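
Roughly, the relevant part of my pipeline looks like this (simplified sketch; the Conv2d and `whd_loss` below are just stand-ins for my actual U-Net and the weighted Hausdorff loss wrapper I use):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified sketch of my setup. The Conv2d is a stand-in for my U-Net
# (3 output channels) and whd_loss is a stand-in for the weighted
# Hausdorff loss wrapper I actually use.
model = nn.Conv2d(1, 3, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

def whd_loss(prob_map, target_map):
    # placeholder for the real weighted Hausdorff distance
    return F.l1_loss(prob_map, target_map)

images = torch.rand(16, 1, 64, 64)        # dummy batch
target_map = torch.zeros(16, 64, 64)      # dummy pupil-centre map

out = model(images)                       # (B, 3, H, W)
prob_map = torch.sigmoid(out[:, 0])       # sigmoid on channel 1 only
loss = whd_loss(prob_map, target_map)     # the loss that later blows up
optimizer.zero_grad()
loss.backward()
optimizer.step()
```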

I observe that the loss decreases from 0.03 to 0.009 and the output of channel 1 starts to look as expected, i.e., we start seeing the expected blob. After converging to a minimum (which happens within 1 epoch), the loss jumps to its maximum (0.1 in my case) and stays there. I checked the gradient norms and found a lot of fluctuation in the norm values; the loss is also jumpy on every iteration.
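
For reference, this is (roughly) how I look at the gradient norms, right after `loss.backward()` in the sketch above:

```python
# Total gradient norm over all parameters, checked after loss.backward().
grad_norm = torch.norm(
    torch.stack([p.grad.detach().norm() for p in model.parameters()
                 if p.grad is not None])
)
print(f"grad norm: {grad_norm.item():.4f}")   # fluctuates heavily once the loss breaks
```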

Would you have an intuition about this?

@javiribera
Owner

What do you use the two remaining output channels for? Maybe the loss is bumpy because of whatever loss function you apply to those?

@RSKothari
Author

Hi @javiribera, those two channels remain unbounded, i.e., I don't attach them to any loss function. I think I should give a more detailed report of the analysis I've done.

First observation: the larger the batch size, the more stable the training. I needed a batch size of at least 16 to keep convergence stable for a longer time. Learning rate: 5e-5.

Case A: Only wHauss.
When wHauss is used without any other loss function, it works only when the activation is a sigmoid on channel 1 of the 3-channel segmentation output (the other 2 channels are unbounded and free to assume whatever values they want). Sigmoid works until the minimum is reached, and then training crashes.
Softmax across all 3 channels fails spectacularly. Both variants are sketched below.
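
For clarity, the two activation variants I compared (assuming `out` is the raw (B, 3, H, W) output of the network; the tensor here is just a dummy):

```python
import torch

out = torch.randn(16, 3, 64, 64)                    # raw 3-channel network output (dummy)

# Variant that works until the minimum is reached:
prob_map_sigmoid = torch.sigmoid(out[:, 0])         # sigmoid on channel 1 only

# Variant that fails spectacularly:
prob_map_softmax = torch.softmax(out, dim=1)[:, 0]  # softmax across all 3 channels
```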

Case B: wHauss in a multi-task paradigm.
When combined with other loss functions (if interested, please see https://arxiv.org/pdf/1910.00694.pdf), it remains stable at the minimum. Interestingly, it also works well with small batch sizes and softmax.
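
Schematically, the multi-task combination is along these lines (the stand-in losses, the dummy tensors, and the 0.5 weight are only illustrative, not the exact setup from the paper):

```python
import torch
import torch.nn.functional as F

out = torch.randn(4, 3, 64, 64, requires_grad=True)    # raw network output (dummy)
centre_map = torch.zeros(4, 64, 64)                     # dummy pupil-centre map
seg_labels = torch.zeros(4, 64, 64, dtype=torch.long)   # dummy segmentation labels

def whd_loss(prob_map, target_map):
    # placeholder for the real weighted Hausdorff distance
    return F.l1_loss(prob_map, target_map)

loss_whd = whd_loss(torch.softmax(out, dim=1)[:, 0], centre_map)  # centre localisation term
loss_seg = F.cross_entropy(out, seg_labels)                       # segmentation term
loss = loss_whd + 0.5 * loss_seg                                  # combined objective stays stable
loss.backward()
```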

Case C: wHauss with pretrained stable weights.
When training is initialised from pretrained weights, wHauss remains stable for a considerable time, although it eventually crashes after reaching the minimum.
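
In code, case C just means loading the stable weights first (the file name and checkpoint layout here are only examples, and the Conv2d is again a stand-in for the U-Net):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 3, kernel_size=3, padding=1)           # stand-in for the U-Net
state = torch.load("pretrained_multitask.pt", map_location="cpu")
model.load_state_dict(state)                                # assumes a plain state_dict file
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)   # then continue with wHauss alone
```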

@javiribera
Owner

> First observation: the larger the batch size, the more stable the training.

Well, this is true for any mini-batch SGD-based optimization.

Maybe this discussion helps: #2

I cannot help with segmentation tasks, since I have never applied the WHD to that purpose and it was not the intention of the paper. This repository is the implementation of that paper; it is not intended to be an all-in-one codebase for other tasks.

So let's focus on your original case (0). The problem of interest is that, when using the WHD by itself, you see the loss decrease in a very noisy manner. You mention it converges within 1 epoch, which seems very fast. I do remember that the WHD is noisy, but I never found it a huge problem. Do you see the same with SGD?
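
For example, keeping your training loop unchanged and only swapping the optimizer (the learning rate and momentum here are just starting points):

```python
# Only the optimizer changes; `model` is the same network as in your snippets above.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
```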
