Home
Welcome to the rethinking-sparse-learning wiki!
Nov 23rd to Nov 30th:
- Hyperparam tuning
  - [x] Alpha, Delta T
  - [x] Optuna, use 15 trials, 3 jobs in parallel
  - [x] Maximise val_accuracy
  - [x] Use single DB, different study names
  - [x] Plot should be of test
  - [x] Learning Rate
  - [x] Plot for each sparsity across the 4 $(\alpha, \Delta T)$ settings
  - [x] $(\alpha, \Delta T) = (0.3, 100), (0.4, 200), (0.4, 500), (0.5, 750)$
- CIFAR 10 Reporting
  - Script W&B metrics
  - Plot
  - Table
  - Longer 2x runs
  - RigL, RigL ERK
  - SET (Nov 28, 2020)
  - SNFS
  - Static
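Once the W&B metrics are scripted out, collapsing per-seed test accuracies into the reporting table could look like the sketch below. The method names match the runs above, but every number is a placeholder, not a result; in practice the dict would be filled from the exported W&B metrics:

```python
from statistics import mean, stdev

# Placeholder per-seed test accuracies (NOT real results), keyed by method;
# in practice these come from the scripted W&B metrics export.
runs = {
    "RigL":   [93.1, 93.4, 92.9],
    "SET":    [92.5, 92.8, 92.6],
    "SNFS":   [92.0, 92.3, 92.1],
    "Static": [91.0, 91.3, 90.8],
}

# One table row per method: mean +/- sample std over seeds.
for method, accs in runs.items():
    print(f"{method:<8} {mean(accs):.2f} +/- {stdev(accs):.2f}")
```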
- FLOP Counting
  - Adapt https://github.com/google-research/rigl/blob/master/rigl/imagenet_resnet/colabs/Resnet_50_Param_Flops_Counting.ipynb for our code + Wide ResNet
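At the level of a single conv layer, sparse FLOP counting reduces to scaling the dense multiply-add count by the layer's weight density. A minimal sketch (the layer shapes below are illustrative, not the exact Wide ResNet configuration):

```python
def conv_flops(c_in, c_out, k, h_out, w_out, density=1.0):
    """Multiply-adds for one k x k conv; sparse FLOPs scale with weight density."""
    return 2 * c_in * c_out * k * k * h_out * w_out * density

# Illustrative (c_in, c_out, k, H_out, W_out) tuples for a few conv layers.
layers = [(16, 160, 3, 32, 32), (160, 160, 3, 32, 32), (160, 320, 3, 16, 16)]

dense = sum(conv_flops(*l) for l in layers)
sparse = sum(conv_flops(*l, density=0.1) for l in layers)
print(f"dense: {dense:.3e} FLOPs, 10% density: {sparse:.3e} FLOPs")
```

Summing this over every conv and linear layer, with each layer's actual density, reproduces the per-model counts the linked colab computes.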
Dec 1st to 14th:

- CIFAR10
  - Plots
  - Hyperparam plots
- Mini-ImageNet
  - Dataloader
  - Which runs?
  - Dense
  - Do we need linear warmup & fancy tricks?
Extensions:
- Distributions: evaluate ERK vs Uniform in terms of computation (FLOPs)
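For the ERK vs Uniform comparison, per-layer ERK densities can be sketched as below. The conv shapes are hypothetical, and the clip-at-1 step of the full ERK recipe is omitted for brevity:

```python
def erk_densities(shapes, global_density):
    """Per-layer densities proportional to (c_out + c_in + kh + kw) / n_params,
    rescaled so the total kept weights match the global density budget."""
    n_params = [o * i * kh * kw for (o, i, kh, kw) in shapes]
    raw = [(o + i + kh + kw) / n for (o, i, kh, kw), n in zip(shapes, n_params)]
    scale = global_density * sum(n_params) / sum(r * n for r, n in zip(raw, n_params))
    return [scale * r for r in raw]

# Hypothetical conv shapes (c_out, c_in, kh, kw). Uniform gives every layer
# the same density; ERK keeps small layers denser and large layers sparser.
shapes = [(16, 3, 3, 3), (32, 16, 3, 3), (64, 32, 3, 3)]
print(erk_densities(shapes, global_density=0.1))
```

Because ERK keeps the thin early layers denser, it typically costs more FLOPs than Uniform at the same parameter budget — which is the computation gap this item asks to quantify.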
- Dynamic Structured Sparsity
- Effect of gradient accumulation
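One way to probe the gradient-accumulation question: accumulate |grad| over the ΔT steps between mask updates and grow connections by the accumulated score rather than the instantaneous one. A numpy sketch on a toy flat parameter vector, with random "gradients" standing in for real ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, delta_t = 20, 2, 50      # params, drop/grow count, steps between mask updates
w = rng.normal(size=n)
mask = np.zeros(n, dtype=bool)
mask[:10] = True               # 50% density
w[~mask] = 0.0                 # inactive weights are zero
acc = np.zeros(n)              # accumulated |grad| since the last mask update

for step in range(100):
    grad = rng.normal(size=n)  # stand-in for a real dense gradient
    acc += np.abs(grad)
    if (step + 1) % delta_t == 0:
        active = np.flatnonzero(mask)
        inactive = np.flatnonzero(~mask)
        drop = active[np.argsort(np.abs(w[active]))[:k]]   # smallest-|w| active
        grow = inactive[np.argsort(acc[inactive])[-k:]]    # largest accumulated |grad|
        mask[drop] = False
        w[drop] = 0.0
        mask[grow] = True      # grown weights start at zero
        acc[:] = 0.0           # reset the accumulator each cycle

print(mask.sum())              # density is preserved across updates
```

Comparing this against growth by the single-step gradient at the update step isolates the effect this item is after.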
- Effect of redistribution
  - Can ERK be a proxy? i.e., avoid redistribution and use ERK instead.
  - Need to show no gains for ERK from redistribution
  - And some gains for random
Experiments:

- RigL Random
- RigL Random with gradient redistribution
- RigL Random with momentum redistribution
- RigL Random with the final static distribution found above
- RigL ERK
- RigL ERK with redistribution
\vscomment{Question: Is the effect of redistribution to find a better power-law distribution? Question: Is the found distribution even power-law?}
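For the momentum-redistribution arms above, the reallocation step can be sketched as below: each layer's share of the active-weight budget is set proportional to the mean |momentum| of its currently active weights (swap in |grad| for the gradient-redistribution arm). Shapes and values are made up for illustration:

```python
import numpy as np

def redistribute(momenta, masks, total_active):
    """New per-layer active-weight counts, proportional to the mean
    |momentum| over each layer's currently active weights."""
    scores = [float(np.abs(m[mask]).mean()) for m, mask in zip(momenta, masks)]
    total = sum(scores)
    return [round(total_active * s / total) for s in scores]

# Two toy layers: layer 0 has larger momentum magnitudes, so it is handed
# a larger share of the 30-weight budget.
momenta = [np.full(100, 2.0), np.full(100, 1.0)]
masks = [np.ones(100, dtype=bool), np.ones(100, dtype=bool)]
print(redistribute(momenta, masks, total_active=30))  # -> [20, 10]
```

Logging these shares over training is one way to answer the power-law question in the comment above: fit the final shares against layer size and see whether they follow a power law.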
- Ablation-CAM: how do sparse nets see?
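Ablation-CAM weights each feature map by the relative drop in the class score when that map is zeroed out, then ReLUs the weighted sum. A toy numpy sketch with a hypothetical linear scoring head standing in for a real (sparse) network:

```python
import numpy as np

rng = np.random.default_rng(0)
fmaps = rng.random((4, 8, 8))            # C=4 feature maps from the last conv
head_w = np.array([0.9, 0.1, 0.5, 0.2])  # hypothetical class weights over pooled maps

def class_score(maps):
    # global-average-pool each map, then a linear class score
    return float(head_w @ maps.mean(axis=(1, 2)))

s_full = class_score(fmaps)
weights = []
for c in range(len(fmaps)):
    ablated = fmaps.copy()
    ablated[c] = 0.0                     # ablate one feature map at a time
    weights.append((s_full - class_score(ablated)) / s_full)

cam = np.maximum(np.tensordot(weights, fmaps, axes=1), 0.0)  # weighted sum + ReLU
print(cam.shape)
```

Running the same procedure on dense vs sparse checkpoints of the repo's models would give the "how do sparse nets see" comparison.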