
Awesome Deep Learning Theory

A curated list of awesome Deep Learning theories that shed light on the mysteries of DL

Tutorials and other resources

1. ICML 2018 Tutorial: Toward Theoretical Understanding of Deep Learning (Sanjeev Arora)

Materials: Homepage, Slides

Notes:

  • Covers the roles of convexity, overparameterization, depth, unsupervised learning/GANs, and simpler networks (compression)
  • Presents many important phenomena and simple theories.

2. ICML 2018 Invited Talk: Max Welling - Intelligence per Kilowatt-hour

Materials: YouTube (not from ICML)

  • Offers various views, not only on intelligence per kilowatt-hour but also on philosophical questions about the computational world.

3. ICML 2018 Invited Talk: Josh Tenenbaum - Building Machines that Learn and Think Like People

Materials: paper (55 pages), ICML Video

  • His view on the importance of Probabilistic Programming Languages (BayesFlow, ProbTorch, WebPPL...)
  • Symbolic representations are good for abstract reasoning.
  • The intuitive physics engine in our minds (even in newborns)
  • The hard problem of learning: learning programs is hard (difficult loss function)
  • Big question: how to connect symbolic representations and neural nets?

Papers at ICML 2018

Materials: pdf, code

  • Guided by Michael Jordan
  • Instancewise feature selection as a methodology for model interpretation
  • Learn a function to extract a subset of features that are most informative for each given example
  • Develops an efficient variational approximation to the mutual information (a minimal sketch of the idea follows this list)
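
A minimal PyTorch-style sketch of the general idea: an explainer network scores features per example, a relaxed top-k subset is sampled, and an approximator predicts the label from the selected features, which yields a variational lower bound on the mutual information. The module sizes, helper names, and the particular Gumbel-softmax relaxation written here are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, K, C = 20, 5, 2          # input features, features to keep, classes

explainer = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, D))
approximator = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, C))
opt = torch.optim.Adam(
    list(explainer.parameters()) + list(approximator.parameters()), lr=1e-3)

def sample_subset(logits, k, tau=0.5):
    # Relaxed top-k subset: element-wise max of k Gumbel-softmax samples.
    gumbel = -torch.log(-torch.log(torch.rand(k, *logits.shape) + 1e-10) + 1e-10)
    soft = F.softmax((logits.unsqueeze(0) + gumbel) / tau, dim=-1)
    return soft.max(dim=0).values            # soft mask in [0, 1]^D per example

def training_step(x, y):
    mask = sample_subset(explainer(x), K)     # which features to keep for each x
    logits = approximator(x * mask)           # predict the label from the selection
    loss = F.cross_entropy(logits, y)         # minimizing this maximizes a variational
    opt.zero_grad(); loss.backward(); opt.step()  # lower bound on I(X_S; Y)
    return loss.item()

# Usage with random data, just to show the shapes involved.
x, y = torch.randn(32, D), torch.randint(0, C, (32,))
print(training_step(x, y))
```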

Materials: pdf, code

  • A nice combination of theory and practice
  • PAC-Bayes bound (1999) -> PAC-Bayes bound for meta-learning (2014)
  • Stochastic neural nets whose posteriors and priors are factorized Gaussians (the classical bound is recalled below)
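
For reference, one standard form of the classical PAC-Bayes bound that this line of work builds on (McAllester-style; exact constants vary between statements): for any prior $P$ fixed before seeing the $m$ training samples, with probability at least $1-\delta$, simultaneously for all posteriors $Q$,

$$
\mathbb{E}_{h\sim Q}\big[L(h)\big] \;\le\; \mathbb{E}_{h\sim Q}\big[\hat{L}_S(h)\big] \;+\; \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{m}{\delta}}{2(m-1)}},
$$

where $L$ is the true risk and $\hat{L}_S$ the empirical risk; the meta-learning extension adds an environment-level bound of the same shape.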

Materials: pdf

  • Work of Christoph Lampert, who developed PAC-Bayes bounds for meta-learning.
  • Establishes a data-dependent notion of algorithmic stability for SGD
  • Develops novel generalization bounds

Materials: pdf

  • A stable learning algorithm: a small change in the data does not change the model too much
  • Stability -> generalization (see the classical bound after this list)
  • Establishes novel generalization bounds for learning algorithms that converge to global minima
  • Derives black-box stability results that depend only on the convergence of a learning algorithm and the geometry around the minimizers of the empirical risk function
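
As background for both stability entries above, the classical result (Bousquet and Elisseeff; also used by Hardt, Recht, and Singer for SGD) says that uniform stability controls the expected generalization gap: if replacing one of the $m$ training points changes the loss of the output hypothesis by at most $\beta$ at every test point, then

$$
\Big|\,\mathbb{E}_{S}\big[R(A(S)) - \hat{R}_S(A(S))\big]\,\Big| \;\le\; \beta .
$$

Data-dependent stability refines the constant $\beta$ using properties of the actual data distribution rather than worst-case arguments.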

Materials: pdf, supp

  • Although SGD is commonly viewed as a fast but inaccurate version of gradient descent (GD), it often finds better solutions than GD for modern neural networks.
  • To understand this phenomenon, the paper takes the alternative view that SGD works on a convolved (thus smoothed) version of the loss function.
  • Even if the function f has many bad local minima or saddle points, as long as for every point x the weighted average of the gradients in its neighborhood is one-point convex with respect to the desired solution x∗, SGD will get close to, and then stay around, x∗ with constant probability (the key notions are written out below).
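
Roughly (the notation here is ours, not copied from the paper), the smoothed loss and the one-point convexity condition can be written as

$$
\tilde{f}(x) \;=\; \mathbb{E}_{\xi\sim\nu}\big[f(x+\xi)\big],
\qquad
\big\langle -\nabla \tilde{f}(x),\; x^{*} - x \big\rangle \;\ge\; \delta\,\|x - x^{*}\|^{2} \quad \text{for all } x,
$$

where $\nu$ models the SGD noise; the second condition says the smoothed negative gradient always has a positive component pointing toward $x^{*}$, even though $\tilde f$ need not be convex.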

Materials: pdf, code

  • One of the biggest challenges in GAN research is assessing the quality of generated samples and detecting various levels of mode collapse.
  • Proposes a novel measure of GAN performance that compares geometrical (topological) properties of the underlying data manifold and the generated one, providing both qualitative and quantitative means of evaluation.

Materials: pdf, code

  • Neural nets can be compressed to reduce memory and computational requirements, or to increase accuracy by facilitating the use of a larger base architecture.
  • Focuses on pruning individual neurons, which can simultaneously trim model size, FLOPs, and run-time memory.
  • Utilizes the information bottleneck principle, instantiated via a tractable variational bound (the underlying objective is recalled below).
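
For context, the information bottleneck principle (Tishby et al.) seeks a compressed representation $Z$ of the layer input $X$ that stays predictive of the target $Y$; one common form of the objective (our paraphrase, not the paper's exact layer-wise formulation) is

$$
\min_{p(z\mid x)} \; I(X; Z) \;-\; \beta\, I(Z; Y),
$$

and the paper's variational bound makes this tractable so that neurons carrying little information can be pruned.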

Materials: pdf, supp

  • Build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators.
  • A large class of DNs can be written as a composition of max-affine spline operators (MASOs); the basic form is recalled below.
  • Deep Nets are (learned) matched filter banks.
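
Concretely (a standard way to write it; the notation is ours), a max-affine spline operator maps an input $x$ to an output whose $k$-th coordinate is

$$
[\mathrm{MASO}(x)]_k \;=\; \max_{r=1,\dots,R}\; \big( \langle A_{k,r},\, x \rangle + b_{k,r} \big),
$$

and, for example, a ReLU unit $\max(0, \langle w, x\rangle + b)$ is the special case $R=2$ with one affine piece fixed to zero; max-pooling is another special case.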

Materials: pdf, supp

  • The important role of depth in deep learning has been emphasized.
  • Argues that sufficient width of a feedforward network is equally important, by answering the simple question of under which conditions the decision regions of a neural network are connected.
  • A sufficiently wide hidden layer is necessary to guarantee that the network can produce disconnected decision regions.

Materials: pdf, supp

  • Recent works use PAC-Bayes and margin-based analyses, but so far do not yield sample-complexity bounds better than naive parameter counting.
  • Uses noise stability -> compression -> better bounds.
  • Making the generalization theory data-dependent yields stronger generalization bounds.

Materials: pdf, supp

  • An interesting approach to analyzing neural nets that has received renewed attention is to examine the network's equivalent kernel.
  • A fully connected feedforward network with one hidden layer, a certain weight distribution, an activation function, and an infinite number of neurons can be viewed as a mapping into a Hilbert space.
  • Derives the equivalent kernels of MLPs with ReLU or Leaky ReLU activations for all rotationally-invariant weight distributions (the Gaussian special case is recalled below).
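
For the special case of i.i.d. Gaussian weights and ReLU activations, the equivalent kernel is the well-known (degree-1) arc-cosine kernel of Cho and Saul,

$$
k(x, y) \;=\; \frac{\|x\|\,\|y\|}{2\pi}\,\big( \sin\theta + (\pi - \theta)\cos\theta \big),
\qquad \theta = \arccos\!\left(\frac{\langle x, y\rangle}{\|x\|\,\|y\|}\right),
$$

up to an overall variance factor depending on the weight scaling.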

Materials: pdf, supp

  • CNNs: equivariance to translation
  • Can we generalize to other action groups, such as rotations?
  • A more general notion of convolution on groups (written out below)
  • Fourier analysis on groups: spherical CNNs, steerability, and CNNs for manifolds
  • Neural nets for graphs: message passing neural nets.
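
As a reminder (standard group-equivariant CNN notation, e.g. Cohen and Welling; not copied from this paper), the group correlation that generalizes ordinary convolution replaces translations by elements $g$ of a group $G$:

$$
[f \star \psi](g) \;=\; \sum_{h \in G} \sum_{k} f_k(h)\, \psi_k\!\big(g^{-1} h\big),
$$

which is equivariant: transforming the input $f$ by some $u \in G$ transforms the output by the same $u$.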

Materials: pdf, supp

  • Studies the number of linear regions, i.e. pieces, that a piecewise-linear (PWL) function represented by a DNN can attain, both theoretically and empirically.
  • Gives tighter upper and lower bounds on the maximum number of linear regions of rectifier networks (a classical single-layer count is recalled below).
  • Indicates that a deep rectifier network can have more linear regions than every shallow counterpart with the same number of neurons only if that number exceeds the dimension of the input.
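
For intuition, the classical hyperplane-arrangement count (Zaslavsky) already covers the single-hidden-layer case: $n$ ReLU units acting on inputs in $\mathbb{R}^{d}$ partition the input space into at most

$$
\sum_{j=0}^{d} \binom{n}{j}
$$

linear regions, which is $\Theta(n^{d})$ for fixed $d$; the bounds in the paper refine how composing layers multiplies such counts.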

Materials: pdf, supp

  • Backpropagation-based visualizations have been proposed to interpret convolutional neural networks (CNNs); however, a theory justifying their behavior has been missing.
  • Develops a theoretical explanation revealing that GBP and DeconvNet essentially perform (partial) image recovery, which is unrelated to the network's decisions.

Materials: pdf, supp

  • Context: everything in DL (CNNs, LSTMs, ReLU, ResNet...) depends on gradient descent to find (good) local minima. This fails in game settings, such as GANs, where there are multiple interacting losses.
  • Problem: this paper analyzes why GD fails in games with multiple interacting losses, such as GANs, and proposes a way to modify/construct the losses so that GD succeeds.
  • Approach:
    • Decompose the second-order dynamics into two components. The first is related to potential games, which reduce to gradient descent on an implicit function; the second relates to Hamiltonian games, a new class of games that obey a conservation law, akin to conservation laws in classical mechanical systems.
    • Propose Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in general games (a toy sketch follows this list).
  • Result: SGA is competitive for finding local Nash equilibria in GANs.
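
A minimal numpy sketch (ours, not the authors' code) of the SGA update xi + lambda * A^T xi, where A is the antisymmetric part of the game Hessian, on the bilinear game l1(x, y) = x*y, l2(x, y) = -x*y whose only stable fixed point is (0, 0). Plain simultaneous gradient descent spirals away from it, while the adjusted dynamics converge:

```python
import numpy as np

def xi(z):
    # Simultaneous gradient (d l1/dx, d l2/dy) at z = (x, y) for l1 = x*y, l2 = -x*y.
    x, y = z
    return np.array([y, -x])

# The "game Hessian" (Jacobian of xi) is constant for this game.
H = np.array([[0.0, 1.0], [-1.0, 0.0]])
A = 0.5 * (H - H.T)              # antisymmetric part (here A == H, S == 0)

def step(z, lr=0.1, lam=1.0, adjust=True):
    g = xi(z)
    if adjust:
        g = g + lam * A.T @ g    # SGA: xi + lambda * A^T xi
    return z - lr * g

z_gd, z_sga = np.array([1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(200):
    z_gd = step(z_gd, adjust=False)    # plain simultaneous gradient descent
    z_sga = step(z_sga, adjust=True)   # symplectic gradient adjustment

print("GD  distance to equilibrium:", np.linalg.norm(z_gd))   # grows (spirals out)
print("SGA distance to equilibrium:", np.linalg.norm(z_sga))  # shrinks toward (0, 0)
```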

Important Researchers (To be added)

Seminal Papers

  • Overparameterization:
  1. Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers: pdf(NeurIPS 2019)
    This paper proves that an important concept class containing three-layer (resp. two-layer) neural networks equipped with smooth activations can be efficiently learned by three-layer (resp. two-layer) ReLU neural networks via SGD or its variants. It can be seen as meta-level theoretical support for the robustness guarantees of wide neural networks. (Seminal)
  2. On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization: pdf(ICML 2018)
    This paper suggests that overparameterization via increased depth acts as an implicit accelerator of optimization in deep learning.
  3. The Generalization Error of the Minimum-norm Solutions for Over-parameterized Neural Networks: pdf(Journal of Pure and Applied Functional Analysis 2020)
    This paper proves that, for all three over-parameterized models considered, the random feature model, two-layer neural networks, and residual networks, the generalization error of the minimum-norm solution is comparable to the Monte Carlo rate. (Seminal)
  4. How SGD Selects the Global Minima in Over-parameterized Learning: A Dynamical Stability Perspective: pdf(NeurIPS 2018)
    This paper provides a dynamical-stability analysis of SGD, showing that the batch size and the learning rate determine which global minima SGD can select.
  5. Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks: pdf(NeurIPS 2019)
    This paper analyzes a simple 2-layer ReLU net with random initialization.
  6. Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data: pdf(NeurIPS 2018)
    This paper proves the generalization ability of SGD in the over-parameterized setting under a well-separated data distribution.
  7. Toward Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks: pdf(IEEE Journal on Selected Areas in Information Theory 2020)
    This paper utilizes tools from random matrix theory and a bound on the spectrum of Hadamard matrices to prove the generalization ability of wide shallow neural networks. (Seminal)
  8. How much over-parameterization is sufficient to learn deep ReLU networks?: pdf(ICLR 2021)
    This paper shows that a network width that is polylogarithmic in the sample size n and in the inverse target error $\varepsilon^{-1}$ is sufficient to guarantee the learning of deep ReLU networks.
  9. Convex Formulation of Overparameterized Deep Neural Networks: pdf(Proceedings of the IEEE 2020)
    This paper gives a survey on the mathematical model of over-parameterized DNNs.
  • Generalization:
  1. Stronger generalization bounds for deep nets via a compression approach: pdf(ICML 2018)
    This paper derives an $L_{2}$-norm generalization bound.
  2. Understanding deep learning requires rethinking generalization: pdf(ICLR 2017)
    This paper inspects the influence of explicit and implicit regularization on the generalization of deep learning, from both theoretical and empirical perspectives.
  3. A Closer Look at Memorization in Deep Networks: pdf(ICML 2017)
    This paper inspects the effect of the data on the memorization behavior of neural networks.
  4. What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation: pdf(NeurIPS 2020)
    This paper inspects the relationship between generalization and memorization, showing that memorizing rare examples is necessary for achieving close-to-optimal accuracy when the data distribution is long-tailed.
  5. What Do Neural Networks Learn When Trained With Random Labels?: pdf(NeurIPS 2020)
    This paper studies the alignment between the principal components of network parameters and data with regard to the random labels.
  6. Exploring Generalization in Deep Learning: pdf(NeurIPS 2017)
    This paper explains why the model capacity of CNNs can be large enough to overfit any random labels. Moreover, it gives a generalization bound based on PAC-Bayes theory.
  7. Spectrally-normalized margin bounds for neural networks: pdf(NeurIPS 2017)
    This paper analyzes the margin-normalized spectral complexity of deep learning.
  • Optimization:
  1. Entropy-sgd: Biasing gradient descent into wide valleys: pdf(ICLR 2017)
    This paper proposes a new optimization scheme in which Langevin dynamics biases SGD toward wide valleys, from the perspective of a local-entropy energy function.
  2. Train faster, generalize better: Stability of stochastic gradient descent: pdf(ICML 2016)
    This paper analyzes the stability of SGD under Lipschitz and smoothness conditions.
  3. Convergence Analysis of Two-layer Neural Networks with ReLU Activation: pdf(NeurIPS 2017)
    This paper utilizes an "identity mapping" technique to show that SGD converges to the global minimum in a polynomial number of steps with standard O(1/sqrt(d)) initialization of the weights.
  4. Gradient descent optimizes over-parameterized deep ReLU networks: pdf(Springer Machine Learning Journal 2020)
    This paper studies the problem of training deep fully connected neural networks with Rectified Linear Unit (ReLU) activation function and cross entropy loss function for binary classification using gradient descent.
  • NTK:
  1. Backward Feature Correction: How Deep Learning Performs Deep Learning: pdf(arxiv 2020)
    This long paper finds that training the higher-level layers of a network can actually improve the features of the lower-level ones.
  2. Network size and weights size for memorization with two-layers neural networks: pdf(NeurIPS 2020)
    This paper presents the construction of memorization neural networks with the optimal numbers of neurons.
  3. A mean field view of the landscape of two-layer neural networks: pdf(PNAS 2018)
    This paper studies the SGD dynamics of two-layer neural networks from the perspective of PDEs (the mean-field limit).

  • Adversarial Examples:
  1. A single gradient step finds adversarial examples on random two-layers neural networks: pdf(arxiv 2021)
    This paper discusses adversarial examples for two-layer neural networks at (random) initialization.
  2. A law of robustness for two-layers neural networks: pdf(COLT 2020)
    This paper puts forward the conjecture that over-parameterization is necessary for robustness, i.e. for fitting the data with an O(1)-Lipschitz two-layer network.
  3. A Universal Law of Robustness via Isoperimetry: pdf(NeurIPS 2021)
    This paper shows, via isoperimetry, that over-parameterization is necessary for robustness: smoothly fitting the data requires many more parameters than mere interpolation (the resulting bound is sketched below).
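
Roughly stated (our paraphrase of the main bound, hiding logarithmic factors and technical conditions): for models with $p$ parameters fitting $n$ samples in $d$ dimensions whose labels carry some noise, under isoperimetric data distributions, any interpolating model $f$ must satisfy

$$
\mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{n\,d}{p}},
$$

so achieving an O(1) Lipschitz constant (a proxy for robustness) requires $p \gtrsim n\,d$, i.e. heavy over-parameterization.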
