
Awesome Deep Learning Theory

A curated list of awesome Deep Learning theories that shed light on the mysteries of DL

Tutorials and other resources

1. ICML 2018 Tutorial: Toward Theoretical Understanding of Deep Learning (Sanjeev Arora)

Materials: Homepage, Slides

Notes:

  • Covers the roles of convexity, overparameterization, depth, unsupervised learning/GANs, and simpler networks (compression)
  • Presents many important phenomena and simple theories.

2. ICML 2018 Invited Talk: Max Welling - Intelligence per Kilowatt-hour

Materials: YouTube (not from ICML)

  • Offers various views, not only on intelligence per kilowatt-hour but also on philosophical questions about the computational world.

3. ICML 2018 Invited Talk: Josh Tenenbaum - Building Machines that Learn and Think Like People

Materials: paper (55 pages), ICML Video

  • His view on the importance of Probabilistic Programming Languages (BayesFlow, ProbTorch, WebPPL...)
  • Symbolic representations are good for abstract reasoning.
  • The intuitive physics engine in our minds (even in newborns)
  • The hard problem of learning: learning programs is hard (difficult loss function)
  • Big question: how to connect symbolic representations and neural nets?

Papers at ICML 2018

Materials: pdf, code

  • Guided by Michael Jordan
  • Instancewise feature selection as a methodology for model interpretation
  • Learn a function to extract a subset of features that are most informative for each given example
  • Develops an efficient variational approximation to the mutual information (a minimal sketch of the idea follows this list)
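
A minimal PyTorch-style sketch of the general idea: an explainer network scores features per example, a relaxed top-k subset is sampled, and an approximator predicts the label from the selected features, which yields a variational lower bound on the mutual information. The module sizes, helper names, and the particular Gumbel-softmax relaxation written here are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, K, C = 20, 5, 2          # input features, features to keep, classes

explainer = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, D))
approximator = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, C))
opt = torch.optim.Adam(
    list(explainer.parameters()) + list(approximator.parameters()), lr=1e-3)

def sample_subset(logits, k, tau=0.5):
    # Relaxed top-k subset: element-wise max of k Gumbel-softmax samples.
    gumbel = -torch.log(-torch.log(torch.rand(k, *logits.shape) + 1e-10) + 1e-10)
    soft = F.softmax((logits.unsqueeze(0) + gumbel) / tau, dim=-1)
    return soft.max(dim=0).values            # soft mask in [0, 1]^D per example

def training_step(x, y):
    mask = sample_subset(explainer(x), K)     # which features to keep for each x
    logits = approximator(x * mask)           # predict the label from the selection
    loss = F.cross_entropy(logits, y)         # minimizing this maximizes a variational
    opt.zero_grad(); loss.backward(); opt.step()  # lower bound on I(X_S; Y)
    return loss.item()

# Usage with random data, just to show the shapes involved.
x, y = torch.randn(32, D), torch.randint(0, C, (32,))
print(training_step(x, y))
```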

Materials: pdf, code

  • A nice combination of theory and practice
  • PAC-Bayes bound (1999) -> PAC-Bayes bound for meta-learning (2014)
  • Stochastic neural nets whose posteriors and priors are factorized Gaussians (the classical bound is recalled below)
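
For reference, one standard form of the classical PAC-Bayes bound that this line of work builds on (McAllester-style; exact constants vary between statements): for any prior $P$ fixed before seeing the $m$ training samples, with probability at least $1-\delta$, simultaneously for all posteriors $Q$,

$$
\mathbb{E}_{h\sim Q}\big[L(h)\big] \;\le\; \mathbb{E}_{h\sim Q}\big[\hat{L}_S(h)\big] \;+\; \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{m}{\delta}}{2(m-1)}},
$$

where $L$ is the true risk and $\hat{L}_S$ the empirical risk; the meta-learning extension adds an environment-level bound of the same shape.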

Materials: pdf

  • Work of Christoph Lampert, who developed PAC-Bayes bounds for meta-learning.
  • Establishes a data-dependent notion of algorithmic stability for SGD
  • Develops novel generalization bounds

Materials: pdf

  • A stable learning algorithm: a small change in the data does not change the model too much
  • Stability -> generalization (see the classical bound after this list)
  • Establishes novel generalization bounds for learning algorithms that converge to global minima
  • Derives black-box stability results that depend only on the convergence of a learning algorithm and the geometry around the minimizers of the empirical risk function
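
As background for both stability entries above, the classical result (Bousquet and Elisseeff; also used by Hardt, Recht, and Singer for SGD) says that uniform stability controls the expected generalization gap: if replacing one of the $m$ training points changes the loss of the output hypothesis by at most $\beta$ at every test point, then

$$
\Big|\,\mathbb{E}_{S}\big[R(A(S)) - \hat{R}_S(A(S))\big]\,\Big| \;\le\; \beta .
$$

Data-dependent stability refines the constant $\beta$ using properties of the actual data distribution rather than worst-case arguments.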

Materials: pdf, supp

  • Although SGD is commonly viewed as a fast but inaccurate version of gradient descent (GD), it often finds better solutions than GD for modern neural networks.
  • To understand this phenomenon, the paper takes the alternative view that SGD works on a convolved (thus smoothed) version of the loss function.
  • Even if the function f has many bad local minima or saddle points, as long as for every point x the weighted average of the gradients in its neighborhood is one-point convex with respect to the desired solution x∗, SGD will get close to, and then stay around, x∗ with constant probability (the key notions are written out below).
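
Roughly (the notation here is ours, not copied from the paper), the smoothed loss and the one-point convexity condition can be written as

$$
\tilde{f}(x) \;=\; \mathbb{E}_{\xi\sim\nu}\big[f(x+\xi)\big],
\qquad
\big\langle -\nabla \tilde{f}(x),\; x^{*} - x \big\rangle \;\ge\; \delta\,\|x - x^{*}\|^{2} \quad \text{for all } x,
$$

where $\nu$ models the SGD noise; the second condition says the smoothed negative gradient always has a positive component pointing toward $x^{*}$, even though $\tilde f$ need not be convex.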

Materials: pdf, code

  • One of the biggest challenges in GAN research is assessing the quality of generated samples and detecting various levels of mode collapse.
  • Proposes a novel measure of GAN performance that compares geometrical (topological) properties of the underlying data manifold and the generated one, providing both qualitative and quantitative means of evaluation.

Materials: pdf, code

  • Neural nets can be compressed to reduce memory and computational requirements, or to increase accuracy by facilitating the use of a larger base architecture.
  • Focuses on pruning individual neurons, which can simultaneously trim model size, FLOPs, and run-time memory.
  • Utilizes the information bottleneck principle, instantiated via a tractable variational bound (the underlying objective is recalled below).
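
For context, the information bottleneck principle (Tishby et al.) seeks a compressed representation $Z$ of the layer input $X$ that stays predictive of the target $Y$; one common form of the objective (our paraphrase, not the paper's exact layer-wise formulation) is

$$
\min_{p(z\mid x)} \; I(X; Z) \;-\; \beta\, I(Z; Y),
$$

and the paper's variational bound makes this tractable so that neurons carrying little information can be pruned.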

Materials: pdf, supp

  • Build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators.
  • A large class of DNs can be written as a composition of max-affine spline operators (MASOs); the basic form is recalled below.
  • Deep Nets are (learned) matched filter banks.
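
Concretely (a standard way to write it; the notation is ours), a max-affine spline operator maps an input $x$ to an output whose $k$-th coordinate is

$$
[\mathrm{MASO}(x)]_k \;=\; \max_{r=1,\dots,R}\; \big( \langle A_{k,r},\, x \rangle + b_{k,r} \big),
$$

and, for example, a ReLU unit $\max(0, \langle w, x\rangle + b)$ is the special case $R=2$ with one affine piece fixed to zero; max-pooling is another special case.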

Materials: pdf, supp

  • The important role of depth in deep learning has been emphasized.
  • Argues that sufficient width of a feedforward network is equally important, by answering the simple question of under which conditions the decision regions of a neural network are connected.
  • A sufficiently wide hidden layer is necessary to guarantee that the network can produce disconnected decision regions.

Materials: pdf, supp

  • Recent works use PAC-Bayes and margin-based analyses, but so far do not yield sample-complexity bounds better than naive parameter counting.
  • Uses noise stability -> compression -> better bounds.
  • Making the generalization theory data-dependent yields stronger generalization bounds.

Materials: pdf, supp

  • An interesting approach to analyzing neural nets that has received renewed attention is to examine the network's equivalent kernel.
  • A fully connected feedforward network with one hidden layer, a certain weight distribution, an activation function, and an infinite number of neurons can be viewed as a mapping into a Hilbert space.
  • Derives the equivalent kernels of MLPs with ReLU or Leaky ReLU activations for all rotationally-invariant weight distributions (the Gaussian special case is recalled below).
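
For the special case of i.i.d. Gaussian weights and ReLU activations, the equivalent kernel is the well-known (degree-1) arc-cosine kernel of Cho and Saul,

$$
k(x, y) \;=\; \frac{\|x\|\,\|y\|}{2\pi}\,\big( \sin\theta + (\pi - \theta)\cos\theta \big),
\qquad \theta = \arccos\!\left(\frac{\langle x, y\rangle}{\|x\|\,\|y\|}\right),
$$

up to an overall variance factor depending on the weight scaling.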

Materials: pdf, supp

  • CNNs: equivariance to translation
  • Can we generalize to other action groups, such as rotations?
  • A more general notion of convolution on groups (written out below)
  • Fourier analysis on groups: spherical CNNs, steerability, and CNNs for manifolds
  • Neural nets for graphs: message passing neural nets.
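
As a reminder (standard group-equivariant CNN notation, e.g. Cohen and Welling; not copied from this paper), the group correlation that generalizes ordinary convolution replaces translations by elements $g$ of a group $G$:

$$
[f \star \psi](g) \;=\; \sum_{h \in G} \sum_{k} f_k(h)\, \psi_k\!\big(g^{-1} h\big),
$$

which is equivariant: transforming the input $f$ by some $u \in G$ transforms the output by the same $u$.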

Materials: pdf, supp

  • Studies the number of linear regions, i.e. pieces, that a piecewise-linear (PWL) function represented by a DNN can attain, both theoretically and empirically.
  • Gives tighter upper and lower bounds on the maximum number of linear regions of rectifier networks (a classical single-layer count is recalled below).
  • Indicates that a deep rectifier network can have more linear regions than every shallow counterpart with the same number of neurons only if that number exceeds the dimension of the input.
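
For intuition, the classical hyperplane-arrangement count (Zaslavsky) already covers the single-hidden-layer case: $n$ ReLU units acting on inputs in $\mathbb{R}^{d}$ partition the input space into at most

$$
\sum_{j=0}^{d} \binom{n}{j}
$$

linear regions, which is $\Theta(n^{d})$ for fixed $d$; the bounds in the paper refine how composing layers multiplies such counts.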

Materials: pdf, supp

  • Backpropagation-based visualizations have been proposed to interpret convolutional neural networks (CNNs); however, a theory justifying their behavior has been missing.
  • Develops a theoretical explanation revealing that GBP and DeconvNet essentially perform (partial) image recovery, which is unrelated to the network's decisions.

Materials: pdf, supp

  • Context: everything in DL (CNNs, LSTMs, ReLU, ResNet...) depends on gradient descent to find (good) local minima. This fails in game settings, such as GANs, where there are multiple interacting losses.
  • Problem: this paper analyzes why GD fails in games with multiple interacting losses, such as GANs, and proposes a way to modify/construct the losses so that GD succeeds.
  • Approach:
    • Decompose the second-order dynamics into two components. The first is related to potential games, which reduce to gradient descent on an implicit function; the second relates to Hamiltonian games, a new class of games that obey a conservation law, akin to conservation laws in classical mechanical systems.
    • Propose Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in general games (a toy sketch follows this list).
  • Result: SGA is competitive for finding local Nash equilibria in GANs.
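
A minimal numpy sketch (ours, not the authors' code) of the SGA update xi + lambda * A^T xi, where A is the antisymmetric part of the game Hessian, on the bilinear game l1(x, y) = x*y, l2(x, y) = -x*y whose only stable fixed point is (0, 0). Plain simultaneous gradient descent spirals away from it, while the adjusted dynamics converge:

```python
import numpy as np

def xi(z):
    # Simultaneous gradient (d l1/dx, d l2/dy) at z = (x, y) for l1 = x*y, l2 = -x*y.
    x, y = z
    return np.array([y, -x])

# The "game Hessian" (Jacobian of xi) is constant for this game.
H = np.array([[0.0, 1.0], [-1.0, 0.0]])
A = 0.5 * (H - H.T)              # antisymmetric part (here A == H, S == 0)

def step(z, lr=0.1, lam=1.0, adjust=True):
    g = xi(z)
    if adjust:
        g = g + lam * A.T @ g    # SGA: xi + lambda * A^T xi
    return z - lr * g

z_gd, z_sga = np.array([1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(200):
    z_gd = step(z_gd, adjust=False)    # plain simultaneous gradient descent
    z_sga = step(z_sga, adjust=True)   # symplectic gradient adjustment

print("GD  distance to equilibrium:", np.linalg.norm(z_gd))   # grows (spirals out)
print("SGA distance to equilibrium:", np.linalg.norm(z_sga))  # shrinks toward (0, 0)
```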

Important Researchers (To be added)

Seminal Papers

  • Overparameterization:
  1. Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers: pdf(NeurIPS 2019)
    This paper proves that an important concept class containing three-layer (resp. two-layer) neural networks equipped with smooth activations can be efficiently learned by three-layer (resp. two-layer) ReLU neural networks via SGD or its variants. It can be seen as meta-level theoretical support for the robustness guarantees of wide neural networks. (Seminal)
  2. On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization: pdf(ICML 2018)
    This paper suggests that overparameterization via increased depth acts as an implicit accelerator of optimization in deep learning.
  3. The Generalization Error of the Minimum-norm Solutions for Over-parameterized Neural Networks: pdf(Journal of Pure and Applied Functional Analysis 2020)
    This paper proves that, for all three over-parameterized models considered, the random feature model, two-layer neural networks, and residual networks, the generalization error of the minimum-norm solution is comparable to the Monte Carlo rate. (Seminal)
  4. How SGD Selects the Global Minima in Over-parameterized Learning: A Dynamical Stability Perspective: pdf(NeurIPS 2018)
    This paper provides a dynamical-stability analysis of SGD, showing that the batch size and the learning rate determine which global minima SGD can select.
  5. Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks: pdf(NeurIPS 2019)
    This paper analyzes a simple 2-layer ReLU net with random initialization.
  6. Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data: pdf(NeurIPS 2018)
    This paper proves the generalization ability of SGD in the over-parameterized setting under a well-separated data distribution.
  7. Toward Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks: pdf(IEEE Journal on Selected Areas in Information Theory 2020)
    This paper utilizes tools from random matrix theory and a bound on the spectrum of Hadamard matrices to prove the generalization ability of wide shallow neural networks. (Seminal)
  8. How much over-parameterization is sufficient to learn deep ReLU networks?: pdf(ICLR 2021)
    This paper shows that a network width that is polylogarithmic in the sample size n and in the inverse target error $\varepsilon^{-1}$ is sufficient to guarantee the learning of deep ReLU networks.
  9. Convex Formulation of Overparameterized Deep Neural Networks: pdf(Proceedings of the IEEE 2020)
    This paper gives a survey on the mathematical model of over-parameterized DNNs.
  • Generalization:
  1. Stronger generalization bounds for deep nets via a compression approach: pdf(ICML 2018)
    This paper derives an $L_{2}$-norm generalization bound.
  2. Understanding deep learning requires rethinking generalization: pdf(ICLR 2017)
    This paper inspects the influence of explicit and implicit regularization on the generalization of deep learning, from both theoretical and empirical perspectives.
  3. A Closer Look at Memorization in Deep Networks: pdf(ICML 2017)
    This paper inspects the effect of the data on the memorization behavior of neural networks.
  4. What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation: pdf(NeurIPS 2020)
    This paper inspects the relationship between generalization and memorization, showing that memorizing rare examples is necessary for achieving close-to-optimal accuracy when the data distribution is long-tailed.
  5. What Do Neural Networks Learn When Trained With Random Labels?: pdf(NeurIPS 2020)
    This paper studies the alignment between the principal components of network parameters and data with regard to the random labels.
  6. Exploring Generalization in Deep Learning: pdf(NeurIPS 2017)
    This paper explains why the model capacity of CNNs can be large enough to overfit any random labels. Moreover, it gives a generalization bound based on PAC-Bayes theory.
  7. Spectrally-normalized margin bounds for neural networks: pdf(NeurIPS 2017)
    This paper analyzes the margin-normalized spectral complexity of deep learning.
  • Optimization:
  1. Entropy-sgd: Biasing gradient descent into wide valleys: pdf(ICLR 2017)
    This paper proposes a new optimization scheme in which Langevin dynamics biases SGD toward wide valleys, from the perspective of a local-entropy energy function.
  2. Train faster, generalize better: Stability of stochastic gradient descent: pdf(ICML 2016)
    This paper analyzes the stability of SGD under Lipschitz and smoothness conditions.
  3. Convergence Analysis of Two-layer Neural Networks with ReLU Activation: pdf(NeurIPS 2017)
    This paper utilizes an "identity mapping" technique to show that SGD converges to the global minimum in a polynomial number of steps with standard O(1/sqrt(d)) initialization of the weights.
  4. Gradient descent optimizes over-parameterized deep ReLU networks: pdf(Springer Machine Learning Journal 2020)
    This paper studies the problem of training deep fully connected neural networks with Rectified Linear Unit (ReLU) activation function and cross entropy loss function for binary classification using gradient descent.
  • NTK:
  1. Backward Feature Correction: How Deep Learning Performs Deep Learning: pdf(arxiv 2020)
    This long paper finds that training the higher-level layers of a network can actually improve the features of the lower-level ones.
  2. Network size and weights size for memorization with two-layers neural networks: pdf(NeurIPS 2020)
    This paper presents the construction of memorization neural networks with the optimal numbers of neurons.
  3. A mean field view of the landscape of two-layer neural networks: pdf(PNAS 2018)
    This paper studies the SGD dynamics of two-layer neural networks from the perspective of PDEs (the mean-field limit).

  • Adversarial Examples:
  1. A single gradient step finds adversarial examples on random two-layers neural networks: pdf(arxiv 2021)
    This paper discusses adversarial examples for two-layer neural networks at (random) initialization.
  2. A law of robustness for two-layers neural networks: pdf(COLT 2020)
    This paper puts forward the conjecture that over-parameterization is necessary for robustness, i.e. for fitting the data with an O(1)-Lipschitz two-layer network.
  3. A Universal Law of Robustness via Isoperimetry: pdf(NeurIPS 2021)
    This paper shows, via isoperimetry, that over-parameterization is necessary for robustness: smoothly fitting the data requires many more parameters than mere interpolation (the resulting bound is sketched below).
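
Roughly stated (our paraphrase of the main bound, hiding logarithmic factors and technical conditions): for models with $p$ parameters fitting $n$ samples in $d$ dimensions whose labels carry some noise, under isoperimetric data distributions, any interpolating model $f$ must satisfy

$$
\mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{n\,d}{p}},
$$

so achieving an O(1) Lipschitz constant (a proxy for robustness) requires $p \gtrsim n\,d$, i.e. heavy over-parameterization.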
