- The classic methods: high accuracy solutions, slowly
- Gaussian elimination, LU, ...
- Modern iterative methods: modest accuracy solutions, quickly
- Gradient descent: minimize a local linear expansion
- Stochastic gradient methods: sample a mini-batch and minimize its linear expansion (both sketched below)
- Issues: stepsizes, batches, interpolation, ...
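A minimal NumPy sketch of the two update rules on a least-squares problem; the problem instance, stepsize, and batch size are illustrative assumptions, not recommendations.

```python
import numpy as np

# Illustrative least-squares problem: minimize f(x) = (1/2n) ||A x - b||^2.
rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_full(x):
    # Full gradient: (1/n) A^T (A x - b).
    return A.T @ (A @ x - b) / n

def grad_stochastic(x, batch):
    # Gradient of the same objective restricted to a sampled mini-batch of rows.
    Ab, bb = A[batch], b[batch]
    return Ab.T @ (Ab @ x - bb) / len(batch)

step = 0.1           # illustrative constant stepsize
x_gd = np.zeros(d)   # gradient descent iterate
x_sgd = np.zeros(d)  # stochastic gradient iterate
for t in range(500):
    # Gradient descent: step opposite the gradient of the full objective.
    x_gd -= step * grad_full(x_gd)
    # Stochastic gradient descent: same step, but on a random mini-batch.
    batch = rng.choice(n, size=16, replace=False)
    x_sgd -= step * grad_stochastic(x_sgd, batch)

print("GD loss :", 0.5 * np.mean((A @ x_gd - b) ** 2))
print("SGD loss:", 0.5 * np.mean((A @ x_sgd - b) ** 2))
```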
- The set-up: inverting a nonlinear mapping
- Compressive sensing
- Phase retrieval
- Models, loss functions, regularization
- Minimizing prediction error
- Maximizing the likelihood of data
- Leveraging prior information
- Population vs empirical problems
- LQR
- Bandits (sketch below)
- Reinforcement learning
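As a toy illustration of the bandit setting and of empirical vs. population quantities, here is a sketch of an epsilon-greedy strategy with running-mean reward estimates; the arm means and epsilon are made up for the example.

```python
import numpy as np

# Epsilon-greedy for a 3-armed Bernoulli bandit: mostly pull the arm with the
# best empirical mean, occasionally explore at random.
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.7])   # assumed (unknown) population means
counts = np.zeros(3)
estimates = np.zeros(3)                  # empirical means
epsilon = 0.1

for t in range(2000):
    if rng.random() < epsilon:
        arm = int(rng.integers(3))            # explore
    else:
        arm = int(np.argmax(estimates))       # exploit current estimate
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    # Incremental update of the empirical mean for the pulled arm.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print("estimated means:", estimates.round(2))
```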
- Convexity, smoothness, stochasticity
- LP, QP, and SDP
- Software: CVXPY and others (e.g., https://developers.google.com/optimization/math_opt)
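A small CVXPY example solving a simple constrained least-squares QP; the data and constraints are placeholders chosen only to show the modeling API.

```python
import cvxpy as cp
import numpy as np

# Illustrative QP: minimize ||A x - b||^2 subject to x >= 0 and sum(x) = 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

x = cp.Variable(5)
objective = cp.Minimize(cp.sum_squares(A @ x - b))
constraints = [x >= 0, cp.sum(x) == 1]
problem = cp.Problem(objective, constraints)
problem.solve()

print("optimal value:", problem.value)
print("optimal x:", x.value.round(3))
```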
- Gradients, Hessians, and Taylor's theorem (a numerical first-order check is sketched below)
- Gradient: direction of steepest descent
- First and second order optimality conditions
- Linear transformations/inner products
- Norms
- Jacobians and the chain rule
- Formal gradients of nondifferentiable functions
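A numerical check of the first-order Taylor expansion: f(x + t v) should match f(x) + t * <grad f(x), v> with error shrinking like t^2. The test function (softplus) is an arbitrary smooth choice.

```python
import numpy as np

def f(x):
    return np.log1p(np.exp(x)).sum()   # smooth test function (softplus)

def grad_f(x):
    return 1 / (1 + np.exp(-x))        # its gradient (sigmoid), coordinate-wise

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
v = rng.standard_normal(5)
for t in [1e-1, 1e-2, 1e-3]:
    linearization = f(x) + t * (grad_f(x) @ v)
    error = abs(f(x + t * v) - linearization)
    print(f"t={t:.0e}  |f(x+tv) - linearization| = {error:.2e}  (shrinks like t^2)")
```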
- Software:
- The details: Micrograd (https://github.com/karpathy/micrograd)
- Intro to PyTorch (possibly JAX); a minimal training loop is sketched below
- Architectures
- Colab
- GPUs
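A minimal PyTorch training loop: a tiny MLP fit with SGD on synthetic data, moved to a GPU when one is available. The architecture, data, and hyperparameters are illustrative.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Synthetic binary classification data.
X = torch.randn(512, 10)
y = (X.sum(dim=1, keepdim=True) > 0).float()

model = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
).to(device)
loss_fn = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

X, y = X.to(device), y.to(device)
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()              # autograd computes gradients, as in micrograd
    optimizer.step()             # one SGD step
print("final loss:", loss.item())
```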
- Gradient descent
- Stochastic gradient descent
- Optimizing the population risk
- Convexity: finds global minima, has rates, can be accelerated
- Nonconvexity: finds critical points, has rates, can be accelerated in certain cases
- In limited cases (e.g., the NTK regime), we can minimize the training loss
- Adagrad, Adam, momentum, Nesterov acceleration, ...
- https://pytorch.org/docs/stable/optim.html#algorithms
- Stepsize schedule
- Warm-up, decay, cosine, ... (schedule sketched below)
- Ill conditioning
- Interpolation
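A sketch of one common schedule, linear warm-up followed by cosine decay, written as a plain Python function; peak_lr, warmup_steps, and total_steps are assumed hyperparameters.

```python
import math

def lr_at(step, peak_lr=1e-3, warmup_steps=100, total_steps=1000):
    # Linear warm-up from 0 to peak_lr, then cosine decay to 0.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

for s in [0, 50, 100, 500, 999]:
    print(s, round(lr_at(s), 6))
```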
- Newton's, Gauss-Newton, and quasi-Newton methods
- Other preconditioners
- Natural gradient descent
- KFAC
- Shampoo (https://arxiv.org/pdf/2406.17748)
- Conjugate gradient (sketch below)
- GMRES
- CoLA software library (https://github.com/wilson-labs/cola)
- Zeroth order methods
- Mirror descent
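A plain NumPy implementation of conjugate gradient for a symmetric positive definite system; the test matrix is synthetic.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-8, max_iter=None):
    # Solve A x = b for symmetric positive definite A.
    x = np.zeros_like(b)
    r = b - A @ x            # residual
    p = r.copy()             # search direction
    rs = r @ r
    for _ in range(max_iter or len(b)):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p   # conjugate update of the direction
        rs = rs_new
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)       # SPD test matrix
b = rng.standard_normal(50)
x = conjugate_gradient(A, b)
print("residual norm:", np.linalg.norm(A @ x - b))
```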
- Constraints and regularization
- Proximal and projected gradient methods (a proximal gradient sketch follows the examples below)
- Examples:
- Compressive sensing, low-rank matrix completion
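A proximal gradient (ISTA) sketch for the lasso, whose prox step is soft-thresholding; the sparse recovery instance is synthetic and chosen to echo the compressive sensing example above.

```python
import numpy as np

# Lasso: minimize (1/2) ||A x - b||^2 + lam * ||x||_1 via proximal gradient.
def soft_threshold(z, tau):
    # Prox of the l1 norm: shrink each coordinate toward zero by tau.
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

rng = np.random.default_rng(0)
n, d, k = 100, 50, 5
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[:k] = rng.standard_normal(k)      # sparse ground-truth signal
b = A @ x_true

lam = 0.1
step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = largest eigenvalue of A^T A
x = np.zeros(d)
for _ in range(500):
    grad = A.T @ (A @ x - b)             # gradient of the smooth part
    x = soft_threshold(x - step * grad, step * lam)   # prox step

print("recovered support:", np.nonzero(np.abs(x) > 1e-3)[0])
```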
- Alternating minimization style methods
- Examples:
- Layer wise training of deep networks
- Sparse coding/dictionary learning
- Examples:
- Optimization over low-rank matrices and tensors
- Burer-Monteiro and Gauss-Newton
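A Burer-Monteiro sketch for low-rank matrix completion: parameterize the matrix as U V^T and run plain gradient descent on the observed entries. The rank, sampling rate, stepsize, and iteration count are illustrative, with no claim that they are tuned.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # rank-2 ground truth
mask = rng.random((n, n)) < 0.5                                  # ~50% observed entries

U = 0.1 * rng.standard_normal((n, r))     # small random initialization
V = 0.1 * rng.standard_normal((n, r))
step = 0.25 / np.linalg.norm(M, 2)        # conservative stepsize tied to the spectral norm
for _ in range(5000):
    residual = mask * (U @ V.T - M)       # error on observed entries only
    # Joint gradient step on 0.5 * ||mask * (U V^T - M)||_F^2.
    U, V = U - step * residual @ V, V - step * residual.T @ U

print("relative error:", np.linalg.norm(U @ V.T - M) / np.linalg.norm(M))
```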
- Distributed data: federated learning
- Sensitive data: differentially private gradient methods
- Basic principle: noise injection
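A sketch of the noise-injection principle behind differentially private gradient methods: clip each per-example gradient, average, and add Gaussian noise. The clip norm and noise multiplier are assumed hyperparameters, and privacy accounting is omitted.

```python
import numpy as np

def private_gradient(per_example_grads, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    # Clip each per-example gradient to norm <= clip_norm, average, then add
    # Gaussian noise scaled for the averaged gradient.
    rng = rng or np.random.default_rng()
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    mean = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(clipped), size=mean.shape)
    return mean + noise        # noisy average handed to the optimizer

rng = np.random.default_rng(0)
grads = [rng.standard_normal(5) for _ in range(32)]   # stand-ins for per-example gradients
print(private_gradient(grads, rng=rng))
```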
- Curriculum learning
- Low precision optimizers
- Predictive models of performance: μP (maximal update parameterization) and scaling "laws"
- Deep learning tuning playbook (https://github.com/google-research/tuning_playbook)
- Benchmarking Neural Network Training Algorithms (https://arxiv.org/abs/2306.07179)
- Sampling text via transformers and MinGPT (a single sampling step is sketched below)
- Sampling images with diffusion models
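A sketch of a single autoregressive sampling step: convert logits to a temperature-scaled softmax distribution and draw the next token. The logits here are random placeholders; in minGPT they would come from the transformer.

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, rng=None):
    # Temperature-scaled softmax over the vocabulary, then one categorical draw.
    rng = rng or np.random.default_rng()
    z = logits / temperature
    z = z - z.max()                       # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = rng.standard_normal(10)          # fake logits over a 10-token vocabulary
print([sample_next_token(logits, rng=rng) for _ in range(5)])
```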