$$ \def\v{\vec} \def\mean#1{\langle#1\rangle} \def\abs#1{\,|#1|\,} $$

CS434

GNU General Public License v3.0 licensed. Source available on github.com/zifeo/EPFL.

Fall 2016: Unsupervised and Reinforcement Learning in Neural Networks

[TOC]

Unsupervised learning

  • learning : act of acquiring new, or modifying/reinforcing existing, knowledge/behaviors/skills
  • long term memory
    • explicit (declarative) : episodic (events), semantic (concepts)
    • implicit : procedural (skills), emotional conditioning, priming effect, conditioned reflex
  • types of learning
    • supervised : labeled data
    • unsupervised : no explicit teachings, concept formulation, detect intrinsic structure
    • reinforced : choice of actions, reward upon correct one
  • synaptic plasticity : change in connection strength
  • neuron models
    • compartmental models
    • cable models
    • point models
    • Hodgkin-Huxley : biologically realistic, non-linear, computationally expensive, analytically intractable
    • integrate and fire models
      • time constant : $\tau=RC$ (see the leaky integrate-and-fire sketch after this list)
    • Hebbian learning : if cell $j$ repeatedly takes part in firing cell $i$, then the efficiency of $j$ onto $i$ is increased; a local rule based on correlations
  • synaptic dynamics
    • induction of changes
      • fast (if stimulated appropriately)
      • slow (homeostasis)
    • persistence of changes
      • long : long-term potentiation/depression
      • short : short-term plasticity
    • functionality : useful for
      • learning new behaviour
      • development (wiring receptive field)
      • activity control in network
      • coding
  • unsupervised vs reinforcement
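
Below is a minimal Python sketch of a leaky integrate-and-fire neuron with membrane time constant $\tau=RC$, integrated with the Euler method; all parameter values (resistance, threshold, reset, input current) are illustrative assumptions, not values fixed by the course.

```python
import numpy as np

# Minimal leaky integrate-and-fire neuron: tau du/dt = -(u - u_rest) + R*I,
# spike when u >= theta, then reset. Parameter values are illustrative only.
tau, R = 10e-3, 1e7              # tau = R*C with C = 1 nF
theta, u_rest, dt = -50e-3, -70e-3, 1e-4
I = 2.2e-9                       # constant input current (A)

u, spikes = u_rest, []
for step in range(5000):         # 0.5 s of simulated time
    u += dt / tau * (-(u - u_rest) + R * I)   # Euler step of the membrane equation
    if u >= theta:               # threshold crossing: record spike, reset potential
        spikes.append(step * dt)
        u = u_rest
print(f"{len(spikes)} spikes in 0.5 s")
```

With these numbers the steady-state depolarization $RI=22$ mV exceeds the 20 mV gap between rest and threshold, so the neuron fires regularly.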

Hebb's rule

  • Hebbian learning : rate-based theory
    • rate : $v_i^{post}=g(I_i)=g(\sum_j w_{ij}v_j^{pre})$
    • weight : $\Delta w_{ij}\propto F(pre,post)$
    • fixed rate vs jointly varying rate
  • Hebbian rate-based model
    • BCM : leads to specialized neurons and developmental learning, postsynaptic term is nonlinear (e.g. threshold $\vartheta=f(\bar v_i^{post})$)
  • synaptic changes for development
    • initial random connections : unselective neurons
    • correlated input : output neurons
    • matrix : correlation $C_{kj}=\mean{x_kx_j}=\frac{1}{p}\sum_{\mu=1}^px_k^\mu x_j^\mu$ to covariance $C^0_{kj}=\mean{(x_k-\mean{x_k})(x_j-\mean{x_j})}$ (see the sketch below)
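
As referenced above, a small numpy sketch of the correlation and covariance matrices $C$ and $C^0$, plus one plain Hebbian rate update; the data matrix `X` and the identity gain function $g$ are illustrative assumptions.

```python
import numpy as np

# Correlation and covariance matrices of a pattern set, as defined above.
# X is a made-up data matrix with one pattern x^mu per row (p patterns, N inputs).
rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=0.5, size=(200, 4))   # toy input patterns

p = X.shape[0]
C = X.T @ X / p                          # correlation matrix C_kj = <x_k x_j>
X0 = X - X.mean(axis=0)                  # subtract the mean of each component
C0 = X0.T @ X0 / p                       # covariance matrix C^0_kj

# One step of a plain Hebbian rate rule: dw_j ∝ v^post * v_j^pre (here g = identity).
eta, w = 0.01, rng.normal(size=4)
v_post = w @ X[0]                        # postsynaptic rate for the first pattern
w += eta * v_post * X[0]                 # Hebbian weight change
```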

Principal component analysis

  • PCA : orthogonal transformation to convert correlated variables into linearly uncorrelated variables
    • subtract mean : $\v x-\mean{\v x}$
    • calculate covariance matrix : $C_{kj}^0=\mean{(x_k-\mean{x_k})(x_j-\mean{x_j})}$
    • calculate eigenvectors : $C^0\cdot\v e_n=\lambda_n\cdot \v e_n$ with $\lambda_1\ge\cdots\ge\lambda_n$ (keep only the first components)
    • neural implementation : need more than hebbian learning
      • encode patterns in firing rates
      • compute co-variance matrix $C$ of inputs
      • compute eigenvectors of $C$
    • all directions : equally likely (isotropic case, PCA gives no preferred component)
  • Oja rule : $\Delta w_j=\eta\,v^{post}(x_j-v^{post}w_j)$, a Hebbian term plus a decay that normalizes $\v w$ and converges to the first principal component (see the sketch below)
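
A minimal sketch of the PCA steps above together with online learning under the assumed Oja form $\Delta w_j=\eta\,v(x_j-v\,w_j)$; the toy 2-D Gaussian data and the learning rate are illustrative choices.

```python
import numpy as np

# PCA as listed above: subtract the mean, build the covariance matrix, keep the
# leading eigenvector. Oja's rule run online should align with the same direction.
rng = np.random.default_rng(1)
A = np.array([[3.0, 0.5], [0.5, 0.3]])           # toy anisotropic covariance
X = rng.multivariate_normal([2.0, -1.0], A, size=2000)

X0 = X - X.mean(axis=0)                          # subtract mean
C0 = X0.T @ X0 / len(X0)                         # covariance matrix
lam, E = np.linalg.eigh(C0)                      # eigenvalues in ascending order
e1 = E[:, -1]                                    # first principal component

w = rng.normal(size=2)
for x in X0:                                     # online Oja learning
    v = w @ x                                    # linear neuron output
    w += 0.001 * v * (x - v * w)                 # Hebbian term + decay keeps |w| near 1

# Oja's weight vector matches e1 up to sign.
print("PCA e1:", e1, " Oja w:", w / np.linalg.norm(w))
```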

Independent component analysis

  • ICA : find natural axis
    • renormalize : mean zero, unit variance
    • whiten : PCA, divide each component by $\sqrt{\lambda_n}$
  • signal : mixed $x=As$ recover through $s\approx y=Wx$
  • non-gaussianity : maximise $J(y)=[E(F(y))-E_{Gauss}(F(v))]^2$
    • $J(w)=[\gamma(w)-a]^2$ : $\frac{\partial J}{\partial w}=0\iff\frac{\partial\gamma}{\partial w}=0$
    • $K(y)=E(y^4)-3E(y^2)^2$
    • gradient ascent : $\Delta w_n=\eta E[x_n g(\sum_k w_kx_k)]$ for $J(\v w)=E[F(\v w^\top\v x)]=E[F(y)]$
  • kurtosis : of $y=\v w^\top\v x$ is at a local max if the projection $\v w$ points in the direction of an independent component
  • ICA gradient ascent
    • batch : $\v w^{new}=\v w^{old}+\eta E[\v x g(\v w^\top\v x)]$
    • online : $\Delta\v w=\eta\v x g(\v w^\top\v x)$
    • online-components : $\Delta w_{ij}=\eta x_j g_i(\sum_k w_{ik}x_k)$
  • newton methods
  • fast ICA
    • initialize : $w$
    • newton step : $\v w^{new}=E[\v x g(\v w^\top\v x)-\v w g'(\v w^\top \v x)]$
    • renormalize : $\v w=\frac{\v w^{new}}{\abs{\v w^{new}}}$
    • loop until converged (see the FastICA sketch after this list)
  • temporal ICA : find unmixing matrix $W$ s.t. outputs are decorrelated for all time delays
    • $\mean{y_i(t)y_k(t-\tau)}=\delta_{ik}\lambda(\tau)$ for all delays
    • $\mean{y_i(t)y_k(t)}=\delta_{ik}$
  • application : development of localized edge-detector receptive fields (Gabor wavelets)
  • hebbian learning = unsupervised learning
    • in linear neurons : PCA
    • in nonlinear neurons : ICA
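
A sketch of one-unit FastICA following the steps above (whiten, fixed-point step, renormalize, loop); the nonlinearity $g=\tanh$ and the toy two-source mixture are assumptions for illustration.

```python
import numpy as np

# One-unit FastICA on whitened data, with the common choice g(u) = tanh(u),
# g'(u) = 1 - tanh(u)^2 (assumed nonlinearity, not fixed by the course).
rng = np.random.default_rng(2)
s = np.vstack([np.sign(rng.normal(size=5000)),        # two non-Gaussian sources
               rng.uniform(-1, 1, size=5000)])
A = np.array([[1.0, 0.6], [0.4, 1.0]])
x = A @ s                                             # mixed signals x = A s

# whiten: zero mean, then rotate/scale so the covariance is the identity
x = x - x.mean(axis=1, keepdims=True)
lam, E = np.linalg.eigh(x @ x.T / x.shape[1])
z = (E / np.sqrt(lam)).T @ x                          # whitened data

w = rng.normal(size=2)
w /= np.linalg.norm(w)
for _ in range(100):                                  # fixed-point (Newton-like) steps
    y = w @ z
    w_new = (z * np.tanh(y)).mean(axis=1) - (1 - np.tanh(y) ** 2).mean() * w
    w_new /= np.linalg.norm(w_new)                    # renormalize
    converged = abs(w_new @ w) > 1 - 1e-8             # direction stable (up to sign)
    w = w_new
    if converged:
        break
print("recovered direction:", w)
```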

Clustering

  • network wiring is adaptive, not fixed
  • prototype : $w_k$
  • closest rule : $\abs{\v w_k-\v x}\le\abs{\v w_i-\v x};\forall i$
  • clustering : discretization, compression, quantization
  • reconstruction error : mean $E=\sum_k\sum_{\mu\in C_k}(\v w_k-\v x^\mu)^2$
  • k-means online :
    • determine winner $k$ : $\abs{\v w_k-\v x^\mu}\le\abs{\v w_i-\v x^\mu};\forall i$ (network dynamics)
    • update this prototype : $\Delta\v w_k=-\eta(\v w_k-\v x^\mu)$ (Oja's Hebbian learning; see the sketch after this list)
    • reduce : $\eta$ (e.g. $1/N_k$, developmental change)
  • k-means batch
    • for every datapoint find closest
    • update weights of all prototypes : $\Delta\v w_k=-\eta\sum_{\mu\in C_k}(\v w_k-\v x^\mu)$
  • equivalence : batch and online same for small learning rate
  • neuronal implementation
    • equivalence : assuming weights normalized
      • $\v w_k^\top\v x^\mu\ge\v w_i^\top\v x^\mu;\forall i$
      • $\abs{\v w_k-\v x^\mu}\le\abs{\v w_i-\v x^\mu};\forall i$
    • dead units : caused by local minima; avoid by initializing weights on data samples and updating the winner together with its neighboring losers
    • winner-take-all : the winner excites itself and inhibits the others; add noise
    • model
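
A minimal online k-means sketch matching the steps above, with prototypes initialized on data samples (to avoid dead units) and learning rate $\eta=1/N_k$; the three-cluster toy data are made up for illustration.

```python
import numpy as np

# Online k-means: find the winning prototype, move it toward the data point,
# shrink the learning rate as 1/N_k (per-prototype update count).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.3, size=(300, 2))       # three toy clusters
               for c in ([0, 0], [3, 0], [0, 3])])
rng.shuffle(X)

K = 3
w = X[rng.choice(len(X), K, replace=False)].copy()     # init prototypes on data samples
n = np.zeros(K)                                         # per-prototype counts N_k

for x in X:
    k = np.argmin(np.linalg.norm(w - x, axis=1))        # winner: closest prototype
    n[k] += 1
    w[k] += (1.0 / n[k]) * (x - w[k])                   # eta = 1/N_k, Delta w_k = -eta (w_k - x)
print(np.round(w, 2))
```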

Kohonen networks

  • Kohonen's model : topographical feature map, spatially bounded excitation zone, input space mapped onto cortical space
    • model
    • 2D to 1D : travelling salesman problem
    • 2D to 2D
  • Kohonen algorithm
    • choose input : $\v x^\mu$
    • determine winner $k$ : $\abs{\v w_k-\v x^\mu}\le\abs{\v w_i-\v x^\mu};\forall i$
    • update weights of winner and neighbors : $\Delta \v w_j=\eta\Lambda(j,k)(\v x^\mu-\v w_j)$
    • decrease size of neighborhood function
    • iterate until stabilization (see the sketch after this list)
  • elastic net algorithm
    • choose input
    • calculate relative importance of $j$ : $\alpha(\abs{\v x^\mu -\v w_j})$
    • update everybody : $\Delta \v w_j=\eta\alpha(\abs{\v x^\mu-\v w_j})(\v x^\mu-\v w_j)+\sum_{k\in T_j}(\v w_k-\v w_j)$ with $T_j$ neighbors in input space
    • iterate until stabilization
    • importance factor : $\alpha(\abs{\v x^\mu -\v w_j})=\frac{\exp(-\gamma\abs{\v x^\mu-\v w_j}^2)}{\sum_k\exp(-\gamma\abs{\v x^\mu-\v w_k}^2)}$
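
A sketch of the Kohonen algorithm for the 2D-to-1D case above; the Gaussian neighborhood $\Lambda(j,k)$ and the multiplicative decay schedules for $\eta$ and the neighborhood width are common but assumed choices.

```python
import numpy as np

# Kohonen map: a 1-D chain of units learning to cover a 2-D square input space.
rng = np.random.default_rng(4)
n_units = 20
w = rng.uniform(0, 1, size=(n_units, 2))                # prototypes in input space
pos = np.arange(n_units)                                # positions on the 1-D cortical grid

eta, sigma = 0.5, 5.0
for t in range(5000):
    x = rng.uniform(0, 1, size=2)                       # choose input x^mu
    k = np.argmin(np.linalg.norm(w - x, axis=1))        # determine winner
    Lam = np.exp(-((pos - pos[k]) ** 2) / (2 * sigma ** 2))   # neighborhood Lambda(j,k)
    w += eta * Lam[:, None] * (x - w)                   # update winner and neighbors
    eta *= 0.999                                        # decrease learning rate
    sigma *= 0.999                                      # shrink the neighborhood
print(np.round(w, 2))
```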

Receptive field development

  • self-organization of cortex
    • neurons have localized receptive fields (RF)
    • neighboring neurons have similar RF
    • RFs and their neighborhoods change with experience
  • simple cells and RF as Gabor filters
    • sensitivity : edge detection
  • complex cells and energy model
    • sensitivity : oriented bars, rough position, invariant
    • classification
    • stimulation : light spot or gabor-like wavelet
  • position-invariant cells in inferotemporal cortex (IT) : lateral (bottom) part of the ventral pathway
  • hebbian learning and RF development
  • localized receptive fields with orientation tuning : hebbian learning + linear neurons + arborization function

Long-term potentiation

  • spike-based hebbian learning
  • spike-timing-dependent plasticity (STDP) via local variables (see the sketch after this list)
  • induction rule : abs
    • depression
    • potentiation
  • STDP model yields BCM
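
A sketch of pair-based STDP implemented with local variables (one trace per neuron), consistent with the potentiation/depression induction rules above; the amplitudes, time constants and Poisson-like spike trains are illustrative assumptions.

```python
import numpy as np

# Pair-based STDP with one exponential trace per neuron: pre-before-post pairings
# potentiate, post-before-pre pairings depress. Parameter values are illustrative.
dt, T = 1e-3, 1.0
tau_plus = tau_minus = 20e-3
A_plus, A_minus = 0.01, 0.012

rng = np.random.default_rng(5)
pre = rng.random(int(T / dt)) < 10 * dt         # Poisson-like pre spikes (~10 Hz)
post = rng.random(int(T / dt)) < 10 * dt        # Poisson-like post spikes (~10 Hz)

w, x_pre, y_post = 0.5, 0.0, 0.0
for s_pre, s_post in zip(pre, post):
    x_pre += -dt / tau_plus * x_pre + s_pre     # presynaptic trace (local variable)
    y_post += -dt / tau_minus * y_post + s_post # postsynaptic trace (local variable)
    if s_post:                                  # pre-before-post: potentiation
        w += A_plus * x_pre
    if s_pre:                                   # post-before-pre: depression
        w -= A_minus * y_post
print("final weight:", round(w, 4))
```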

Reinforcement learning

  • conditioning : Pavlov's dog
  • RL theory
  • exploration exploitation dilemma
    • explore : estimate reward probabilities
    • exploitation : take action which looks optimal to maximize reward
  • strategy/policy
    • greedy : best action $Q(s,a^*)>Q(s,a_j)$
    • $\epsilon$-greedy : best action with probability $P=1-\epsilon$
    • softmax : action with probability $P(a')=\frac{\exp(\beta Q(a'))}{\sum_a\exp(\beta Q(a))}$
      • temperature : $T=1/\beta$
    • optimistic : start with Q-values that are too large, so every action gets tried
  • multi-step horizon
  • Bellman equation : $Q(s,a)=\sum_{s'}P_{s\to s'}^a(R_{s\to s'}^a+\gamma\sum_{a'}\pi(s',a')Q(s',a'))$
  • sarsa
    • initialize Q-values, start from initial state
    • in state $s$, choose action $a$ according to policy
    • observe reward $r$ and next state $s'$
    • choose action $a'$ in state $s'$ according to policy
    • update Q-values
    • iterate (see the SARSA sketch after this list)
  • sarsa convergence in expectation : when $\mean{\Delta Q(s,a)}=0$
    • to $Q(s,a)=\sum_{s'}P_{s\to s'}^aR_{s\to s'}^a$ if $\Delta Q(s,a)=\eta[r-Q(s,a)]$
    • to Bellman eqs if $\Delta Q(s,a)=\eta[r-(Q(s,a)-\gamma Q(s',a'))]$
  • v-values
  • Q-learning : encode piece of trajectory, no eligibility trace needed
  • off-policy : another policy for update (e.g. greedy)
  • eligibility traces : at the moment of reward, update also previous action values along trajectories
  • continuous state spaces : $Q(s,a)=\sum_j w_{aj}\Phi(s-s_j)$
  • temporal difference (TD) sarsa algo
  • RL with linear function approximation
    • when state-spaces large or continuous
    • generalize between similar states
    • restricted to linearly separable state space dimensions
    • often fails (local minima, diverges) when complex or unknown structure of state space
    • value function as neural nets : trained with SGD (supervised)
  • experience replay : learning works best on uncorrelated data so store all state, action, reward transitions and train by randomly sampling from list
  • network multiplexing : solve moving target problem by keeping the target constant for some time
  • action values Q vs state values V : risk-seeking/aversive
  • when to avoid TD methods (Q-learning, sarsa) : hard to tune and may diverge; best suited to fully observable Markov processes
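
As referenced in the SARSA steps above, a tabular SARSA sketch with an $\epsilon$-greedy policy on a made-up 5-state chain task; the update follows $\Delta Q(s,a)=\eta[r+\gamma Q(s',a')-Q(s,a)]$.

```python
import numpy as np

# Tabular SARSA on a tiny chain environment (5 states, actions left/right,
# reward 1 at the right end). The environment is a made-up illustration.
rng = np.random.default_rng(6)
n_states, eta, gamma, eps = 5, 0.1, 0.9, 0.1
Q = np.zeros((n_states, 2))                      # Q-values, actions: 0 = left, 1 = right

def policy(s):
    if rng.random() < eps:                       # explore with probability epsilon
        return int(rng.integers(2))
    return int(np.argmax(Q[s]))                  # otherwise exploit (greedy)

for episode in range(500):
    s, a = 0, policy(0)
    while True:
        s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s2 == n_states - 1 else 0.0
        a2 = policy(s2)                          # choose a' in s' with the same policy
        Q[s, a] += eta * (r + gamma * Q[s2, a2] - Q[s, a])   # SARSA update
        if r > 0:                                # terminal state reached
            break
        s, a = s2, a2
print(np.round(Q, 2))
```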

Policy gradient methods

  • policy gradient : forget Q-values, directly optimize the reward obtained for an action
    • associate actions with stimuli in a stochastic fashion
    • change policy parametrically using gradient
    • optimize total expected reward : $J=\mean{\mean{R}_a}_s=\sum_sP(s)\sum_a\pi(a\mid s)R(a,s)$ with stimuli $s$ (states implicitly)
  • log-likelihood trick : $\frac{\partial}{\partial w_i}\pi=\pi\frac{\partial}{\partial w_i}\ln\pi$
  • policy gradient estimation : sample average as Monte Carlo approximation $\Delta w_i=\eta\frac{\partial}{\partial w_i}\mean{\mean{R}_a}_s\approx\eta\frac{1}{N}\sum_{n=1}^N\frac{\partial}{\partial w_i}\ln\pi(a_n\mid s_n)R(a_n\mid s_n)$ (see the sketch at the end of this section)
  • biology gradient
  • matching law : the average reward is the same across the multiple sources
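
A policy-gradient sketch (REINFORCE-style) for a single stimulus with two actions and a softmax policy, using the log-likelihood trick so that each sample contributes $\frac{\partial}{\partial w_i}\ln\pi(a)\,R$; the reward probabilities are hypothetical.

```python
import numpy as np

# Policy gradient on a two-action bandit with a softmax policy
# pi(a) = exp(beta*w_a) / sum_a' exp(beta*w_a').
rng = np.random.default_rng(7)
p_reward = np.array([0.2, 0.8])                   # hypothetical reward probability per action
w, eta, beta = np.zeros(2), 0.1, 1.0

for trial in range(2000):
    pi = np.exp(beta * w) / np.exp(beta * w).sum()    # softmax policy
    a = rng.choice(2, p=pi)                           # sample an action
    R = float(rng.random() < p_reward[a])             # stochastic binary reward
    grad_logpi = beta * ((np.arange(2) == a) - pi)    # d/dw log pi(a)
    w += eta * grad_logpi * R                         # gradient ascent on expected reward
print("learned policy:", np.round(np.exp(beta * w) / np.exp(beta * w).sum(), 2))
```

The gradient of the log-softmax is $\beta(\mathbb{1}_{i=a}-\pi_i)$, so actions followed by reward become more probable, consistent with the stochastic policy-gradient picture above.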