GNU General Public License v3.0 licensed. Source available on github.com/zifeo/EPFL.
Fall 2016: Unsupervised and Reinforcement Learning in Neural Networks
[TOC]
- learning : act of acquiring new or modifying/reinforcing existing knowledge/behaviors/skills
- long term memory
- explicit (declarative) : episodic (events), semantic (concepts)
- implicit : procedural (skills), emotional conditioning, priming effect, conditioned reflex
- types of learning
- supervised : labeled data
- unsupervised : no explicit teachings, concept formulation, detect intrinsic structure
- reinforced : choice of actions, reward upon correct one
- synaptic plasticity : change in connection strength
- neuron models
- compartmental models
- cable models
- point models
- Hodgkin-Huxley : biologically realistic, non-linear, computationally expensive, analytically intractable
- integrate and fire models
- Hebbian learning : if cell $j$ repeatedly takes part in firing cell $i$, then the efficiency of the $j$ to $i$ connection is increased, local rule, correlations
- synaptic dynamics
- unsupervised vs reinforcement
- PCA : orthogonal transformation to convert correlated variables into linearly uncorrelated variables
- subtract mean : $\v x-\mean{\v x}$
- calculate covariance matrix : $C_{kj}^0=\mean{(x_k-\mean{x_k})(x_j-\mean{x_j})}$
- calculate eigenvectors : $C^0\cdot\v e_n=\lambda_n\cdot \v e_n$ with $\lambda_1\ge\cdots\ge\lambda_n$ (keep only first component)
- neural implementation : needs more than hebbian learning (sketch below)
- all directions : equally likely
- Oja rule
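A minimal numpy sketch of the PCA recipe above together with an online Oja update; the toy data, learning rate, and variable names are illustrative assumptions, not from the course:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # toy correlated data

# batch PCA: subtract mean, covariance matrix, eigendecomposition
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(C)          # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                          # principal component = largest eigenvalue

# online Oja rule: w converges to the first principal component
w = rng.normal(size=2)
eta = 0.01
for x in Xc:
    y = w @ x
    w += eta * y * (x - y * w)                # Hebbian term + normalizing decay

print(pc1, w / np.linalg.norm(w))             # same direction, up to sign
```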
- ICA : find natural axis
- renormalize : mean zero, unit variance
- whiten : PCA, divide each component by $\sqrt{\lambda_n}$
- signal : mixed $x=As$, recover through $s\approx y=Wx$
- non-gaussianity : maximise $J(y)=[E(F(y))-E_{Gauss}(F(v))]^2$
- $J(w)=[\gamma(w)-a]^2$ : $\frac{\partial J}{\partial w}=0\iff\frac{\partial\gamma}{\partial w}=0$
- kurtosis : $K(y)=E(y^4)-3E(y^2)^2$
- gradient ascent : $\Delta w_n=\eta E[x_n g(\sum_k w_kx_k)]$ for $J(\v w)=E[F(\v w^\top\v x)]=E[F(y)]$
- kurtosis : of $y=\v w^\top\v x$ is at a local max if $\v w$ projects in the direction of an independent component
- ICA gradient ascent
- batch : $\v w^{new}=\v w^{old}+\eta E[\v x g(\v w^\top\v x)]$
- online : $\Delta\v w=\eta\v x g(\v w^\top\v x)$
- online-components : $\Delta w_{ij}=\eta x_j g_i(\sum_k w_{ik}x_k)$
- newton methods
- fast ICA
- initialize : $\v w$
- newton step : $\v w^{new}=E[\v x g(\v w^\top\v x)-\v w g'(\v w^\top \v x)]$
- renormalize : $\v w=\frac{\v w^{new}}{\abs{\v w^{new}}}$
- loop if not converged (sketch below)
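A sketch of the one-unit fastICA loop above, assuming already-whitened data `Z` and the common choice $g=\tanh$ (the notes do not fix $g$):

```python
import numpy as np

def fast_ica_unit(Z, n_iter=100, tol=1e-6, seed=0):
    """One-unit fastICA on whitened data Z (samples x dims), with g = tanh (assumed)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = Z @ w                                    # projections w^T x
        g, g_prime = np.tanh(y), 1.0 - np.tanh(y) ** 2
        w_new = (Z * g[:, None]).mean(axis=0) - g_prime.mean() * w   # Newton step
        w_new /= np.linalg.norm(w_new)               # renormalize
        if np.abs(np.abs(w_new @ w) - 1.0) < tol:    # converged (up to sign)
            return w_new
        w = w_new
    return w
```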
- temporal ICA : find unmixing matrix $W$ s.t. outputs are time-independent
- $\mean{y_i(t)y_k(t-\tau)}=\delta_{ik}\lambda(\tau)$ for all delays $\tau$, and $\mean{y_i(t)y_k(t)}=\delta_{ik}$
- application : development of localized edge-detector receptive fields (Gabor wavelets)
- hebbian learning = unsupervised learning
- in linear neurons : PCA
- in nonlinear neurons : ICA
- network wiring is adaptive, not fixed
- prototype : $\v w_k$
- closest rule : $\abs{\v w_k-\v x}\le\abs{\v w_i-\v x};\forall i$
- clustering : discretization, compression, quantization
- reconstruction error : mean $E=\sum_k\sum_{\mu\in C_k}(\v w_k-\v x^\mu)^2$
- k-means online :
- determine winner $k$ : $\abs{\v w_k-\v x^\mu}\le\abs{\v w_i-\v x^\mu};\forall i$ (network dynamics)
- update this prototype : $\Delta\v w_k=-\eta(\v w_k-\v x^\mu)$ (Oja's hebbian learning)
- reduce : $\eta$ (e.g. $1/N_k$, developmental change)
- k-means batch
- for every datapoint find the closest prototype
- update weights of all prototypes : $\Delta\v w_k=-\eta\sum_{\mu\in C_k}(\v w_k-\v x^\mu)$
- equivalence : batch and online give the same result for small learning rate (sketch below)
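A small numpy sketch contrasting the online and batch k-means updates above; initialization from data samples (to avoid dead units), the learning rate, and the epoch counts are assumptions:

```python
import numpy as np

def kmeans_online(X, K, eta=0.05, epochs=10, seed=0):
    """Online k-means: move only the winning prototype towards each sample."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), K, replace=False)].astype(float)   # init from data samples
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            k = np.argmin(np.linalg.norm(W - x, axis=1))        # winner
            W[k] -= eta * (W[k] - x)                            # Delta w_k = -eta (w_k - x)
    return W

def kmeans_batch(X, K, epochs=20, seed=0):
    """Batch k-means: assign all points, then move each prototype to its cluster mean."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(epochs):
        labels = np.argmin(np.linalg.norm(X[:, None, :] - W[None], axis=2), axis=1)
        for k in range(K):
            if np.any(labels == k):
                W[k] = X[labels == k].mean(axis=0)              # full step, i.e. eta = 1/N_k
    return W
```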
- neuronal implementation
- equivalence : assuming weights normalized, $\v w_k^\top\v x^\mu\ge\v w_i^\top\v x^\mu;\forall i \iff \abs{\v w_k-\v x^\mu}\le\abs{\v w_i-\v x^\mu};\forall i$
- dead units : due to local minima, initialize weights to data samples and update winners with neighboring losers
- winner take all : winner excites itself, inhibits others, add noise
- model
- Kohonen's model : topographical feature map, spatially bounded excitation zone, input space mapped onto cortical space
- Kohonen algorithm
- choose input : $\v x^\mu$
- determine winner $k$ : $\abs{\v w_k-\v x^\mu}\le\abs{\v w_i-\v x^\mu};\forall i$
- update weights of winner and neighbors : $\Delta \v w_j=\eta\Lambda(j,k)(\v x^\mu-\v w_j)$
- decrease size of neighborhood function
- iterate until stabilization (sketch below)
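A sketch of the Kohonen algorithm on a 1D chain of units, with a Gaussian neighborhood $\Lambda(j,k)$ whose width shrinks over epochs; the grid topology and the decay schedule are illustrative assumptions:

```python
import numpy as np

def kohonen_som(X, n_units=20, epochs=30, eta=0.1, sigma0=5.0, seed=0):
    """1D Kohonen map: winner-take-all plus Gaussian neighborhood on a chain of units."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), n_units, replace=False)].astype(float)
    grid = np.arange(n_units)                                   # positions in output space
    for epoch in range(epochs):
        sigma = sigma0 * (0.1 / sigma0) ** (epoch / (epochs - 1))   # shrink neighborhood
        for x in X[rng.permutation(len(X))]:
            k = np.argmin(np.linalg.norm(W - x, axis=1))            # winner
            Lam = np.exp(-(grid - k) ** 2 / (2 * sigma ** 2))       # Lambda(j, k)
            W += eta * Lam[:, None] * (x - W)                       # update winner + neighbors
    return W
```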
- elastic net algorithm
- choose input
- calculate relative importance of $j$ : $\alpha(\abs{\v x^\mu -\v w_j})$
- update everybody : $\Delta \v w_j=\eta\alpha(\abs{\v x^\mu-\v w_j})(\v x^\mu-\v w_j)+\sum_{k\in T_j}(\v w_k-\v w_j)$ with $T_j$ neighbors in input space
- iterate until stabilization
- importance factor : $\alpha(\abs{\v x^\mu -\v w_j})=\frac{\exp(-\gamma\abs{\v x^\mu-\v w_j}^2)}{\sum_k\exp(-\gamma\abs{\v x^\mu-\v w_k}^2)}$
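A sketch of one elastic-net update, following the update rule above literally; the neighbor topology `neighbors` and the constants are assumptions:

```python
import numpy as np

def elastic_net_step(W, x, neighbors, eta=0.05, gamma=10.0):
    """One elastic-net update for input x: softmax importance alpha_j, then pull every
    prototype towards x (weighted by alpha_j) and towards its neighbors.
    `neighbors` maps index j -> list of neighbor indices T_j (topology assumed)."""
    d2 = np.sum((x - W) ** 2, axis=1)
    alpha = np.exp(-gamma * d2)
    alpha /= alpha.sum()                                   # importance factor
    W_new = W.copy()
    for j in range(len(W)):
        pull = sum(W[k] - W[j] for k in neighbors[j])      # elastic term towards neighbors
        W_new[j] = W[j] + eta * alpha[j] * (x - W[j]) + pull
    return W_new
```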
- self-organization of cortex
- neurons have localized receptive fields (RF)
- neighboring neurons have similar RF
- RF and neighbors change with experience
- simple cells and RF as Gabor filters
- complex cells and energy model
- position invariant cells in inferotemporal cortex (IT) : side (bottom) part of the ventral pathway
- hebbian learning and RF development
- localized receptive fields with orientation tuning : hebbian learning + linear neurons + arborization function
- spike-based hebbian learning
- spike-time dependent plasticity by local variable
- induction rule : abs
- STDP model yields BCM
- conditioning : Pavlov's dog
- RL theory
- exploration exploitation dilemma
- explore : estimate reward probabilities
- exploitation : take action which looks optimal to maximize reward
- strategy/policy
- greedy : best action $Q(s,a^*)>Q(s,a_j)$
- $\epsilon$-greedy : best action with probability $P=1-\epsilon$
- softmax : action with probability $P(a')=\frac{\exp(\beta Q(a'))}{\sum_a\exp(\beta Q(a))}$
- temperature : $T=1/\beta$
- optimistic : start with Q-values that are too big (sketch below)
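A sketch of the three action-selection policies above applied to a Q-table row; the parameter defaults are arbitrary:

```python
import numpy as np

def select_action(Q, s, rng, mode="epsilon", epsilon=0.1, beta=2.0):
    """Pick an action from the Q-table row Q[s] with a greedy, epsilon-greedy, or softmax policy."""
    q = Q[s]
    if mode == "greedy":
        return int(np.argmax(q))
    if mode == "epsilon":                        # best action with probability 1 - epsilon
        if rng.random() < epsilon:
            return int(rng.integers(len(q)))
        return int(np.argmax(q))
    # softmax: P(a) proportional to exp(beta * Q(a)), temperature T = 1/beta
    p = np.exp(beta * (q - q.max()))             # shift for numerical stability
    p /= p.sum()
    return int(rng.choice(len(q), p=p))
```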
- multi-step horizon
- Bellman equation : $Q(s,a)=\sum_{s'}P_{s\to s'}^a(R_{s\to s'}^a+\gamma\sum_{a'}\pi(s',a')Q(s',a'))$
- sarsa (sketch below)
- initialize Q-values, start from initial state
- in state $s$, choose action $a$ according to policy
- observe reward $r$ and next state $s'$
- choose action $a'$ in state $s'$ according to policy
- update Q-values
- iterate
- sarsa convergence in expectation : when $\mean{\Delta Q(s,a)}=0$
- to $Q(s,a)=\sum_{s'}P_{s\to s'}^aR_{s\to s'}^a$ if $\Delta Q(s,a)=\eta[r-Q(s,a)]$
- to the Bellman equation if $\Delta Q(s,a)=\eta[r-(Q(s,a)-\gamma Q(s',a'))]$
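A tabular sarsa sketch with an $\epsilon$-greedy policy; the environment interface (`reset()` returning a state, `step(a)` returning `(s', r, done)`) is an assumption, not something fixed by the notes:

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500, eta=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular sarsa with an epsilon-greedy policy (env interface assumed, see above)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def policy(s):                                # epsilon-greedy on current Q
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = policy(s_next)
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += eta * (target - Q[s, a])   # Delta Q = eta [r - (Q(s,a) - gamma Q(s',a'))]
            s, a = s_next, a_next
    return Q
```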
- v-values
- Q-learning : encode piece of trajectory, no eligibility trace needed
- off-policy : another policy for update (e.g. greedy)
- eligibility traces : at the moment of reward, update also previous action values along trajectories
- continuous state spaces : $Q(s,a)=\sum_j w_{aj}\Phi(s-s_j)$
- temporal difference (TD) sarsa algo
- RL with linear function approximation (sketch below)
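A sketch of a TD-sarsa update with the linear function approximation $Q(s,a)=\sum_j w_{aj}\Phi(s-s_j)$, using Gaussian basis functions on a 1D state space (the centers and width are assumptions):

```python
import numpy as np

def make_features(centers, width=0.5):
    """Gaussian basis Phi(s - s_j) on a 1D state space (centers and width assumed)."""
    def phi(s):
        return np.exp(-((s - centers) ** 2) / (2 * width ** 2))
    return phi

def td_sarsa_update(w, phi, s, a, r, s_next, a_next, eta=0.05, gamma=0.95, done=False):
    """One TD-sarsa step for Q(s,a) = sum_j w[a, j] * Phi(s - s_j)."""
    q = w[a] @ phi(s)
    q_next = 0.0 if done else w[a_next] @ phi(s_next)
    delta = r + gamma * q_next - q                 # TD error
    w[a] += eta * delta * phi(s)                   # gradient of Q w.r.t. w[a] is phi(s)
    return w
```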
- experience replay : learning works best on uncorrelated data so store all state, action, reward transitions and train by randomly sampling from list
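A minimal replay-buffer sketch for the idea above (capacity and batch size are arbitrary):

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (s, a, r, s', done) transitions and sample uncorrelated minibatches."""
    def __init__(self, capacity=10000, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return self.rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```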
- network multiplexing : solve moving target problem by keeping the target constant for some time
- action values Q vs state values V : risk-seeking/aversive
- when to avoid TD methods (Q-learning, sarsa) : hard to tune, can diverge; only good for fully observable Markov processes
- policy gradient : forget Q-values, optimize directly the reward for an action
- associate actions with stimuli in a stochastic fashion
- change policy parametrically using gradient
- optimize total expected reward : $J=\mean{\mean{R}_a}_s=\sum_sP(s)\sum_a\pi(a\mid s)R(a,s)$ with stimuli $s$ (states implicitly)
- log-likelihood trick
- policy gradient estimation : sample average as Monte Carlo approximation $\Delta w_i=\eta\frac{\partial}{\partial w_i}\mean{\mean{R}_a}_s\approx\eta\frac{1}{N}\sum_{n=1}^N\frac{\partial}{\partial w_i}\ln\pi(a_n\mid s_n)R(a_n,s_n)$
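A Monte Carlo policy-gradient sketch for the immediate-reward setting above, with a softmax policy and the log-likelihood trick; the toy reward table `R[s, a]` and the stimulus distribution `p_s` are assumptions:

```python
import numpy as np

def policy_gradient(R, p_s, episodes=5000, eta=0.1, seed=0):
    """Sample-based gradient ascent on J for immediate rewards R[s, a], stimuli drawn
    from p_s, and a softmax policy pi(a|s) proportional to exp(w[s, a]).
    Log-likelihood trick: grad J = E[ grad log pi(a|s) * R(a, s) ]."""
    rng = np.random.default_rng(seed)
    n_s, n_a = R.shape
    w = np.zeros((n_s, n_a))
    for _ in range(episodes):
        s = rng.choice(n_s, p=p_s)                        # sample stimulus
        pi = np.exp(w[s] - w[s].max()); pi /= pi.sum()    # softmax policy
        a = rng.choice(n_a, p=pi)                         # sample action
        grad_logpi = -pi; grad_logpi[a] += 1.0            # d log pi(a|s) / d w[s, :]
        w[s] += eta * R[s, a] * grad_logpi                # Monte Carlo gradient step
    return w
```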
- biology gradient
- matching law : average reward is the same across multiple sources