R: develop linear regression / OLS further ---> up to the analytical solution
R: NN Tesla HydraNet
R: prepare one code example
R: prepare handwritten developments
class: middle, center
.width-50[]
Today
.center.width-50[]
Make our agents capable of self-improvement through a learning mechanism.
Search algorithms, using a state space specified by domain knowledge.
(Constraint satisfaction problems, by exploiting a known structure of the states.)
(Logical inference, using well-specified facts and inference rules.)
Adversarial search, for known and fully observable games.
Reasoning about uncertain knowledge, as represented using domain-motivated probabilistic models.
Taking optimal decisions, under uncertainty and possibly under partial observation.
Sufficient to implement complex and rational behaviors, in some situations.
.alert[Aren't we missing something?]
???
Is that intelligence? Aren't we missing a critical component?
=> Learning component
The question is then to determine what should be pre-defined and what should be learned.
class: middle
Learning agents
What if the environment is unknown?
Learning can be used as a system construction method.
Expose the agent to reality rather than trying to hardcode reality into the agent's program.
Learning provides an automated way to modify the agent's internal decision mechanisms to improve its own performance.
class: middle
.center.width-80[]
???
Performance element:
The current system for selecting actions.
The critic observes the world and passes information to the learning element.
The learning element tries to modify the performance element to avoid reproducing this situation in the future.
The problem generator identifies areas of behavior in need of improvement and suggests experiments.
class: middle
The design of the learning element is dictated by:
What type of performance element is used.
Which functional component is to be learned.
How that functional component is represented.
What kind of feedback is available.
.center.width-80[]
class: middle
Bayesian learning
Bayesian learning
Frame learning as a Bayesian update of a probability distribution ${\bf P}(H)$ over a hypothesis space, where
$H$ is the hypothesis variable
values are $h_1$, $h_2$, ...
the prior is ${\bf P}(H)$,
$\mathbf{d}$ is the observed data.
class: middle
Given data, each hypothesis has a posterior probability
$$P(h_i|\mathbf{d}) = \frac{P(\mathbf{d}|h_i) P(h_i)}{P(\mathbf{d})},$$ where $P(\mathbf{d}|h_i)$ is the likelihood of the hypothesis.
class: middle
Predictions use a likelihood-weighted average over the hypotheses:
$$P(X|\mathbf{d}) = \sum_i P(X|\mathbf{d}, h_i) P(h_i | \mathbf{d}) = \sum_i P(X|h_i) P(h_i | \mathbf{d})$$
No need to pick one best-guess hypothesis!
class: middle
Example
Suppose there are five kinds of bags of candies. Assume a prior ${\bf P}(H)$:
$P(h_1)=0.1$, with $h_1$: 100% cherry candies
$P(h_2)=0.2$, with $h_2$: 75% cherry candies + 25% lime candies
$P(h_3)=0.4$, with $h_3$: 50% cherry candies + 50% lime candies
$P(h_4)=0.2$, with $h_4$: 25% cherry candies + 75% lime candies
$P(h_5)=0.1$, with $h_5$: 100% lime candies
.center.width-70[![](figures/lec7/candies.png)]
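This update can be sketched in a few lines of Python. The prior and the per-bag lime probabilities are those listed above; the number of observed limes is a hypothetical choice for illustration:

```python
import numpy as np

# Prior P(h_i) over the five bag hypotheses, and P(lime | h_i) for each bag.
prior = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

def posterior(n_limes):
    """Posterior P(h_i | d) after observing n_limes lime candies in a row."""
    unnormalized = p_lime ** n_limes * prior   # likelihood x prior
    return unnormalized / unnormalized.sum()   # normalize by P(d)

post = posterior(10)         # ten limes in a row strongly favor h_5
p_next_lime = post @ p_lime  # prediction P(next candy is lime | d)
```

The prediction is the likelihood-weighted average over all five hypotheses, with no commitment to a single best guess.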
class: middle
Then we observe candies drawn from some bag:
.center.width-40[]
What kind of bag is it?
What flavor will the next candy be?
class: middle
Posterior probability of hypotheses
.center.width-60[]
class: middle
Prediction probability
.center.width-60[]
This example illustrates the fact that the Bayesian prediction eventually agrees with the true hypothesis.
The posterior probability of any false hypothesis eventually vanishes (under weak assumptions).
Maximum a posteriori
Summing over the hypothesis space is often intractable.
Instead,
maximum a posteriori (MAP) estimation consists in using the hypothesis
$$
\begin{aligned}
h_\text{MAP} &= \arg \max_{h_i} P(h_i | \mathbf{d}) \\
&= \arg \max_{h_i} P(\mathbf{d}|h_i) P(h_i) \\
&= \arg \max_{h_i} \log P(\mathbf{d}|h_i) + \log P(h_i)
\end{aligned}$$
Log terms can be viewed as (the negative number of) bits to encode data given hypothesis + bits to encode hypothesis.
This is the basic idea of minimum description length learning, i.e., Occam's razor.
Finding the MAP hypothesis is often much easier than Bayesian learning.
It requires solving an optimization problem instead of a large summation problem.
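For the candy example above, the MAP optimization is a simple arg max over the five hypotheses, done in log space as in the derivation (a minimal sketch; the prior and lime probabilities are those of the example):

```python
import numpy as np

prior = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

def h_map(n_limes):
    """Index of the MAP hypothesis after observing n_limes limes in a row."""
    with np.errstate(divide="ignore"):  # log(0) -> -inf is harmless here
        log_post = np.log(p_lime ** n_limes) + np.log(prior)
    return int(np.argmax(log_post))
```

Note how quickly the MAP hypothesis switches: it is $h_3$ under the prior alone, but becomes $h_5$ after only three limes.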
Maximum likelihood
For large data sets, the prior ${\bf P}(H)$ becomes irrelevant.
In this case, maximum likelihood estimation (MLE) consists in using the hypothesis
$$h_\text{MLE} = \arg \max_{h_i} P(\mathbf{d} | h_i).$$
Identical to MAP for uniform prior.
Maximum likelihood estimation is the standard (non-Bayesian) statistical learning method.
class: middle
Recipe
Choose a parameterized family of models to describe the data (e.g., a Bayesian network).
Write down the log-likelihood $L$ of the parameters $\theta$.
Write down the derivative of the log-likelihood with respect to the parameters $\theta$.
Find the parameter values $\theta^*$ such that the derivatives are zero and check whether the Hessian is negative definite.
???
Note that:
evaluating the likelihood may require summing over hidden variables, i.e., inference.
finding $\theta^*$ may be hard; modern optimization techniques help.
Parameter estimation in Bayesian networks
.center.width-100[]
class: middle
MLE, case (a)
What is the fraction $\theta$ of cherry candies?
Any $\theta \in [0,1]$ is possible: continuum of hypotheses $h_\theta$.
$\theta$ is a parameter for this binomial family of models.
Suppose we unwrap $N$ candies, and get $c$ cherries and $l=N-c$ limes.
These are i.i.d. observations, therefore
$$P(\mathbf{d}|h_\theta) = \prod_{j=1}^N P(d_j | h_\theta) = \theta^c (1-\theta)^l.$$
Maximize this w.r.t. $\theta$, which is easier for the log-likelihood:
$$\begin{aligned}
L(\mathbf{d}|h_\theta) &= \log P(\mathbf{d}|h_\theta) = c \log \theta + l \log(1-\theta) \\
\frac{d L(\mathbf{d}|h_\theta)}{d \theta} &= \frac{c}{\theta} - \frac{l}{1-\theta}=0.
\end{aligned}$$
Hence $\theta=\frac{c}{N}$.
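A quick numerical sanity check of this result, maximizing the log-likelihood over a grid of $\theta$ values (the counts are hypothetical):

```python
import numpy as np

c, l = 7, 3   # hypothetical counts: 7 cherries, 3 limes
N = c + l

def log_likelihood(theta):
    # L(d | h_theta) = c log(theta) + l log(1 - theta)
    return c * np.log(theta) + l * np.log(1 - theta)

# Grid search over (0, 1); the maximizer should sit at c / N = 0.7.
thetas = np.linspace(0.001, 0.999, 999)
theta_hat = thetas[np.argmax(log_likelihood(thetas))]
```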
???
Highlight that using the empirical estimate as an estimator of the mean can be viewed as a consequence of
deciding on a probabilistic model
maximum likelihood estimation under this model
Seems sensible, but causes problems with $0$ counts!
class: middle
MLE, case (b)
Red and green wrappers depend probabilistically on flavor.
E.g., the likelihood for a cherry candy in green wrapper:
$$\begin{aligned}
&P(\text{cherry}, \text{green}|h_{\theta,\theta_1, \theta_2}) \\
&= P(\text{cherry}|h_{\theta,\theta_1, \theta_2}) P(\text{green}|\text{cherry}, h_{\theta,\theta_1, \theta_2}) \\
&= \theta (1-\theta_1).
\end{aligned}$$
The likelihood for the data, given $N$ candies, $r_c$ red-wrapped cherries, $g_c$ green-wrapped cherries, etc., is:
$$\begin{aligned}
P(\mathbf{d}|h_{\theta,\theta_1, \theta_2}) =&\,\, \theta^c (1-\theta)^l \theta_1^{r_c}(1-\theta_1)^{g_c} \theta_2^{r_l} (1-\theta_2)^{g_l} \\
L =&\,\, c \log \theta + l \log(1-\theta) + \\
&\,\, r_c \log \theta_1 + g_c \log(1-\theta_1) + \\
&\,\, r_l \log \theta_2 + g_l \log(1-\theta_2)
\end{aligned}$$
\end{aligned}$$
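Each parameter appears in exactly two terms of $L$, so setting the partial derivatives to zero decouples into three copies of the case (a) problem, giving $\theta = c/N$, $\theta_1 = r_c/c$ and $\theta_2 = r_l/l$. A sketch with hypothetical counts, checking that the closed-form estimates do maximize $L$:

```python
import numpy as np

# Hypothetical counts from unwrapping N candies.
r_c, g_c = 30, 10    # red- and green-wrapped cherries
r_l, g_l = 15, 45    # red- and green-wrapped limes
c, l = r_c + g_c, r_l + g_l
N = c + l

# Closed-form MLE from the zero-derivative conditions.
theta = c / N        # P(cherry)
theta1 = r_c / c     # P(red wrapper | cherry)
theta2 = r_l / l     # P(red wrapper | lime)

def L(th, th1, th2):
    return (c * np.log(th) + l * np.log(1 - th)
            + r_c * np.log(th1) + g_c * np.log(1 - th1)
            + r_l * np.log(th2) + g_l * np.log(1 - th2))
```

Nudging any of the three parameters away from its estimate lowers $L$.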
.exercise[How would you write a computer program that recognizes cats from dogs?]
class: middle
.center.width-60[]
count: false
class: black-slide, middle
.center.width-50[]
.center[The good old-fashioned approach.]
count: false
class: black-slide, middle
.center.width-80[]
count: false
class: black-slide, middle
.center.width-80[]
class: middle
.center.width-100[]
.center[The deep learning approach.]
Problem statement
Let us assume data $\mathbf{d} \sim p(\mathbf{x}, y)$ of $N$ example input-output pairs
$$\mathbf{d} = \{ (\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), ..., (\mathbf{x}_N, y_N) \},$$
where
$\mathbf{x}_i$ are the input data and
$y_i$ was generated by an unknown function $y_i=f(\mathbf{x}_i)$.
From this data, we want to find a function $h \in \mathcal{H}$ that approximates the true function $f$.
???
$\mathcal{H}$ is huge! How do we find a good hypothesis?
class: middle
.center.width-10[]
In general, $f$ will be stochastic. In this case, $y$ is not strictly a function of $\mathbf{x}$, and we wish instead to learn the conditional $p(y|\mathbf{x})$.
Most of supervised learning is actually (approximate) maximum likelihood estimation on (huge) parametric models.
class: middle
Feature vectors
Input samples $\mathbf{x} \in \mathbb{R}^d$ are described as real-valued vectors of $d$ attributes or features values.
If the data is not originally expressed as real-valued vectors, then it needs to be prepared and transformed to this format.
.center.width-90[]
Linear regression uses a linear Gaussian model as its parametric model of $p(y|\mathbf{x})$, that is
$$p(y|\mathbf{x}) = \mathcal{N}(y | \mathbf{w}^T \mathbf{x} + b, \sigma^2),$$
where $\mathbf{w}$ and $b$ are parameters to determine.
To learn the conditional distribution $p(y|\mathbf{x})$, we maximize
$$p(y|\mathbf{x}) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y-(\mathbf{w}^T \mathbf{x} + b))^2}{2\sigma^2}\right)$$
w.r.t. $\mathbf{w}$ and $b$ over the data $\mathbf{d} = \{ (\mathbf{x}_j, y_j) \}$.
--
count: false
By setting the derivatives of the log-likelihood to $0$, we arrive at the problem of minimizing
$$\sum_{j=1}^N (y_j - (\mathbf{w}^T \mathbf{x}_j + b))^2.$$
Therefore, minimizing the sum of squared errors corresponds to the MLE solution for a linear fit, assuming Gaussian noise of fixed variance.
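A sketch of this with NumPy on synthetic data (the true parameters and noise level are made up for the demonstration). Appending a dummy feature of ones absorbs $b$ into the weight vector, and the least-squares solution recovers the parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a known linear model plus Gaussian noise of fixed variance.
w_true, b_true = np.array([2.0, -1.0]), 0.5
X = rng.normal(size=(100, 2))
y = X @ w_true + b_true + 0.1 * rng.normal(size=100)

# Append a column of ones so the bias becomes an extra weight,
# then solve the least-squares problem (the MLE under Gaussian noise).
Xb = np.hstack([X, np.ones((100, 1))])
params, *_ = np.linalg.lstsq(Xb, y, rcond=None)
w_hat, b_hat = params[:2], params[2]
```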
The linear classifier model is a squashed linear function of its inputs:
$$h(\mathbf{x}; \mathbf{w}, b) = \text{sign}(\mathbf{w}^T \mathbf{x} + b)$$
.center.width-60[]
class: middle
.center.width-30[]
Without loss of generality, the model can be rewritten without $b$ as $h(\mathbf{x}; \mathbf{w}) = \text{sign}(\mathbf{w}^T \mathbf{x})$, where $\mathbf{w} \in \mathbb{R}^{d+1}$ and $\mathbf{x}$ is extended with a dummy element $x_0 = 1$.
Predictions are computed by comparing the feature vector $\mathbf{x}$ to the weight vector $\mathbf{w}$. Geometrically, $\mathbf{w}^T \mathbf{x}$ corresponds to $||\mathbf{w}|| ||\mathbf{x}|| \cos(\theta)$.
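A minimal sketch of this classifier in Python, using the dummy-feature trick (the weights and inputs are hypothetical values for illustration):

```python
import numpy as np

def predict(w, x):
    """Linear classifier; the bias is absorbed as w[0], with x[0] = 1."""
    return np.sign(w @ x)

w = np.array([-1.0, 2.0, 1.0])   # [b, w1, w2], hypothetical weights
x = np.array([1.0, 0.5, 1.5])    # dummy x0 = 1, then the two features
# predict(w, x) = sign(-1 + 2*0.5 + 1*1.5) = sign(1.5) = 1.0
```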
???
The family $\mathcal{H}$ of hypotheses is induced from the set $\mathbb{R}^{d+1}$ of possible parameter values $\mathbf{w}$. Learning consists in finding a good vector $\mathbf{w}$ in this space.
Perceptron
.grid[
.kol-1-2[
Start with $\mathbf{w}=0$.
For each training example $(\mathbf{x},y)$:
Classify with current weights: $\hat{y} = \text{sign}(\mathbf{w}^T \mathbf{x})$