class: middle, center, title-slide
Lecture 4: Representing uncertain knowledge
Prof. Gilles Louppe
[email protected]
class: middle, center
???
Motivate why this is important in AI (and this is not just one more probability theory class).
.grid[ .kol-1-2[
- Probability:
- Random variables
- Joint and marginal distributions
- Conditional distributions
- Product rule, Chain rule, Bayes' rule
- Inference
- Bayesian networks:
.alert[Do not overlook this lecture!]
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
class: middle
.grid[ .kol-1-2[ A ghost is hidden in the grid somewhere.
Sensor readings tell how close a square is to the ghost:
- On the ghost: red
- 1 or 2 away: orange
- 3 away: yellow
- 4+ away: green
]
.kol-1-2[.width-100[
]] ] Sensors are noisy, but we know the probability values
$P(\text{color}|\text{distance})$ , for all colors and all distances.
.footnote[Image credits: CS188, UC Berkeley.]
class: middle, black-slide
.center[
.footnote[Image credits: CS188, UC Berkeley.]
???
Could we use a logical agent for this game?
General setup:
- Observed variables or evidence: agent knows certain things about the state of the world (e.g., sensor readings).
- Unobserved variables: agent needs to reason about other aspects that are uncertain (e.g., where the ghost is).
- (Probabilistic) model: agent knows or believes something about how the known variables relate to the unknown variables.
class: middle
How to handle uncertainty?
- A purely logical approach either risks falsehood, or leads to conclusions that are too weak for decision making.
- Probabilistic reasoning provides a framework for managing our knowledge and beliefs.
???
Falsehood: because of ignorance about the world or laziness in the model.
Weak conclusions: remember the Wumpus example.
Probabilistic assertions express the agent's inability to reach a definite decision regarding the truth of a proposition.
- Probability values summarize effects of
- laziness (failure to enumerate all world states)
- ignorance (lack of relevant facts, initial conditions, correct model, etc).
- Probabilities relate propositions to one's own state of knowledge (or lack thereof).
- e.g.,
$P(\text{ghost in cell } [3,2]) = 0.02$
- e.g.,
class: middle
What do probability values represent?
- The objectivist frequentist view is that probabilities are real aspects of the universe.
- i.e., propensities of objects to behave in certain ways.
- e.g., the fact that a fair coin comes up heads with probability
$0.5$ is a propensity of the coin itself.
- The subjectivist Bayesian view is that probabilities are a way of characterizing an agent's beliefs or uncertainty.
- i.e., probabilities do not have external physical significance.
- This is the interpretation of probabilities that we will use!
Begin with a set $\Omega$, the sample space, whose elements $\omega \in \Omega$ are the possible outcomes (sample points).
A probability space is a sample space equipped with a probability function, i.e. an assignment $P(\omega)$ to every sample point $\omega \in \Omega$ such that:
- 1st axiom: $P(\omega) \in \mathbb{R}$, $0 \leq P(\omega)$ for all $\omega \in \Omega$
- 2nd axiom: $P(\Omega) = 1$
- 3rd axiom: $P(\{ \omega_1, ..., \omega_n \}) = \sum_{i=1}^n P(\omega_i)$ for any set of (distinct) samples $\{ \omega_1, ..., \omega_n \} \subseteq \Omega$
class: middle
- $\Omega$ = the 6 possible rolls of a die.
- $\omega_i$ (for $i=1, ..., 6$) are the sample points, each corresponding to an outcome of the die.
- Assignment $P$ for a fair die: $$P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \frac{1}{6}$$
- A random variable is a function $X: \Omega \to D_X$ from the sample space to some domain defining its outcomes.
  - e.g., $\text{Odd}: \Omega \to \{ \text{true}, \text{false} \}$ such that $\text{Odd}(\omega) = (\omega \,\text{mod}\, 2 = 1)$.
- e.g.,
- $P$ induces a probability distribution for any random variable $X$: $$P(X=x_i) = \sum_{\{\omega: X(\omega)=x_i\}} P(\omega)$$
  - e.g., $P(\text{Odd}=\text{true}) = P(1)+P(3)+P(5) = \frac{1}{2}$.
- An event $E$ is a set of outcomes $\{(x_1, ..., x_n)_i\}$ of the variables $X_1, ..., X_n$, such that $$P(E) = \sum_{(x_1, ..., x_n) \in E} P(X_1=x_1, ..., X_n=x_n).$$
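As a concrete illustration of these definitions (a sketch, not part of the original slides), the following Python snippet encodes the fair-die sample space and recovers the distribution that $P$ induces on $\text{Odd}$:

```python
# Fair-die probability space and the random variable Odd, as defined above.
omega = [1, 2, 3, 4, 5, 6]              # sample space: the 6 possible rolls
P = {w: 1 / 6 for w in omega}           # probability assignment for a fair die

def odd(w):
    """Random variable Odd: Omega -> {True, False}."""
    return w % 2 == 1

# Induced distribution: P(Odd = x) = sum of P(w) over {w : Odd(w) = x}.
P_odd = {x: sum(p for w, p in P.items() if odd(w) == x) for x in (True, False)}
print(P_odd)  # {True: 0.5, False: 0.5}
```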
???
In practice, we will use random variables to represent aspects of the world about which we (may) have uncertainty.
- $R$: Is it raining?
- $T$: Is it hot or cold?
- $L$: Where is the ghost?
class: middle
- Random variables are written in upper roman letters: $X$, $Y$, etc.
- Realizations of a random variable are written in corresponding lower case letters. E.g., $x_1$, $x_2$, ..., $x_n$ could be outcomes of the random variable $X$.
- The probability value of the realization $x$ is written as $P(X=x)$. When clear from context, this will be abbreviated as $P(x)$.
- The probability distribution of the (discrete) random variable $X$ is denoted as ${\bf{P}}(X)$. This corresponds e.g. to a vector of numbers, one for each of the probability values $P(X=x_i)$ (and not to a single scalar value!).
For discrete variables, the probability distribution can be encoded by a discrete list of the probabilities of the outcomes, known as the probability mass function.
One can think of the probability distribution as a table that associates a probability value to each outcome of the variable.
.grid[
.center.kol-1-2[
.footnote[Image credits: CS188, UC Berkeley.]
???
- This table can be infinite!
- By construction, probability values are normalized (i.e., sum to $1$).
class: middle
A joint probability distribution over a set of random variables $X_1, ..., X_n$ assigns a probability $P(X_1=x_1, ..., X_n=x_n)$ to each combination of outcomes $x_1, ..., x_n$.
.center[${\bf P}(T,W)$]
class: middle
From a joint distribution, the probability of any event can be calculated.
- Probability that it is hot and sunny?
- Probability that it is hot?
- Probability that it is hot or sunny?
Interesting events often correspond to partial assignments, e.g. $P(T=\text{hot})$.
class: middle
The marginal distribution of a subset of a collection of random variables is the joint probability distribution of the variables contained in the subset.
.center.grid[
.kol-1-3[
]
.kol-1-3[
]
]
Intuitively, marginal distributions are sub-tables which eliminate variables.
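As a small illustration (a sketch; the joint probability values below are assumptions, not the slide's table), marginalizing ${\bf P}(T, W)$ amounts to summing the joint entries over the eliminated variable:

```python
# Hypothetical joint distribution P(T, W); the numbers are illustrative only.
P_TW = {
    ("hot", "sun"): 0.4, ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2, ("cold", "rain"): 0.3,
}

# Marginal P(T): sum out W.
P_T = {}
for (t, w), p in P_TW.items():
    P_T[t] = P_T.get(t, 0.0) + p

print(P_T)  # {'hot': 0.5, 'cold': 0.5}
```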
???
To what events are marginal probabilities associated?
class: middle
The conditional probability of a realization $a$ given a realization $b$ is defined as $$P(a|b) = \frac{P(a,b)}{P(b)}.$$
Indeed, observing $b$ rules out all the outcomes that are incompatible with $b$; the probabilities of the remaining outcomes are rescaled by $\frac{1}{P(b)}$ so that they sum to $1$.
class: middle
Conditional distributions are probability distributions over some variables, given fixed values for others.
.center.grid[
.kol-1-3[
]
.kol-1-3[
${\bf P}(W | T=\text{hot})$
]
.kol-1-3[
${\bf P}(W | T=\text{cold})$
]
]
class: middle
.center.grid[
.kol-1-3[
]
.kol-1-3[
Select the joint probabilities matching the evidence
]
.kol-1-3[
Normalize the selection (make it sum to $1$)
]
]
Probabilistic inference is the problem of computing a desired probability from other known probabilities (e.g., conditional from joint).
- We generally compute conditional probabilities.
  - e.g., $P(\text{on time} | \text{no reported accidents}) = 0.9$
  - These represent the agent's beliefs given the evidence.
- Probabilities change with new evidence:
  - e.g., $P(\text{on time} | \text{no reported accidents}, \text{5AM}) = 0.95$
  - e.g., $P(\text{on time} | \text{no reported accidents}, \text{rain}) = 0.8$
  - e.g., $P(\text{ghost in } [3,2] | \text{red in } [3,2]) = 0.99$
  - Observing new evidence causes beliefs to be updated.
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
- Evidence variables: $E_1, ..., E_k = e_1, ..., e_k$
- Query variables: $Q$
- Hidden variables: $H_1, ..., H_r$
- $(Q \cup E_1, ..., E_k \cup H_1, ..., H_r)$ = all variables $X_1, ..., X_n$
Inference is the problem of computing ${\bf P}(Q | e_1, ..., e_k)$, i.e. the posterior distribution of the query variables given the evidence.
Start from the joint distribution ${\bf P}(Q, E_1, ..., E_k, H_1, ..., H_r)$ and proceed as follows (a code sketch is given below):
- Select the entries consistent with the evidence $E_1, ..., E_k = e_1, ..., e_k$.
- Marginalize out the hidden variables to obtain the joint of the query and the evidence variables: $${\bf P}(Q,e_1,...,e_k) = \sum_{h_1, ..., h_r} {\bf P}(Q, h_1, ..., h_r, e_1, ..., e_k).$$
- Normalize: $$\begin{aligned} Z &= \sum_q P(q,e_1,...,e_k) \\\\ {\bf P}(Q|e_1, ..., e_k) &= \frac{1}{Z} {\bf P}(Q,e_1,...,e_k) \end{aligned}$$
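The procedure above is short enough to write down explicitly. The following Python sketch (an illustration only; the variable names and probability values are assumptions) answers conditional queries by enumeration over a toy joint distribution:

```python
from collections import defaultdict

# Toy joint distribution P(S, T, W) over season, temperature and weather;
# all values below are made up for the example.
joint = {
    ("winter", "hot", "sun"): 0.10, ("winter", "hot", "rain"): 0.05,
    ("winter", "cold", "sun"): 0.15, ("winter", "cold", "rain"): 0.20,
    ("summer", "hot", "sun"): 0.30, ("summer", "hot", "rain"): 0.05,
    ("summer", "cold", "sun"): 0.10, ("summer", "cold", "rain"): 0.05,
}
variables = ("S", "T", "W")

def infer(joint, variables, query, evidence):
    """Return P(query | evidence) by enumeration over the full joint."""
    q = variables.index(query)
    posterior = defaultdict(float)
    for outcome, p in joint.items():
        # 1. Select the entries consistent with the evidence.
        if all(outcome[variables.index(var)] == val for var, val in evidence.items()):
            # 2. Marginalize out the hidden variables.
            posterior[outcome[q]] += p
    # 3. Normalize.
    Z = sum(posterior.values())
    return {value: p / Z for value, p in posterior.items()}

print(infer(joint, variables, "W", {}))                           # P(W)
print(infer(joint, variables, "W", {"S": "winter"}))              # P(W | winter)
print(infer(joint, variables, "W", {"S": "winter", "T": "hot"}))  # P(W | winter, hot)
```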
class: middle
.grid[ .kol-1-2[
- ${\bf P}(W)$?
- ${\bf P}(W|\text{winter})$?
- ${\bf P}(W|\text{winter},\text{hot})$?
] .center.kol-1-2[
] ]
class: middle
- Inference by enumeration can be used to answer probabilistic queries for discrete variables (i.e., with a finite number of values).
- However, enumeration does not scale!
  - Assume a domain described by $n$ variables taking at most $d$ values.
  - Space complexity: $O(d^n)$
  - Time complexity: $O(d^n)$
.exercise[Can we reduce the size of the representation of the joint distribution?]
.center.grid[
.kol-1-3[
]
.kol-1-3[
${\bf P}(D | W)$
]
.kol-1-3[
?
]
]
More generally, any joint distribution can always be written as an incremental product of conditional distributions: $$P(x_1, ..., x_n) = \prod_{i=1}^n P(x_i | x_1, ..., x_{i-1}).$$
$A$ and $B$ are independent if and only if, for all $a$ and $b$:
- $P(a|b) = P(a)$, or
- $P(b|a) = P(b)$, or
- $P(a,b) = P(a)P(b)$
Independence is denoted as $A \perp B$.
???
... from the third expression, one can already notice that assuming independences leads to a factorization in which the factors are smaller.
class: middle
.center[
.width-40[]
.width-45[
]
]
The original 32-entry table reduces to one 8-entry and one 4-entry table (assuming 4 values for the weather variable and 2 values for each of the three other variables).
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
For $n$ independent binary variables (e.g., $n$ coin flips), the joint distribution factorizes into $n$ single-variable distributions, and its representation reduces from $2^n$ entries to $n$:
$$2^n \to n$$
$A$ and $B$ are conditionally independent given $C$ if and only if, for all $a$, $b$ and $c$:
- $P(a|b,c) = P(a|c)$, or
- $P(b|a,c) = P(b|c)$, or
- $P(a,b|c) = P(a|c)P(b|c)$
Conditional independence is denoted as $A \perp B \mid C$.
class: middle
- Using the chain rule, the joint distribution can be factored as a product of conditional distributions.
- Each conditional distribution may potentially be simplified by conditional independence.
- Conditional independence assertions allow probabilistic models to scale up.
class: middle
Assume three random variables $\text{Toothache}$, $\text{Catch}$ and $\text{Cavity}$, where $\text{Toothache}$ and $\text{Catch}$ are conditionally independent given $\text{Cavity}$.
In this case, the representation of the joint distribution reduces to $${\bf P}(\text{Toothache}, \text{Catch}, \text{Cavity}) = {\bf P}(\text{Toothache}|\text{Cavity}) {\bf P}(\text{Catch}|\text{Cavity}) {\bf P}(\text{Cavity}).$$
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
More generally, from the product rule, we have $${\bf P}(\text{Cause}, \text{Effect}_1, ..., \text{Effect}_n) = {\bf P}(\text{Effect}_1, ..., \text{Effect}_n | \text{Cause}) {\bf P}(\text{Cause}).$$
Assuming pairwise conditional independence between the effects given the cause, this factorizes as $${\bf P}(\text{Cause}, \text{Effect}_1, ..., \text{Effect}_n) = {\bf P}(\text{Cause}) \prod_i {\bf P}(\text{Effect}_i | \text{Cause}).$$
This probabilistic model is called a naive Bayes model.
- The complexity of this model is $O(n)$, instead of $O(2^n)$ without the conditional independence assumptions.
- Naive Bayes can work surprisingly well in practice, even when the assumptions are wrong (see the sketch below).
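Below is a minimal naive Bayes sketch in Python; the cause/effect names and the numbers are assumptions made for the illustration, not values from the lecture:

```python
# Naive Bayes: P(cause | effects) is proportional to P(cause) * prod_i P(effect_i | cause).
prior = {"cavity": 0.2, "no cavity": 0.8}                 # P(Cause), assumed
likelihoods = {                                           # P(Effect_i = true | Cause), assumed
    "toothache": {"cavity": 0.6, "no cavity": 0.1},
    "catch":     {"cavity": 0.9, "no cavity": 0.2},
}

def posterior(observed_effects):
    """Posterior over the cause, given a dict {effect: True/False}."""
    scores = {}
    for cause, p in prior.items():
        for effect, value in observed_effects.items():
            p_true = likelihoods[effect][cause]
            p *= p_true if value else (1.0 - p_true)
        scores[cause] = p
    Z = sum(scores.values())                              # normalization constant
    return {cause: s / Z for cause, s in scores.items()}

print(posterior({"toothache": True, "catch": True}))
```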
???
This is an important model you should know about!
class: middle, center, red-slide count: false
Study the next slide. .bold[Twice].
.grid[ .kol-2-3[
The product rule defines two ways to factor the joint distribution of two random variables: $$P(a, b) = P(a|b)P(b) = P(b|a)P(a).$$ Therefore, $$P(a|b) = \frac{P(b|a)P(a)}{P(b)}.$$
]
]
]
- $P(a)$ is the prior belief on $a$.
- $P(b)$ is the probability of the evidence $b$.
- $P(a|b)$ is the posterior belief on $a$, given the evidence $b$.
- $P(b|a)$ is the conditional probability of $b$ given $a$. Depending on the context, this term is called the likelihood.
???
Do it on the blackboard.
class: middle, center
.grid[ .kol-1-2[
]
.kol-1-2[.center.width-80[]]
]
Bayes' rule is the foundation of many AI systems.
???
Bayes rule/inference: emphasize that it is like Sherlock Holmes:
- Start from a set of possibilities (prior)
- Discard/weigh down those not compatible with the observations/evidence.
The Bayes' rule gives us a way to operationalize the update of our beliefs.
class: middle
- $P(\text{effect}|\text{cause})$ quantifies the relationship in the causal direction.
- $P(\text{cause}|\text{effect})$ describes the diagnostic direction.
Let
???
... or
class: middle
- Let us assume a random variable $G$ for the ghost location and a set of random variables $R_{i,j}$ for the individual readings.
- We start with a uniform prior distribution ${\bf P}(G)$ over ghost locations.
- We assume a sensor reading model ${\bf P}(R_{i,j}|G)$.
  - That is, we know what the sensors do.
  - $R_{i,j}$ = reading color measured at $[i,j]$
  - e.g., $P(R_{1,1}=\text{yellow}|G=[1,1])=0.1$
- Two readings are conditionally independent, given the ghost position.
???
This is a Naive Bayes model!
class: middle
- We can calculate the posterior distribution ${\bf P}(G|R_{i,j})$ using Bayes' rule: $${\bf P}(G|R_{i,j}) = \frac{ {\bf P}(R_{i,j}|G){\bf P}(G)}{ {\bf P}(R_{i,j})}.$$
- For the next reading $R_{i',j'}$, this posterior distribution becomes the prior distribution over ghost locations, which we update similarly (see the sketch below).
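A minimal sketch of this sequential update is given below; the grid size and the sensor model are simplifying assumptions, not the actual Ghostbusters parameters:

```python
# Sequential Bayesian update of the ghost posterior on a small grid.
GRID = [(i, j) for i in range(3) for j in range(3)]   # assumed 3x3 grid of cells
belief = {cell: 1.0 / len(GRID) for cell in GRID}     # uniform prior P(G)

def sensor_model(color, reading_cell, ghost_cell):
    """Assumed stand-in for P(R_{i,j} = color | G), based on Manhattan distance."""
    d = abs(reading_cell[0] - ghost_cell[0]) + abs(reading_cell[1] - ghost_cell[1])
    likely = {0: "red", 1: "orange", 2: "orange", 3: "yellow"}.get(d, "green")
    return 0.7 if color == likely else 0.1             # noisy reading (sums to 1 over 4 colors)

def update(belief, reading_cell, color):
    """One Bayes-rule step: posterior is likelihood times prior, then normalized."""
    posterior = {g: sensor_model(color, reading_cell, g) * p for g, p in belief.items()}
    Z = sum(posterior.values())
    return {g: p / Z for g, p in posterior.items()}

belief = update(belief, (0, 0), "yellow")   # first reading
belief = update(belief, (2, 1), "orange")   # the posterior becomes the new prior
print(max(belief, key=belief.get))
```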
class: middle, black-slide
.center[
.footnote[Image credits: CS188, UC Berkeley.]
???
What if we had chosen a different prior?
class: middle
class: middle
.center.width-90[]
.grid[
.kol-1-5.center[
SM with parameters
.width-100[]]
.kol-2-5.center[
Simulated observables
.width-80[]]
.kol-2-5.center[
Real observations
class: middle
Given some observation
class: middle
class: middle
- The joint probability distribution can answer any question about the domain.
- However, its representation can become intractably large as the number of variables grows.
- Independence and conditional independence reduce the number of probabilities that need to be specified in order to define the full joint distribution.
- These relationships can be represented explicitly in the form of a Bayesian network.
A Bayesian network is a directed acyclic graph (DAG) in which:
- Each node corresponds to a random variable.
- Can be observed or unobserved.
- Can be discrete or continuous.
- Each edge indicates a dependency relationship.
  - If there is an arrow from node $X$ to node $Y$, $X$ is said to be a parent of $Y$.
- Each node $X_i$ is annotated with a conditional probability distribution ${\bf P}(X_i | \text{parents}(X_i))$ that quantifies the effect of the parents on the node.
???
In the simplest case, conditional distributions are represented as conditional probability tables (CPTs).
class: middle
I am at work, neighbor John calls to say my alarm is ringing, but neighbor Mary does not call. Sometimes it's set off by minor earthquakes. Is there a burglar?
- Variables:
$\text{Burglar}$ ,$\text{Earthquake}$ ,$\text{Alarm}$ ,$\text{JohnCalls}$ ,$\text{MaryCalls}$ . - Network topology from "causal" knowledge:
- A burglar can set the alarm off
- An earthquake can set the alarm off
- The alarm can cause Mary to call
- The alarm can cause John to call
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
???
Blackboard: example of calculation, as in the next slide.
A Bayesian network implicitly encodes the full joint distribution as the product of the local conditional distributions: $$P(x_1, ..., x_n) = \prod_{i=1}^n P(x_i | \text{parents}(X_i)).$$
class: middle
Why does $P(x_1, ..., x_n) = \prod_{i=1}^n P(x_i | \text{parents}(X_i))$ hold?
- By the chain rule, $P(x_1, ..., x_n) = \prod_{i=1}^n P(x_i | x_1, ..., x_{i-1})$.
- Provided that we assume conditional independence of $X_i$ with its predecessors in the ordering given the parents, and provided $\text{parents}(X_i) \subseteq \{ X_1, ..., X_{i-1}\}$: $$P(x_i | x_1, ..., x_{i-1}) = P(x_i | \text{parents}(X_i))$$
- Therefore, $P(x_1, ..., x_n) = \prod_{i=1}^n P(x_i | \text{parents}(X_i))$ (a concrete evaluation is sketched below).
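As an illustration, the factored joint of the burglary network can be evaluated directly from its CPTs. The sketch below uses the classic textbook numbers for this example, which should be treated as assumed values here:

```python
# Evaluating P(x1, ..., xn) = prod_i P(x_i | parents(X_i)) on the burglary network.
P_B = {True: 0.001, False: 0.999}                       # P(Burglary)
P_E = {True: 0.002, False: 0.998}                       # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,         # P(Alarm=true | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                         # P(JohnCalls=true | A)
P_M = {True: 0.70, False: 0.01}                         # P(MaryCalls=true | A)

def joint(b, e, a, j, m):
    """Product of the local conditional distributions."""
    p = P_B[b] * P_E[e]
    p *= P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
    p *= P_J[a] if j else 1.0 - P_J[a]
    p *= P_M[a] if m else 1.0 - P_M[a]
    return p

# Probability that the alarm sounds and both neighbors call,
# with neither a burglary nor an earthquake:
print(joint(False, False, True, True, True))  # about 0.000628
```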
class: middle
.grid[
.kol-1-2[.width-90[]]
.kol-1-2[.width-100[
]]
]
The topology of the network encodes conditional independence assertions:
- $\text{Weather}$ is independent of the other variables.
- $\text{Toothache}$ and $\text{Catch}$ are conditionally independent given $\text{Cavity}$.
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
.grid.center[
.kol-1-3[.width-80[]]
.kol-2-3[.width-90[
]
]
]
.footnote[Image credits: CS188, UC Berkeley.]
.grid.center[
.kol-1-5[.width-60[]]
.kol-2-5[
]
]
???
Causal model
class: middle
.grid.center[
.kol-1-5[.width-60[]]
.kol-2-5[
]
]
.footnote[Image credits: CS188, UC Berkeley.]
???
Diagnostic model
Bayesian networks are correct representations of the domain only if each node is conditionally independent of its other predecessors in the node ordering, given its parents.
- Choose some ordering of the variables $X_1, ..., X_n$.
- For $i=1$ to $n$:
  - Add $X_i$ to the network.
  - Select a minimal set of parents from $X_1, ..., X_{i-1}$ such that $P(x_i | x_1, ..., x_{i-1}) = P(x_i | \text{parents}(X_i))$.
  - For each parent, insert a link from the parent to $X_i$.
  - Write down the CPT.
class: middle
.exercise[Do these networks represent the same distribution?]
???
For the left network:
- P(J|M) = P(J)? No
- P(A|J,M) = P(A|J)? P(A|J,M) = P(A)? No
- P(B|A,J,M) = P(B|A)? Yes
- P(B|A,J,M) = P(B)? No
- P(E|B,A,J,M) = P(E|A)? No
- P(E|B,A,J,M) = P(E|A,B)? Yes
class: middle
- A CPT for boolean $X_i$ with $k$ boolean parents has $2^k$ rows for the combinations of parent values.
- Each row requires one number $p$ for $X_i = \text{true}$.
  - The number for $X_i=\text{false}$ is just $1-p$.
- If each variable has no more than $k$ parents, the complete network requires $O(n \times 2^k)$ numbers.
  - i.e., grows linearly with $n$, vs. $O(2^n)$ for the full joint distribution.
- For the burglary net, we need $1+1+4+2+2=10$ numbers (vs. $2^5-1=31$); a quick check is sketched below.
- Compactness depends on the node ordering.
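A quick sanity check of this count (a sketch; it assumes, as above, that all variables are boolean, so a node with $k$ parents needs $2^k$ numbers):

```python
# Number of independent CPT entries for a network of boolean variables.
parents = {
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"],
}
print(sum(2 ** len(ps) for ps in parents.values()))  # 1 + 1 + 4 + 2 + 2 = 10
```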
Important question: Are two nodes independent given certain evidence?
- If yes, this can be proved using algebra (tedious).
- If no, this can be proved with a counter example.
.center[Example: Are
class: middle
.grid[
.kol-1-2[
Is
Counter-example:
- Low pressure causes rain causes traffic, high pressure causes no rain causes no traffic.
- In numbers:
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
.grid[
.kol-1-2[
Is
We say that the evidence along the cascade "blocks" the influence.
]
.kol-1-2.center[.width-100[]
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
.grid[ .kol-1-2[
Is
Counter-example:
- Project due causes both forums busy and lab full.
- In numbers:
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
.grid[
.kol-1-2[
Is
Observing the parent blocks the influence between the children.
]
.kol-1-2.center[.width-80[]
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
.grid[ .kol-1-2[
Are
- The ballgame and the rain cause traffic, but they are not correlated.
- (Prove it!)
Are
- Seeing traffic puts the rain and the ballgame in competition as explanation.
- This is backwards from the previous cases. Observing a child node activates influence between parents.
]
.kol-1-2.center[.width-80[
]
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
Let us assume a complete Bayesian network.
Are two variables $X$ and $Y$ (conditionally) independent, given a set of observed evidence variables?
Consider all (undirected) paths from $X$ to $Y$:
- If there is one or more active path, then independence is not guaranteed.
- Otherwise (i.e., all paths are inactive), then independence is guaranteed.
class: middle
.grid[ .kol-2-3[
A path is active if each of its consecutive triples is active (a brute-force checker is sketched below):
- Cascade $A \to B \to C$ where $B$ is unobserved (either direction).
- Common parent $A \leftarrow B \rightarrow C$ where $B$ is unobserved.
- v-structure $A \rightarrow B \leftarrow C$ where $B$ or one of its descendants is observed.
.footnote[Image credits: CS188, UC Berkeley.]
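The following sketch implements this check by brute force (an illustration, not the course's reference implementation): it enumerates the undirected paths between two nodes and tests every consecutive triple against the three rules above.

```python
# d-separation by brute force: X and Y are guaranteed independent given the
# observed set iff no undirected path between them is active.

def descendants(children, node):
    """All descendants of `node` in the DAG, given a node -> children map."""
    out, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

def triple_active(children, a, b, c, observed):
    """Is the consecutive triple a - b - c of a path active, given `observed`?"""
    ab, ba = b in children.get(a, ()), a in children.get(b, ())
    cb, bc = b in children.get(c, ()), c in children.get(b, ())
    if (ab and bc) or (cb and ba):                       # cascade (either direction)
        return b not in observed
    if ba and bc:                                        # common parent
        return b not in observed
    if ab and cb:                                        # v-structure
        return b in observed or bool(descendants(children, b) & set(observed))
    return False

def d_separated(children, x, y, observed):
    """True iff every undirected path between x and y is inactive."""
    def neighbors(n):
        return set(children.get(n, ())) | {p for p, cs in children.items() if n in cs}
    def paths(node, visited):
        if node == y:
            yield [node]
            return
        for n in neighbors(node) - visited:
            for rest in paths(n, visited | {n}):
                yield [node] + rest
    for path in paths(x, {x}):
        if len(path) == 2:
            return False                                 # a direct edge is always active
        if all(triple_active(children, path[i], path[i + 1], path[i + 2], observed)
               for i in range(len(path) - 2)):
            return False                                 # found an active path
    return True

# Example on the burglary network: B -> A <- E, A -> J, A -> M.
children = {"B": ["A"], "E": ["A"], "A": ["J", "M"]}
print(d_separated(children, "B", "E", set()))    # True: the v-structure is inactive
print(d_separated(children, "B", "E", {"J"}))    # False: a descendant of A is observed
```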
class: middle
.grid[ .kol-1-2[
- $L \perp T' | T$?
- $L \perp B$?
- $L \perp B|T$?
- $L \perp B|T'$?
- $L \perp B|T, R$?
]
.kol-1-2.width-80.center[]
]
???
- Yes
- Yes
- (maybe)
- (maybe)
- Yes
exclude: true class: middle
A node
exclude: true class: middle
A node
- When the network reflects the true causal patterns:
- Often more compact (nodes have fewer parents).
- Often easier to think about.
- Often easier to elicit from experts.
- But, Bayesian networks need not be causal.
- Sometimes no causal network exists over the domain (e.g., if variables are missing).
- Edges reflect correlation, not causation.
- What do the edges really mean then?
- Topology may happen to encode causal structure.
- Topology really encodes conditional independence.
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
.grid[
.kol-3-4[
- Correlation does not imply causation.
- Causes cannot be expressed in the language of probability theory.
]
.kol-1-4[.circle.width-100[
].center[Judea Pearl]] ]
class: middle
Philosophers have tried to define causation in terms of probability: $X$ causes $Y$ if observing $X=x$ increases the probability of $Y=y$, i.e. $P(Y=y|X=x) > P(Y=y)$.
However, the inequality only captures an observational (statistical) dependence between $X$ and $Y$, not a causal one.
???
- Instead, the expression means that if we observe $X=x$, then the probability of $Y=y$ increases.
- But this increase may come about for other reasons!
class: middle
The correct formulation should read $P(Y=y|\text{do}(X=x)) > P(Y=y)$, where $\text{do}(X=x)$ denotes an intervention that sets $X$ to $x$, rather than a passive observation.
class: middle
- The reading of a barometer is useful to predict rain: $$P(\text{rain}|\text{Barometer}=\text{high}) > P(\text{rain}|\text{Barometer}=\text{low})$$
- But hacking a barometer will not cause rain! $$P(\text{rain}|\text{Barometer hacked to high}) = P(\text{rain}|\text{Barometer hacked to low})$$
- Uncertainty arises because of laziness and ignorance. It is inescapable in complex non-deterministic or partially observable environments.
- Probabilistic reasoning provides a framework for managing our knowledge and beliefs, with Bayes' rule acting as the workhorse for inference.
- A Bayesian network specifies a full joint distribution; such networks are often exponentially smaller than an explicitly enumerated joint distribution.
- The structure of a Bayesian network encodes conditional independence assumptions between random variables.
class: end-slide, center count: false
The end.