L6: Second Winter, Reasoning Under Uncertainty Flashcards
What were the issues with expert systems?
- Classical formal logic is not suited for all problems.
- Performance issues on large knowledge bases.
- Expert elicitation—translating an expert’s knowledge to a computer—is very hard.
- Many domain experts don’t actually reason using “if-then” rules.
- Despite good PR, some academic “milestones” were not commercially viable.
Dendral is one such example: widely lauded, but it never made it out of the lab.
Why is there a need for uncertainty representation?
Sometimes problems are difficult to describe with full certainty.
May be due to, e.g.,
* Inherent randomness in outcomes;
* Unknowns/uncertainties in domain description;
* Existence of many special cases.
What are different methods for representing uncertainty?
- (Classical) probability theory
- Various extensions and generalisations of probability theory
* Belief functions (Dempster-Shafer theory);
* Interval probabilities;
* Credal sets.
- Fuzzy sets and logics
- Non-monotonic logics
What is a probability? What are the Bayesian and frequentist approaches?
Quantifies uncertainty about the occurrence of some event of interest.
- Bayesian probability: use subjective degree of belief P(A).
- Frequentist probability: use the limiting frequency of observations ω1, ω2, . . .
P(A) = lim_{n→+∞} (1/n) Σ_{i=1}^{n} 1_A(ωi),
where the indicator 1_A(ω) = 1 if ω ∈ A and 0 otherwise.
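A minimal sketch of the frequentist definition in simulation, assuming Python; the fair die and the event A = “even roll” are illustrative choices, not from the source. The relative frequency approaches P(A) = 1/2 as n grows.

```python
# Frequentist probability as a limiting relative frequency (illustrative die example).
import random

def frequency_estimate(n, event, draw):
    """Relative frequency of `event` over n independent draws: (1/n) * sum of 1_A(w_i)."""
    return sum(event(draw()) for _ in range(n)) / n

draw = lambda: random.randint(1, 6)   # one outcome w from Omega = {1,...,6}
is_even = lambda w: w % 2 == 0        # indicator function of the event A

for n in (100, 10_000, 1_000_000):
    print(n, frequency_estimate(n, is_even, draw))   # approaches P(A) = 0.5
```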
What is a sample space?
What is an event?
What do you call a singleton event?
What is an outcome?
A sample space is a non-empty set Ω (it represents the atomic things that can happen: outcomes).
An outcome ω ∈ Ω is a specific possible outcome (a singleton event {ω} is called an atomic event).
An event is a subset A of Ω (it describes something that can happen).
What is a sigma algebra over Ω?
A sigma-algebra over Ω is a set Σ of events such that
1. Contains sample space: Ω ∈ Σ
2. Closed under complements: If A ∈ Σ then also (Ω \ A) ∈ Σ
3. Closed under countable unions: If A1, A2, . . . ∈ Σ then also ∪i Ai ∈ Σ
This is the event space.
A sigma-algebra is thus a collection of subsets of Ω that always contains Ω itself and the empty set.
Σ captures the events about which a probability measure P can say something.
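A minimal sketch checking the three axioms on a finite Ω, assuming Python; on a finite set, closure under countable unions reduces to closure under pairwise unions. The two-coin sample space is an illustrative choice, not from the source.

```python
# Check the sigma-algebra axioms for a set system over a finite sample space.
from itertools import combinations

OMEGA = frozenset({"HH", "HT", "TH", "TT"})   # two coin flips (illustrative)

def is_sigma_algebra(sigma, omega=OMEGA):
    sigma = {frozenset(a) for a in sigma}
    if omega not in sigma:                                 # 1. contains Omega
        return False
    if any(omega - a not in sigma for a in sigma):         # 2. closed under complements
        return False
    return all(a | b in sigma for a, b in combinations(sigma, 2))  # 3. closed under unions

print(is_sigma_algebra([set(), OMEGA]))            # True: the trivial sigma-algebra
print(is_sigma_algebra([set(), OMEGA, {"HH"}]))    # False: complement of {"HH"} missing
```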
What is a measurable space?
The tuple (Ω, Σ) is called a measurable space. The sigma-algebra defines which subsets can be measured.
What is a probability measure?
A probability measure P : Σ → R on this space satisfies
* P(A) ≥ 0 for all A ∈ Σ
* P(Ω) = 1
* P(∪i Ai) = Σi P(Ai) for all pairwise disjoint A1, A2, . . . ∈ Σ (countable additivity).
What is a probability space?
The triple (Ω, Σ, P) is called a probability space
A probability space consists of three elements:
- A sample space, Ω, which is the set of all possible outcomes.
- An event space Σ, which is a set of events, an event being a set of outcomes in the sample space.
- A probability function P, which assigns each event in the event space a probability, which is a number between 0 and 1.
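A minimal sketch of a finite probability space, assuming Python: Σ is the power set of a die’s six outcomes and P sums uniform outcome weights; the three measure axioms are checked at the end. The fair die is an illustrative choice, not from the source.

```python
# A finite probability space (Omega, Sigma, P) for a fair die (illustrative).
from itertools import chain, combinations
from fractions import Fraction

OMEGA = frozenset(range(1, 7))                           # Omega = {1,...,6}
SIGMA = [frozenset(s) for s in chain.from_iterable(
    combinations(OMEGA, r) for r in range(len(OMEGA) + 1))]   # Sigma = power set

weight = {w: Fraction(1, 6) for w in OMEGA}              # uniform outcome weights

def P(A):
    """Probability measure: P(A) = sum of the weights of the outcomes in A."""
    return sum((weight[w] for w in A), Fraction(0))

assert all(P(A) >= 0 for A in SIGMA)                     # P(A) >= 0
assert P(OMEGA) == 1                                     # P(Omega) = 1
A, B = frozenset({1, 2}), frozenset({5})                 # disjoint events
assert P(A | B) == P(A) + P(B)                           # additivity on disjoint events
print(P(frozenset({2, 4, 6})))                           # -> 1/2
```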
What is a random variable?
Random variables are functions that map elements of the sample space Ω of a probability space into another outcome space (for us, often real-valued).
Let (Ω, Σ, P) be a probability space and let (W, F) be a measurable space.
A random variable (RV) X is a map X : Ω → W
What does it mean for an RV to be measurable?
This means that for all A ∈ F,
X⁻¹(A) = {ω ∈ Ω : X(ω) ∈ A} ∈ Σ
In words: for every target event A, the set of outcomes in the original sample space that X maps into A is itself an event in our event space (so we can assign it a probability!).
We need to be able to compute probabilities with random variables, and the only way to do this is to find pre-images of sets and compute probabilities with them. This requires that the pre-image of every (measurable) set be measurable, which is exactly the definition of a measurable function.
More precisely, suppose (Ω1, F1, P) is your probability space and (Ω2, F2) is your “target” space (usually taken to be R with its Borel sets). For X : Ω1 → Ω2 to be a random variable, we want to be able to measure the probability of events like {X ∈ A} where A ∈ F2. The way we do this is to find X⁻¹(A) and compute P(X⁻¹(A)). We can only do this if X⁻¹(A) ∈ F1, in other words, if X is a measurable function.
In short: X⁻¹(A) must be in our original sigma-algebra. For example, if F contains the event “two heads”, then the pre-image of this event also needs to be in Σ so we can assign a probability to it (see the sketch below).
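A minimal sketch of the pre-image computation on a finite Ω, assuming Python; X counts heads in two coin flips, matching the “two heads” example above. The spaces are illustrative, not from the source.

```python
# Pre-images X^{-1}(A) of target events, for X = number of heads (illustrative).
OMEGA = {"HH", "HT", "TH", "TT"}

def X(w):
    """RV X : Omega -> {0, 1, 2}, the number of heads."""
    return w.count("H")

def preimage(A):
    """X^{-1}(A) = {w in Omega : X(w) in A}."""
    return {w for w in OMEGA if X(w) in A}

# For the target event A = {2} ("two heads"), the pre-image is the Omega-event
# {"HH"}; measurability requires every such pre-image to lie in Sigma,
# so that P can be applied to it.
print(preimage({2}))   # {'HH'}
print(preimage({1}))   # {'HT', 'TH'}
```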
How does an RV induce a probability measure P_x?
P_X(A) = P(X⁻¹(A))
The probability of event A in the new measurable space equals the probability of its ‘parent’ event (its pre-image) in our original probability space.
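A minimal sketch of the induced (pushforward) measure, assuming Python and a uniform P over two fair coin flips (an illustrative setup, not from the source): P_X(A) is computed literally as P applied to the pre-image.

```python
# The induced measure P_X(A) = P(X^{-1}(A)) for X = number of heads (illustrative).
from fractions import Fraction

P = {w: Fraction(1, 4) for w in ("HH", "HT", "TH", "TT")}   # uniform P on Omega

def X(w):
    return w.count("H")                                     # number of heads

def P_X(A):
    """P_X(A) = P(X^{-1}(A)): push P forward through X."""
    return sum((p for w, p in P.items() if X(w) in A), Fraction(0))

print(P_X({1}))      # P({HT, TH}) = 1/2
print(P_X({0, 2}))   # P({HH, TT}) = 1/2
```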
What are multivariate probabilities?
Suppose we want to talk about multiple variables at once.
Simply take a sample space X = X1 × · · · × Xn
and some sufficiently expressive sigma-algebra F over X.
We take the RV X̃ : Ω → X and build the probability space (X, F, P_X).
It will be useful to denote the corresponding PMF as p(x1, . . . , xn) = P_X({(x1, . . . , xn)}) for all (x1, . . . , xn) ∈ X.
Whenever a collection of random variables is mentioned, they are ALWAYS assumed to be defined on the same sample space.
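A minimal sketch of a multivariate PMF, assuming Python: the joint p(X1, X2) over X = X1 × X2 is stored as a dictionary keyed by tuples (x1, x2), and a marginal is obtained by summing out the other variable. The numbers are illustrative, not from the source.

```python
# A joint PMF p(X1, X2) on {0,1} x {0,1} as a tuple-keyed dict (illustrative numbers).
p = {
    (0, 0): 0.3, (0, 1): 0.2,
    (1, 0): 0.1, (1, 1): 0.4,
}
assert abs(sum(p.values()) - 1.0) < 1e-12          # a PMF sums to 1

# Marginalise out X2: p(X1 = x1) = sum over x2 of p(x1, x2).
p_x1 = {x1: sum(v for (a, _), v in p.items() if a == x1) for x1 in (0, 1)}
print(p_x1)   # {0: 0.5, 1: 0.5}
```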
What is the equation for the conditional PMF of X1 and X2?
p(X1|X2) = p(X1,X2)/p(X2) if p(X2) > 0
In general, to condition a joint (‘and’) probability, divide the joint probability by the probability of the conditioning variables alone.
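A minimal sketch of conditioning on a concrete joint, assuming Python: `conditional` divides the row p(X1, X2 = x2) by the marginal p(X2 = x2), and is undefined when that marginal is 0. The joint is illustrative, not from the source.

```python
# Conditional PMF p(X1 | X2 = x2) = p(X1, X2 = x2) / p(X2 = x2) (illustrative joint).
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def conditional(p, x2):
    """Return p(X1 | X2 = x2) as a dict over x1; requires p(X2 = x2) > 0."""
    p_x2 = sum(v for (_, b), v in p.items() if b == x2)   # marginal p(X2 = x2)
    if p_x2 == 0:
        raise ValueError("conditional undefined: p(X2 = x2) = 0")
    return {a: v / p_x2 for (a, b), v in p.items() if b == x2}

print(conditional(p, 1))   # {0: 0.2/0.6 ~ 0.333, 1: 0.4/0.6 ~ 0.667}
```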
What is the chain rule for probabilities?
The chain rule for probabilities states that
p(X1, . . . , Xn) = ∏_{k=1}^{n} p(Xik | Xi1, . . . , Xik−1) for any ordering i1, . . . , in of the variables,
using the convention that for k = 1 (an empty conditioning set) we let p(Xik | Xi1, . . . , Xik−1) = p(Xi1).
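A minimal sketch verifying the chain rule cell by cell on a two-variable joint, assuming Python: every joint probability is recovered as p(x1) · p(x2 | x1). The joint is illustrative, not from the source.

```python
# Verify p(x1, x2) = p(x1) * p(x2 | x1) on every cell of a joint PMF (illustrative numbers).
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

for (x1, x2), joint in p.items():
    p_x1 = sum(v for (a, _), v in p.items() if a == x1)   # marginal p(x1)
    p_x2_given_x1 = joint / p_x1                          # conditional p(x2 | x1)
    assert abs(joint - p_x1 * p_x2_given_x1) < 1e-12      # chain rule holds
print("chain rule verified on every cell")
```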
What is probabilistic independence?
Probabilistic independence holds between variables X1, X2 iff their joint PMF factorises:
p(X1, X2) = p(X1)p(X2)
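A minimal sketch of an independence test, assuming Python: compute both marginals and check p(x1, x2) = p(x1) · p(x2) on every cell. Both joints below are illustrative, not from the source.

```python
# Test whether a joint PMF factorises into the product of its marginals (illustrative joints).
def is_independent(p, tol=1e-12):
    p1 = {x1: sum(v for (a, _), v in p.items() if a == x1) for (x1, _) in p}
    p2 = {x2: sum(v for (_, b), v in p.items() if b == x2) for (_, x2) in p}
    return all(abs(v - p1[x1] * p2[x2]) <= tol for (x1, x2), v in p.items())

print(is_independent({(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}))  # True
print(is_independent({(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}))      # False
```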
What is conditional independence?
Conditional independence between variables X1, X2 given X3 iff
p(X1, X2 | X3) = p(X1 | X3)p(X2 | X3)
We then write X1 ⊥ X2 | X3. (Note: if X1 is also independent of X4 given X3, we can write X1 ⊥ X2, X4 | X3.)
It follows that
X1 ⊥ X2 | X3 ⇒ p(X1 | X2, X3) = p(X1 | X3).
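A minimal sketch of a conditional-independence check, assuming Python: for every x3 with p(x3) > 0, verify p(x1, x2 | x3) = p(x1 | x3) · p(x2 | x3). The joint below makes X1 and X2 deterministic copies of X3, so they are dependent marginally but independent given X3; the numbers are illustrative, not from the source.

```python
# Check X1 _|_ X2 | X3 on a three-variable joint PMF (illustrative numbers).
from itertools import product

# X1 and X2 are both copies of a fair X3: dependent marginally, independent given X3.
p = {(a, b, c): (0.5 if a == c == b else 0.0) for a, b, c in product((0, 1), repeat=3)}

def cond_independent(p, tol=1e-12):
    for x3 in {k[2] for k in p}:
        p3 = sum(v for k, v in p.items() if k[2] == x3)            # p(x3)
        if p3 == 0:
            continue                                               # nothing to check
        for x1, x2 in product((0, 1), repeat=2):
            p12 = p.get((x1, x2, x3), 0.0) / p3                    # p(x1, x2 | x3)
            p1 = sum(v for k, v in p.items() if k[0] == x1 and k[2] == x3) / p3
            p2 = sum(v for k, v in p.items() if k[1] == x2 and k[2] == x3) / p3
            if abs(p12 - p1 * p2) > tol:
                return False
    return True

print(cond_independent(p))   # True: X1 _|_ X2 | X3
```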
If we are specifying a model over multiple RVs X1, . . . , Xn, how many numerical parameters do we need? Assume all variables are binary.
It takes 2^n − 1 numerical parameters.
If we have three variables, there are 2^3 = 8 possible combinations of 0s and 1s; we need to know the probability of 7 of them (the last can be calculated by subtracting the rest from 1).
How many parameters do we need if we assume all RVs are independent?
n (for each variable, the probabilities of 0 and 1 sum to 1, so one number per variable suffices).
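A minimal sketch of the parameter-count arithmetic, assuming Python: the full joint over n binary RVs needs 2^n − 1 free numbers, while full independence needs only n.

```python
# Parameters for n binary RVs: full joint (2^n - 1) vs. full independence (n).
for n in (3, 10, 30):
    print(f"n={n}: joint needs {2**n - 1}, independence needs {n}")
# n=3:  joint needs 7, independence needs 3
# n=30: joint needs 1073741823 (~10^9), independence needs 30
```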
What are Bayesian Networks?
Bayesian networks (BNs) were introduced by Judea Pearl in 1985.
* BNs are probabilistic graphical models.
* Encode conditional independencies between variables
Bayesian network encodes a PMF of variables X1, . . . , Xn.
Core of the representation is a directed acyclic graph (DAG) G = (V, E).
* The vertices are given by V = {X1, . . . , Xn}.
That is, the vertices of the graph correspond to the random variables in the model.
* Edges E indicate, roughly, “influence”/dependency.
The graph can often be specified from expert knowledge, because the presence of an edge only indicates (possible) dependence between variables.
BNs can be encoded efficiently on a computer.
What are conditional probability tables?
The collection of parents of Xi in G is denoted pa(Xi).
Definition
A family of conditional probability tables (CPTs) for a given DAG G = (V, E) is a map P(· | ·) that gives, for each Xi ∈ V, a conditional PMF P(Xi | pa(Xi)).
See tables on slide 62
CPTs are either determined from expert knowledge or estimated from data (see the sketch below).
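A minimal sketch of a family of CPTs for a toy DAG A → B, A → C, assuming Python; the structure and numbers are illustrative, not from the source (the course’s actual tables are on slide 62). Each variable stores P(X = 1 | pa(X)) keyed by the parent’s value, and the BN’s joint is the product of CPT entries.

```python
# CPTs for the toy DAG A -> B, A -> C (illustrative structure and numbers).
from itertools import product

P_A1 = 0.3                        # P(A = 1); A has no parents
P_B1_given_A = {0: 0.2, 1: 0.9}   # P(B = 1 | A = a)
P_C1_given_A = {0: 0.1, 1: 0.5}   # P(C = 1 | A = a)

def bern(p1, x):
    """P(X = x) from the table entry P(X = 1)."""
    return p1 if x == 1 else 1 - p1

def joint(a, b, c):
    """BN factorisation: p(a, b, c) = P(a) * P(b | a) * P(c | a)."""
    return bern(P_A1, a) * bern(P_B1_given_A[a], b) * bern(P_C1_given_A[a], c)

# The CPT entries define a valid joint PMF: it sums to 1 over all assignments.
assert abs(sum(joint(*x) for x in product((0, 1), repeat=3)) - 1.0) < 1e-12
print(joint(1, 1, 0))   # 0.3 * 0.9 * 0.5 = 0.135
```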
What is the complexity of specific inferences in a BN, such as direct marginalisation?
NP-hard in general (some smarter and more efficient algorithms exist).
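A minimal sketch of direct marginalisation by enumeration on the same toy A → B, A → C network (illustrative, redefined here so the snippet stays self-contained): p(B = 1) is obtained by summing the joint over all values of the other variables, which costs O(2^n) for n binary variables and is why smarter algorithms matter.

```python
# Direct marginalisation by enumeration: p(B = 1) in the toy A -> B, A -> C network.
from itertools import product

P_A1 = 0.3                        # illustrative CPT entries, as above
P_B1_given_A = {0: 0.2, 1: 0.9}
P_C1_given_A = {0: 0.1, 1: 0.5}

def bern(p1, x):
    return p1 if x == 1 else 1 - p1

def joint(a, b, c):
    return bern(P_A1, a) * bern(P_B1_given_A[a], b) * bern(P_C1_given_A[a], c)

# Sum the joint over every assignment of the non-query variables (2^(n-1) terms).
p_B1 = sum(joint(a, 1, c) for a, c in product((0, 1), repeat=2))
print(p_B1)   # 0.7*0.2 + 0.3*0.9 = 0.41
```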