Final: Reasoning under Uncertainty Flashcards
(L23) What is a random variable, its domain, and its characteristics?
- A variable whose value the agent can be uncertain about
- dom(X): set of values that X can take
- Mutually exclusive and exhaustive
(L23) What is the probability distribution of a random variable?
A probability distribution P on a random variable X is a function dom(X) → [0, 1] that maps each value x ∈ dom(X) to P(X = x)
(L23) Give examples of random variables?
(Discrete) Roll a die,
dom = {1, …, 6}
Probability distribution: uniform across all values of the die (each 1/6)
(Boolean) Raining outside
(L23) w ⊧ X=x
Variable X has value x in world w
(L23) What are the semantics of probability?
- µ(w): probability of world w
- ∑_{w ∈ W} µ(w) = 1
- P(f) = ∑_{w ⊧ f} µ(w) (sum the probabilities of the worlds where f is true)
(L23) Calculate the probability of proposition f given µ(w) for the set of possible worlds
P(f) = ∑_{w ⊧ f} µ(w) (sum the probabilities of the worlds where f is true)
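A minimal Python sketch of this computation, using a hypothetical set of worlds over two boolean variables with made-up probabilities:

```python
# Hypothetical possible worlds over two boolean variables (Cavity, Toothache),
# each with its probability mu(w); the numbers are made up for illustration.
worlds = [
    ({"Cavity": True,  "Toothache": True},  0.12),
    ({"Cavity": True,  "Toothache": False}, 0.08),
    ({"Cavity": False, "Toothache": True},  0.08),
    ({"Cavity": False, "Toothache": False}, 0.72),
]

def prob(f):
    """P(f) = sum of mu(w) over the worlds w where proposition f is true."""
    return sum(mu for w, mu in worlds if f(w))

print(round(prob(lambda w: w["Cavity"]), 3))                    # P(cavity) = 0.2
print(round(prob(lambda w: w["Cavity"] or w["Toothache"]), 3))  # P(cavity or toothache) = 0.28
```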
(L23) Define a joint probability distribution
- The joint distribution of a set of random variables is a probability distribution over the Cartesian product of their domains
- Sum of entries across the whole table is 1
(L24) Can you compute the probability distribution for each variable?
YES, by marginalization
(L24) Can you compute the probability distribution for any combination of variables?
YES, by marginalization we can compute the distribution over any subset of variables
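A minimal Python sketch of marginalization over a small hypothetical joint table (the values are illustrative, not from the slides):

```python
# Hypothetical joint distribution P(Cavity, Toothache) as a table of (cavity, toothache) entries.
joint = {
    (True,  True):  0.12,
    (True,  False): 0.08,
    (False, True):  0.08,
    (False, False): 0.72,
}

# Marginalization: P(Cavity = c) = sum over t of P(Cavity = c, Toothache = t)
p_cavity = {}
for (c, t), p in joint.items():
    p_cavity[c] = p_cavity.get(c, 0.0) + p

print({c: round(p, 3) for c, p in p_cavity.items()})  # {True: 0.2, False: 0.8}
```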
(L24) Can you update these probabilities if you know something about the variables?
YES, by conditioning on what is known (see normalization below)
(L24) DOWNSIDE of the joint probability distribution (JPD)
Size is exponential in the number of variables
(L24) What is normalization?
Rule out the worlds incompatible with the new known information e and divide the remaining probabilities by P(e) so that they again sum to 1
(L24) Prove the formula to compute conditional probability P(h | e) = P(h, e)/ P(e)
p.16
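A sketch of the standard derivation from the possible-world semantics (conditioning on e keeps only the worlds where e is true and renormalizes µ by P(e)):

```latex
\[
\mu_e(w) =
\begin{cases}
  \mu(w)/P(e) & \text{if } w \models e \\
  0           & \text{otherwise}
\end{cases}
\qquad\Rightarrow\qquad
P(h \mid e) = \sum_{w \models h} \mu_e(w)
            = \sum_{w \models h \wedge e} \frac{\mu(w)}{P(e)}
            = \frac{P(h \wedge e)}{P(e)}
\]
```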
(L24) Derive the Product Rule, Chain Rule and Bayes’ Rule
p.17
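A sketch of the three rules, each obtained from the definition P(h | e) = P(h, e) / P(e):

```latex
\[ P(h \wedge e) = P(h \mid e)\,P(e) \quad \text{(product rule)} \]
\[ P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1}) \quad \text{(chain rule, by repeated application of the product rule)} \]
\[ P(h \mid e) = \frac{P(e \mid h)\,P(h)}{P(e)} \quad \text{(Bayes' rule, from } P(h \mid e)\,P(e) = P(h \wedge e) = P(e \mid h)\,P(h)) \]
```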
(L24) Two ways to use conditional probability for inference
- Causal knowledge (from cause to evidence): P(evidence | cause/hypothesis)
- Evidential reasoning (from evidence to cause): P(hypothesis | evidence)
(L25) Define Marginal Independence + interpretation
For all x_i ∈ dom(X) and all y_k ∈ dom(Y), P(X = x_i | Y = y_k) = P(X = x_i) → P(X, Y) = P(X) * P(Y)
- New evidence Y (or X) does not affect current belief in X (or Y)
- e.g. a Joint Probability Distribution (JPD) requiring 2^4 = 16 entries is reduced to two smaller tables (2^3 = 8 and 2)
(L25) Define Conditional Independence + interpretation
For all x_i \in dom(X), y_k \in dom(Y), z_m \in dom(Z),
P(X = x_i | Y = y_k, Z = z_m) = P(X = x_i | Z = z_m)
Value of Y doesn’t affect the belief in the value of X, given the value of Z.
(L25) Space complexity of P(X1, …, Xn), with n independent random variables
By independence, P(X1, …, Xn) = P(X1) * … * P(Xn)
O(d^n) → O(n*d) space complexity
(L25) Two equivalent statements:
Given: Catch is conditionally independent of Toothache given Cavity.
1. P(tooth | catch, cavity) = P(tooth | cavity)
2. P(tooth, catch | cavity) = P(tooth | cavity) * P(catch | cavity)
(L25) How many values do we store in JPD, CPT?
Joint Probability Distribution (JPD): has d^n values
Since all of them sum to 1, we store d^n − 1 values
Conditional Probability Table (CPT), e.g. for P(X | Y_1, …, Y_{n−1}): has d^n values
Each row sums to 1, so we store all but 1 value per row
(L25) Uses of conditional independence
- Reduce the size of the joint distribution from exponential in n to linear in n, where n is the number of variables.
- Most basic and robust form of knowledge about uncertain environments.
(L26) What is a Belief Network?
Directed Acyclic Graph that effectively expresses independence assertions among random variables
(L26) What is constant and what changes for a system that collects evidence over time?
Constant:
- Environment
- Values of the evidence
- True causes of the evidence
Change:
- Set of evidence collected so far
- The agent's belief in the cause of the fault
(L26) Steps to build a Belief Network for a simple domain
- Apply Chain Rule
- Simplify according to marginal & conditional independence
- Express dependencies as a network
- It can be derived directly from the causal knowledge
(L26) State and draw the types of inferences (Burglary, Alarm, JohnCalls, Earthquake)
Diagnostic (from effect to cause): observe JohnCalls, update the belief in Burglary (JohnCalls → Alarm → Burglary)
Predictive (from cause to effect): observe Burglary, update the belief in JohnCalls (Burglary → Alarm → JohnCalls)
Inter-Causal (explaining away): with Alarm observed, evidence on Earthquake changes the belief in Burglary (Earthquake → Alarm ← Burglary)
(L26) Compute the representational saving in terms of number of probabilities required
- Compute the total number of joint distribution entries: (product of the domain sizes) − 1
- For every node, compute the number of values in its CPT, e.g. for P(C_1 | L_1, N): |dom(L_1)| * |dom(N)| * (|dom(C_1)| − 1)
- Repeat for every node and sum over all CPTs
- Calculate the representational saving.
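A worked illustration (not taken from the slides) using the classic Burglary/Earthquake/Alarm network with five boolean variables, where Alarm has parents Burglary and Earthquake, and JohnCalls and MaryCalls each have parent Alarm:

```latex
\[ \text{JPD: } 2^5 - 1 = 31 \text{ values} \]
\[ \text{CPTs: } \underbrace{1}_{P(B)} + \underbrace{1}_{P(E)} + \underbrace{4}_{P(A \mid B,E)} + \underbrace{2}_{P(J \mid A)} + \underbrace{2}_{P(M \mid A)} = 10 \text{ values} \]
\[ \text{Representational saving: } 31 - 10 = 21 \text{ values} \]
```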
(L26) How many rows are in a Conditional Probability Table for a boolean variable X with k parents?
2^k rows
(L26) Given n boolean variables, each with no more than k parents, how many values need to be stored?
O(n*2^k) values to be stored
(L26) For n boolean variables, how many values in the JPD need to be stored?
O(2^n)
(L27) Two assumptions of Noisy-OR distributions.
TWO ASSUMPTIONS
- All causes are listed (1)
- For each cause, whatever inhibits it from generating the effect is independent from the inhibitors of the other causes. (2)
(L27) Benefits of Noisy-OR distributions
- Far fewer probabilities need to be specified.
- Number of probabilities is linear in the number of parents.
(L27) Define Noisy-OR distributions.
- A model to describe the relation between a variable and its causes within one CPT of a Bayesian network
- Models non-interacting causes
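A minimal Python sketch of the Noisy-OR CPT, assuming boolean causes and made-up per-cause probabilities (p_i = probability that cause i alone produces the effect):

```python
def noisy_or(cause_probs, causes_present):
    """Noisy-OR: P(effect | causes) = 1 - product over the causes that are present of (1 - p_i).

    cause_probs: dict mapping cause name -> p_i (illustrative values, not from the slides)
    causes_present: set of cause names that are true
    """
    inhibition = 1.0
    for cause, p in cause_probs.items():
        if cause in causes_present:
            inhibition *= (1.0 - p)
    return 1.0 - inhibition

# Hypothetical example: P(Fever | Cold, Flu, Malaria) with made-up per-cause probabilities.
probs = {"Cold": 0.4, "Flu": 0.8, "Malaria": 0.9}
print(round(noisy_or(probs, {"Flu", "Malaria"}), 3))  # 1 - (1 - 0.8)(1 - 0.9) = 0.98
```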
(L27) Assumptions of Naïve Bayesian classifier
TWO ASSUMPTIONS
- The value of each attribute depends on the classification
- (Naïve) Attributes are independent of each other given the classification
(L27) Benefits of Naïve Bayesian classifier
- Fewer parameters to specify
- Easy to acquire (e.g. if you have a large collection of emails for which you know whether or not they are spam)
(L27) How do you classify using the Naïve Bayesian classifier?
Choose the most likely class given the set of observations: P(Spam = T| …) > P(Spam = F | …)
(L27) Definition of a Naïve Bayesian classifier
A very simple and successful BNet that allows us to classify entities in a set of classes, given a set of attributes.
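A minimal Python sketch of classification with a naïve Bayesian spam model; the prior and per-word conditional probabilities below are hypothetical:

```python
# Hypothetical parameters: prior P(Spam = T) and per-word conditionals P(word present | class).
p_spam = 0.3
p_word_given_spam     = {"free": 0.5, "money": 0.4, "meeting": 0.05}
p_word_given_not_spam = {"free": 0.1, "money": 0.05, "meeting": 0.3}

def score(is_spam, observed):
    """Unnormalized posterior: P(class) * product over attributes of P(attribute value | class)."""
    s = p_spam if is_spam else 1.0 - p_spam
    cond = p_word_given_spam if is_spam else p_word_given_not_spam
    for word, present in observed.items():
        s *= cond[word] if present else 1.0 - cond[word]
    return s

# Classify by choosing the most likely class given the observed attributes.
observed = {"free": True, "money": True, "meeting": False}
print("spam" if score(True, observed) > score(False, observed) else "not spam")  # spam
```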
(L29) Techniques to simplify variable elimination.
- Conditionally independent variables → all the variables of which the query is conditionally independent given the observations can be pruned from the BNet
- Unobserved leaf nodes → can be pruned
(L29) Carry out variable elimination on a BNet by using factor representation and using the factor operations.
p.28
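A minimal sketch of the two factor operations used in variable elimination (multiply and sum out), with factors over boolean variables stored as dicts from value tuples to numbers; the example network and numbers are made up:

```python
import itertools

def multiply(f_vars, f, g_vars, g):
    """Pointwise product of two factors; a factor is (variable order, dict of value tuples)."""
    out_vars = f_vars + [v for v in g_vars if v not in f_vars]
    out = {}
    for values in itertools.product([True, False], repeat=len(out_vars)):
        a = dict(zip(out_vars, values))
        out[values] = f[tuple(a[v] for v in f_vars)] * g[tuple(a[v] for v in g_vars)]
    return out_vars, out

def sum_out(var, f_vars, f):
    """Sum a variable out of a factor (marginalize it away)."""
    i = f_vars.index(var)
    out_vars = f_vars[:i] + f_vars[i + 1:]
    out = {}
    for values, p in f.items():
        key = values[:i] + values[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return out_vars, out

# Example: compute P(B) from factors for P(A) and P(B | A), with made-up numbers.
f_a   = {(True,): 0.6, (False,): 0.4}                               # factor for P(A)
f_b_a = {(True, True): 0.9, (False, True): 0.1,                     # factor for P(B | A), order (B, A)
         (True, False): 0.2, (False, False): 0.8}
prod_vars, prod = multiply(["A"], f_a, ["B", "A"], f_b_a)
print(sum_out("A", prod_vars, prod))                                # (['B'], {...}) -> the marginal P(B)
```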
(L30) What is a Markov Chain?
A sequence with a single random variable for each time slice; S_t represents the state at time t.
(L30) What is the Markov Assumption?
Each random variable depends only on the previous one.
P(S_{t+ 1} | S_0, …, S_t) = P(S_{t+1} | S_t)
(L30) What is the Stationary Process Assumption?
The mechanism that regulates how state variables change over time is stationary. It can be described by a single transitional model (a single CPT)
P(S_{t+1} | S_t) is the same for all t
(L30) Specify Markov Chain and compute the probability of a sequence of states
By the Markov and stationarity assumptions, we only need P(S_0) and the stochastic transition matrix P(S_{t+1} | S_t); the joint probability is given by:
P(S_0, …, S_t) = P(S_0) * P(S_1 | S_0) * P(S_2 | S_1) * … * P(S_{t-1} | S_{t-2}) * P(S_t | S_{t-1})
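A minimal Python sketch with a hypothetical two-state weather chain (the states, initial distribution, and transition matrix are made up):

```python
p0 = {"sunny": 0.8, "rainy": 0.2}                       # P(S_0)
transition = {                                          # P(S_{t+1} | S_t), the same for every t
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def sequence_prob(seq):
    """P(S_0, ..., S_t) = P(S_0) * product over t of P(S_t | S_{t-1})."""
    p = p0[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= transition[prev][cur]
    return p

print(round(sequence_prob(["sunny", "sunny", "rainy"]), 3))  # 0.8 * 0.9 * 0.1 = 0.072
```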
(L30) Describe the key problems in natural language processing
- Assign a probability to a sentence (Machine translation)
- Predict the next word (Speech recognition)
Assuming:
~ 10^5 words in a language
~ 10 words per sentence
→ We need probabilities for 10^50 sentences
(L30) Justify and apply Markov Chains to compute the probability of a Natural Language sentence
- Treat each word as a state and write down the probability of the sequence of states: P(w_1, …, w_n) = P(w_1) * ∏_i P(w_i | w_{i−1})
- Count sequential pairs of words (bigram counts) and count the occurrences of each word
- Estimate the bigram P(w_i | w_{i−1}) = Count(w_{i−1}, w_i) / Count(w_{i−1})
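A minimal Python sketch of bigram estimation and sentence probability on a tiny made-up corpus:

```python
from collections import Counter

# Tiny made-up corpus, just for illustration.
corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """Estimate P(word | prev) = Count(prev, word) / Count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(words):
    """P(w_1, ..., w_n) ~ P(w_1) * product over i of P(w_i | w_{i-1}) under the Markov assumption."""
    p = unigrams[words[0]] / len(corpus)
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(round(sentence_prob("the cat sat".split()), 3))  # 1/3 * 2/3 * 1/2 ~ 0.111
```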
What is a Hidden Markov Model (HMM)?
The reasoning system does not have access to the states but can make observations that give some information about the current state.
Specify the components of a Hidden Markov Model (HMM). (conditions, dynamics, sensor model representations and domains)
- Starts with a Markov Chain
- Adds a noisy observation about the state at each time step
P(S_0) specifies the initial conditions
P(S_t+1 | S_t) specifies the dynamics
P(O_t | S_t) specifies the sensor model
Size of P(S_0) = k
Size of P(S_{t+1} | S_t) = k * k
Size of P(O_t | S_t) = k * h
(where k is the number of states and h is the number of possible observations)
Can we prune in an HMM?
No. When we use a hidden Markov model, all of the observations are always known and the states never are, so even if we also know the actions taken, there is nothing to prune.
Justify and apply HMMs to problems such as robot localization
p.34
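A minimal Python sketch of HMM filtering (belief-state update) on a made-up 1-D corridor, in the spirit of robot localization; the dynamics, sensor model, and door positions are illustrative assumptions:

```python
# States: positions 0..3 in a tiny corridor; the belief over positions starts uniform.
n = 4
belief = [1.0 / n] * n

# Dynamics P(S_{t+1} | S_t): the robot moves right with prob 0.8, otherwise stays put.
def transition(s_next, s):
    if s_next == min(s + 1, n - 1):
        return 0.8 if s < n - 1 else 1.0
    return 0.2 if s_next == s and s < n - 1 else 0.0

# Sensor model P(O_t | S_t): a door is sensed correctly with prob 0.9; doors at positions 0 and 2.
doors = {0, 2}
def sensor(obs_door, s):
    return 0.9 if (s in doors) == obs_door else 0.1

def filter_step(belief, obs_door):
    """One HMM filtering step: predict with the dynamics, weight by the observation, normalize."""
    predicted = [sum(transition(s_next, s) * belief[s] for s in range(n)) for s_next in range(n)]
    updated = [sensor(obs_door, s) * predicted[s] for s in range(n)]
    z = sum(updated)
    return [p / z for p in updated]

belief = filter_step(belief, obs_door=True)
print([round(p, 3) for p in belief])  # posterior over positions after one move-and-sense step
```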
Compare and contrast stochastic single-stage (one-off) decisions vs multistage decisions
Single-stage decisions:
Set of one or more primitive decisions that can be treated as a single macro decision to be made before acting, with no prior observations.
Represented as a decision tree
- The agent has a set of decisions to make (a macro-action it can perform)
- Decisions can influence random variables
- Decisions have probability distributions over outcomes
Multistage decisions
- Set of one or more decisions, each of which can depend on the observations made so far
Define a Utility Function on possible worlds
- A utility function assigns to each possible world a number that measures how desirable that world is to the agent
- The expected utility of a decision is then the utilities of the possible worlds weighted by their probabilities (a weighted average)
How are decision nodes and utility nodes represented in single stage decision networks?
Decision nodes: the agent chooses the value (rectangles)
Utility node: Parents are variables on which the utility function depends (diamond)
Compute optimal decisions by Variable Elimination
- Create a factor for each conditional probability and for the utility
- Multiply factors and sum out all the random variables (this creates a factor that gives the expected utility for each decision)
- Choose the decision with the maximum value in the factor
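A minimal Python sketch of this procedure for a single decision node and one random variable, using a hypothetical umbrella/weather example with made-up numbers:

```python
# Hypothetical single-stage decision: take the umbrella or not, with uncertain weather.
p_rain = 0.25                                  # P(Weather = rain)
utility = {                                    # U(Weather, TakeUmbrella)
    ("rain", True): 70, ("rain", False): 0,
    ("sun",  True): 20, ("sun",  False): 100,
}

def expected_utility(take_umbrella):
    """Sum out the random variable: EU(decision) = sum over w of P(w) * U(w, decision)."""
    return (p_rain * utility[("rain", take_umbrella)]
            + (1 - p_rain) * utility[("sun", take_umbrella)])

# Choose the decision with the maximum value in the resulting factor (maximum expected utility).
best = max([True, False], key=expected_utility)
print(best, expected_utility(best))  # False 75.0
```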