Unit 4: Graphical Models and Bayesian Networks Flashcards
Major problem with decision trees, SVMs and NNs
How to handle uncertainty and unobserved variables, which require probabilistic reasoning.
E.g. how do we deal with uncertainty in our input variables, or quantify the confidence we have in the output?
Undirected graphical models
A.k.a. Markov Random Fields (MRFs) or Markov Networks.
They represent the relationships between variables using undirected edges. The independence properties between the variables can be easily read off the graph through simple graph separation (missing edges / arcs).
Directed graphical models
A.k.a. Bayesian network or Belief network (BNs).
They are directed acyclic graphs (DAGs).
They consist of a set of nodes, together with a set of directed edges.
Each directed edge is a link from one node to another, with the direction being important.
The edges must not form any directed cycles.
Moralisation
A Bayesian network can be converted into an MRF by connecting all the parents of each common child with edges ("marrying" the parents) and then dropping the directions on all edges.
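A minimal sketch of moralisation, assuming the DAG is given as a dict mapping each node to a list of its parents (an illustrative representation, not part of the original card):

```python
from itertools import combinations

def moralise(parents):
    """Return the undirected edge set of the moral graph of a DAG."""
    edges = set()
    for child, pars in parents.items():
        # Drop directions: keep an undirected edge from each parent to the child.
        for p in pars:
            edges.add(frozenset((p, child)))
        # "Marry" the parents: connect every pair of parents of a common child.
        for p, q in combinations(pars, 2):
            edges.add(frozenset((p, q)))
    return edges

# Example: the collider A -> C <- B moralises to the triangle A-B-C.
print({tuple(sorted(e)) for e in moralise({"A": [], "B": [], "C": ["A", "B"]})})
```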
Maximum a posteriori query
(MAP query)
“What is the most likely explanation for some evidence?”
I.e. what is the value of the node X that maximises the probability that we would have seen this evidence.
“What value of x maximises P(x | e)?”
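A minimal illustrative sketch (with invented numbers) of answering a MAP query once the posterior over X has been computed:

```python
# P(X = x | e) for a hypothetical query variable X (invented values).
posterior = {"flu": 0.6, "cold": 0.3, "allergy": 0.1}

# The MAP answer is the value of X with the highest posterior probability.
map_value = max(posterior, key=posterior.get)
print(map_value)   # "flu"
```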
Bayesian networks
(a.k.a. Belief networks)
Provide a graphical representation of a domain that offers an intuitive way to model conditional independence relationships and handle uncertainty.
They give a compact specification of the full joint probability distribution.
3 Components of Bayesian networks
- A finite set of variables.
- A set of directed edges between the nodes that together form an acyclic graph.
- A conditional probability distribution associated with each node P(Xᵢ | Parents( Xᵢ )). This is typically represented in a conditional probability table (CPT) indexed by values of the parent variables.
An uninstantiated node in a Bayesian Network
A node whose value is unknown.
Conditional independence
A and B are conditionally independent given C
if and only if
P(A ∩ B | C) = P(A | C) · P(B | C)
Denoted
( A ⟂ B ) | C
D-separation
Provides that:
If two variables, A & B, are d-separated given some evidence e, then they are conditionally independent given that evidence.
P(A | B, e) = P(A | e)
E.g. in the chain A → B → C, A and C are d-separated given B, so P(C | A, B) = P(C | B).
Chain rule
The joint probability distribution is the product of all the conditional probability distributions stored at the nodes of the graph.
P(X₁, …, Xₙ) = ∏ᵢ P(Xᵢ | Parents(Xᵢ))
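A minimal sketch of the chain rule for a hypothetical network A → B → C with invented CPT values:

```python
# CPTs for the hypothetical network A -> B -> C (numbers are made up).
P_A = {True: 0.3, False: 0.7}
P_B_given_A = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}
P_C_given_B = {True: {True: 0.5, False: 0.5}, False: {True: 0.4, False: 0.6}}

def joint(a, b, c):
    # P(A, B, C) = P(A) * P(B | A) * P(C | B)
    return P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c]

print(joint(True, True, False))   # 0.3 * 0.9 * 0.5 = 0.135
```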
Parents of node Xᵢ
The nodes that have edges into Xᵢ.
Hybrid networks
Networks that contain both discrete and continuous variables.
Clique
A clique of a graph is a subset of its nodes such that every two distinct nodes in the clique are adjacent.
Graph separation in an undirected graph
X_A is conditionally independent of X_C given X_B if there is no path from any a ∈ A to any c ∈ C after removing all variables in B.
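A minimal sketch of checking graph separation, assuming the undirected graph is given as a dict from each node to its set of neighbours (an illustrative representation):

```python
def separated(graph, A, C, B):
    """True if every path from A to C passes through B (i.e. A is separated from C by B)."""
    # Remove the nodes in B, then breadth-first search outwards from A.
    remaining = {n: nbrs - B for n, nbrs in graph.items() if n not in B}
    frontier = set(A) - B
    seen = set(frontier)
    while frontier:
        nxt = set().union(*(remaining.get(n, set()) for n in frontier))
        frontier = nxt - seen
        seen = seen | nxt
    return seen.isdisjoint(C)

# Chain a - b - c: a and c are separated given {b}, but not given the empty set.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(separated(graph, {"a"}, {"c"}, {"b"}), separated(graph, {"a"}, {"c"}, set()))
```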
Markov blanket (undirected graph)
The Markov blanket of a variable is precisely its neighbours in the graph.
Boltzmann Machine
Boltzmann machines learn a probability distribution from the input data so that new samples can be generated from the same distribution.
A Boltzmann machine is a fully connected graph with pairwise potentials on binary-valued nodes.
It has a hidden layer of nodes and a visible layer of nodes.
Restricted Boltzmann Machines
A class of Boltzmann machine with a single hidden layer and bipartite connectivity.
This means that every node in the visible layer is connected to every node in the hidden layer, but nodes within the same layer are not connected to each other.
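A minimal sketch of the standard RBM energy function E(v, h) = -aᵀv - bᵀh - vᵀWh, where the single weight matrix W only connects visible to hidden units (illustrative sizes and random weights):

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 4, 3
W = rng.normal(size=(n_visible, n_hidden))   # visible-to-hidden weights only:
a = np.zeros(n_visible)                      # no visible-visible or
b = np.zeros(n_hidden)                       # hidden-hidden connections.

def energy(v, h):
    # E(v, h) = -a.v - b.h - v.W.h  (lower energy = higher probability)
    return -a @ v - b @ h - v @ W @ h

v = np.array([1, 0, 1, 1])   # binary visible units
h = np.array([0, 1, 1])      # binary hidden units
print(energy(v, h))
```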
Ising model
A mathematical model of ferromagnetism in statistical mechanics invented by Wilhelm Lenz (1920), who gave it as a problem to his student Ernst Ising.
Hopfield network
A fully connected Ising model with a symmetric weight matrix.
The main application of Hopfield networks is as an associative memory or content addressable memory.
We train the network on a set of fully observed bit vectors we wish to learn. The weights of the network and bias terms can be learned from the training data using (approximate) maximum likelihood.
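A minimal sketch of a Hopfield network used as an associative memory. For simplicity it uses the classic Hebbian outer-product rule rather than maximum likelihood; patterns are ±1 vectors:

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian learning: W is the average of the outer products of the patterns."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)            # no self-connections
    return W / patterns.shape[0]

def recall(W, state, steps=10):
    """Iteratively update the state until it (hopefully) settles on a stored pattern."""
    for _ in range(steps):
        state = np.sign(W @ state)
        state[state == 0] = 1
    return state

patterns = np.array([[1, -1, 1, -1, 1], [1, 1, -1, -1, 1]])
W = train_hopfield(patterns)
noisy = np.array([1, -1, 1, -1, -1])  # corrupted copy of the first pattern
print(recall(W, noisy))               # recovers the first stored pattern
```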
Learning in fully observable Bayesian Networks
Known Structure
To learn the CPTs of a Bayesian network, we require a dataset D where D is made up of a set of cases or samples.
Each sample is a vector of values, one for each variable in the model.
The task is to find the model M that maximises P(D | M) (the probability of the data given the model).
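A minimal sketch of maximum-likelihood CPT estimation by counting, for a hypothetical binary variable X with a single binary parent Pa (illustrative data):

```python
from collections import Counter

# Fully observed dataset D: each case assigns a value to every variable.
D = [
    {"Pa": True,  "X": True},
    {"Pa": True,  "X": False},
    {"Pa": True,  "X": True},
    {"Pa": False, "X": False},
]

joint_counts = Counter((d["Pa"], d["X"]) for d in D)
parent_counts = Counter(d["Pa"] for d in D)

# Maximum-likelihood estimate of P(X = true | Pa = pa) is the relative frequency.
for pa in (True, False):
    print(pa, joint_counts[(pa, True)] / parent_counts[pa])
```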
Bayesian Correction
A correction factor made by adding 1 to the count in the numerator and a value m to the denominator, where m is the number of different values the variable can take on.
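A minimal sketch of the correction applied to a single CPT entry (names are illustrative):

```python
def corrected_estimate(count_x_and_parents, count_parents, m):
    # Plain maximum likelihood, count_x_and_parents / count_parents, gives
    # probability 0 to value combinations never seen in the data. Adding 1 to
    # the numerator and m (the number of possible values of X) to the
    # denominator avoids these zero estimates.
    return (count_x_and_parents + 1) / (count_parents + m)

# E.g. a binary X (m = 2) never observed true in 10 matching cases:
print(corrected_estimate(0, 10, 2))   # 1/12 instead of 0
```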
Inference by Stochastic Simulation
- Exact inference in a Bayesian network requires marginalising latent variables from the joint distribution.
- Marginalisation becomes quickly intractable in large networks.
- An alternative is to use approximate inference methods such as stochastic simulation (sampling).
Direct sampling
Used when there is no evidence instantiated in the network (see the sketch after this list):
- Sample the parent (root) nodes from their priors.
- Sample children from their conditional probabilities P(X | Parents(X)).
- Continue until no variable is left.
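A minimal sketch of direct (ancestral) sampling for a hypothetical two-node network Rain → WetGrass with invented CPT values:

```python
import random

def direct_sample():
    # Sample the parent first, then the child from P(WetGrass | Rain).
    rain = random.random() < 0.2
    wet_grass = random.random() < (0.9 if rain else 0.1)
    return rain, wet_grass

# Estimate P(WetGrass = true) from 10,000 samples.
samples = [direct_sample() for _ in range(10_000)]
print(sum(wg for _, wg in samples) / len(samples))   # ≈ 0.2*0.9 + 0.8*0.1 = 0.26
```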
Main assumption made by direct sampling
The approach assumes no evidence is available in the network, so it cannot be used to compute posterior probabilities.
(A solution is to use rejection sampling, sketched below.)
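A minimal sketch of rejection sampling for the same hypothetical Rain → WetGrass network as above: generate direct samples and discard those that contradict the evidence.

```python
import random

def direct_sample():
    rain = random.random() < 0.2
    wet_grass = random.random() < (0.9 if rain else 0.1)
    return rain, wet_grass

# Estimate P(Rain = true | WetGrass = true): keep only samples where the
# evidence WetGrass = true holds, then count how often Rain is true among them.
kept = [rain for rain, wet in (direct_sample() for _ in range(10_000)) if wet]
print(sum(kept) / len(kept))   # ≈ 0.18 / 0.26 ≈ 0.69
```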