ML Flashcards
Unsupervised Learning and how is the model trained
Only input data is provided; the model learns to extract patterns from the data
Supervised Learning and how is the model trained
Each input is paired with an output (target) and the model is trained to minimise the error between its predictions and the targets
Regression
Predicting a continuous (real-valued) output from the inputs, e.g. fitting a line in the linear case
Classification
Predicting category labels, e.g. dog or bagel
Underfitting
A model that is too simple to capture the structure of the data
Overfitting
A model that fits minor variations or noise in the training data rather than the underlying trend
Model Selection and how does it work
Selecting the most appropriate model for the task
Split the dataset into training and validation (and test) sets and compare candidate models on the validation set
Training dataset
Used to train/optimise the model
Validation dataset
Used to compare candidate models/hyperparameters and select the best one
Test dataset
Used for a final, unbiased assessment of how well the chosen model generalises
Cross-validation
Split the data into S groups; train on S-1 groups (so (S-1)/S of the data is used for training) and validate on the held-out group, rotating so each group serves as the validation set once, then average the S scores
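A minimal numpy sketch of S-fold splitting (the helper name cross_val_splits and the toy sizes are illustrative):

```python
import numpy as np

def cross_val_splits(n, S, seed=0):
    """Yield (train_idx, val_idx) pairs for S-fold cross-validation:
    each fold validates on 1/S of the data and trains on the other (S-1)/S."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, S)
    for s in range(S):
        val = folds[s]
        train = np.concatenate([folds[j] for j in range(S) if j != s])
        yield train, val

for train, val in cross_val_splits(10, 5):
    print(len(train), len(val))   # 8 train, 2 validation in each of the 5 folds
```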
No free lunch theorem
Averaged over all possible problems, no learning algorithm performs better than any other; there is no single best model for every task
Model parameters
Values learned from training data
Parametric model
# of parameters stays the same as the quantity of data increases
Non-parametric model
# of parameters grows with the quantity of data
Likelihood function
Probability of data given model parameters
Maximum likelihood estimation
Method for estimating the parameters of a probabilistic model by choosing the values that maximise the likelihood of the observed data
Linear regression model formula
y = w^T x + e, so that p(y|x,w) = N(y | w^T x, σ²)
What is the distribution of e in the linear regression model and what is the bias parameter in the linear regression model formula and what is it for
Gaussian distribution with mean 0 and variance σ²: e ~ N(0, σ²)
The bias lives within w: a dummy variable which always has value 1 is appended to the input, and its weight shifts the fit, giving extra flexibility to fit the data
Linear regression to a feature vector
y = w^T ϕ(x) + e, so that p(y|ϕ(x),w) = N(y | w^T ϕ(x), σ²)
Least-squares problem for linear regression formula
w = (X^T X)^-1 X^T y
Two key points about least-squares solutions
The solution has a closed-form
The solution is also the maximum likelihood solution
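A minimal numpy sketch of the closed-form solution on toy data (the true weights and noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 2 * x + 1 + rng.normal(0, 0.1, 50)      # toy data: y = 2x + 1 + noise

X = np.column_stack([np.ones_like(x), x])   # design matrix with dummy 1s column

# Closed-form least-squares / maximum likelihood solution w = (X^T X)^-1 X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)       # solve() is more stable than inv()
print(w)                                    # approximately [1, 2]
```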
What does the Linear Discriminant compute (formula) and what class is x assigned to according to y
y=w^T x
y>=0 means C1
Otherwise C2
What assumptions are made for parameters to be learnt by applying MLE?
- Data for each class have a Gaussian distribution
- These two Gaussian distributions have the same covariance matrix
Logistic sigmoid function
sigmoid(a) = 1/(1+e^-a)
Logistic regression model for two classes
p(C1|x) = sigmoid(w^T x)
p(C2|x) = 1 - p(C1|x)
How can the MLE parameters for logistic regression be found?
Using an iterative method such as gradient descent; there is no closed-form solution
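A minimal sketch of logistic regression trained by gradient descent on made-up separable data (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, (100, 2)), np.ones(100)])  # dummy 1s for bias
t = (X[:, 0] + X[:, 1] > 0).astype(float)                        # labels in {0, 1}

w = np.zeros(3)
for _ in range(1000):
    p = sigmoid(X @ w)              # p(C1|x) for every point
    grad = X.T @ (p - t) / len(t)   # gradient of the negative log-likelihood
    w -= 0.1 * grad                 # gradient descent step
```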
Weights
Parameters that transform input data within each neuron
Activation function
Determines whether a neuron should be activated and how to transform the input signal into an output signal
Input layer and how many neurons
Consists of neurons that receive the input data and pass it to the next layer; # of neurons = # of features in the input dataset
Hidden Layer
Layer of neurons between input and output which processes input data using weights and activation functions
Output layer
Final layer that produces result/prediction
Cost/Loss function
Measures difference between predicted and true output
Forward pass process
1) Compute the pre-activations of the hidden layer, a
2) Pass the result of step 1 through a nonlinear function, e.g. sigmoid, to get the hidden activations h
3) Use h to compute the pre-activations of the output layer, o
4) Compute the predictions, e.g. as the sigmoid of o
Backpropagation / backward pass and formula and what do the symbols represent
Algorithm that computes the gradient of the loss function with respect to each weight by propagating error signals backwards through the layers, so that weights across multiple layers can be updated using the derivative of the error wrt each weight
formula is δj = h′(aj) Σk wkj δk
δj = error signal for the jth hidden unit
wkj = weight connecting hidden unit j to output unit k
h′(aj) = derivative of the activation function at the pre-activation aj
δk = error signal at the kth output unit
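A minimal numpy sketch of one forward and one backward pass through a single hidden layer, matching the forward-pass steps and the δ formula above (the layer sizes, sigmoid activations and squared-error loss are illustrative choices):

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input
t = np.array([1.0])               # target
W1 = rng.normal(size=(4, 3))      # input -> hidden weights
W2 = rng.normal(size=(1, 4))      # hidden -> output weights

# Forward pass
a = W1 @ x                        # 1) hidden pre-activations
h = sigmoid(a)                    # 2) nonlinearity
o = W2 @ h                        # 3) output pre-activations
y = sigmoid(o)                    # 4) predictions

# Backward pass (squared-error loss)
delta_out = (y - t) * y * (1 - y)              # δk at the output unit
delta_hid = h * (1 - h) * (W2.T @ delta_out)   # δj = h'(aj) Σk wkj δk
grad_W2 = np.outer(delta_out, h)               # gradient of the loss wrt W2
grad_W1 = np.outer(delta_hid, x)               # gradient of the loss wrt W1
```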
Vanishing gradient problem
When gradients used to update the weights during backpropagation become too small which slows down/stops learning process
Exploding gradient problem
When gradients grow too large during backpropagation, causing weights to become large and degrading model performance
Gradient clipping and formula and what do the variables mean
Used to cap the magnitude of the gradient:
g′ = min(1, c/||g||) g
g = gradient vector, ||g|| = its norm, c = constant value to limit the size of the gradient
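A minimal sketch of the clipping formula (the function name and threshold are illustrative):

```python
import numpy as np

def clip_gradient(g, c):
    """g' = min(1, c/||g||) * g: rescales g so its norm never exceeds c."""
    norm = np.linalg.norm(g)
    return min(1.0, c / norm) * g if norm > 0 else g

g = np.array([3.0, 4.0])        # ||g|| = 5
print(clip_gradient(g, 1.0))    # rescaled to norm 1: [0.6 0.8]
```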
Residual network and formula
Each layer is a residual layer: an identity shortcut lets gradients flow directly from the layer's output back to its input, defined as:
F1′ = F1(x) + x, where F1(x) is the standard mapping (linear transformation and then activation)
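A minimal sketch of a residual layer (the ReLU activation and square weight matrix are assumptions so that F1(x) and x have matching shapes):

```python
import numpy as np

def residual_layer(x, W):
    """F1'(x) = F1(x) + x: the identity shortcut lets gradients flow past F1."""
    return np.maximum(0, W @ x) + x   # F1 = linear transformation, then activation

x = np.ones(4)
W = np.eye(4) * 0.1
print(residual_layer(x, W))           # F1(x) plus the shortcut copy of x
```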
Non-saturating activation function
Activation functions that don’t compress inputs into a limited range, avoiding vanishing gradients
Parameter initialisation significance
Necessary to choose good initial parameters to avoid issues with gradient descent
Early stopping
Process of creating a validation set and stopping training once the error on the validation set starts to increase
Weight decay formula
Adds a term λ w^T w to the loss function, where the user chooses a value λ > 0, to penalise large weights
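A minimal sketch of adding the penalty to a squared-error loss (the function name is illustrative):

```python
import numpy as np

def loss_with_weight_decay(w, X, y, lam):
    residual = X @ w - y
    return residual @ residual + lam * (w @ w)   # data term + lambda * w^T w
```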
Dropout and pro
Turning off all outgoing connections from a unit with some given probability during training, so that each unit learns to perform well without depending on other units
This can drastically reduce overfitting
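A minimal sketch of "inverted" dropout, a common variant that rescales the surviving units so no change is needed at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop, training=True):
    if not training:
        return h                          # dropout is only applied during training
    mask = rng.random(h.shape) >= p_drop  # keep each unit with prob 1 - p_drop
    return h * mask / (1 - p_drop)        # rescale so the expected output is unchanged

h = np.ones(8)
print(dropout(h, 0.5))   # roughly half the units zeroed, survivors scaled by 2
```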
Classification tree
Decision tree where each internal node represents decision based on a feature and each leaf represents a class label
Regression tree
Decision tree where each internal node represents a decision based on a feature and each leaf node represents a predicted value
Are trees parametric or non-parametric?
Non-parametric
CART (Classification And Regression Trees) algorithm
Greedy algorithm for learning the tree structure:
1) Start from root node
2) Run an exhaustive search over each possible variable and threshold for a new node (see the sketch after this card):
For each variable and threshold:
a. Compute average of target variable for each leaf of the proposed node
b. Compute error if we stop adding nodes here
3) Choose variable and threshold that minimise the error
4) Add a new node for the chosen variable and threshold
5) Repeat from step 2 until each leaf node has at most n data points associated with it
6) Prune back the tree to remove branches that do not reduce the error by more than a small tolerance value, ε
Pruning:
1) Start with a tree T0
2) Consider pruning each node in T0 by combining branches to obtain tree T
3) Compute C(T) = (Sum from t=1 to |T|) et(T) + lambda |T| where
- |T| is # of leaves
- et(T) is error associated with t’th leaf of tree
- lambda is a penalty added for the # of leaves in the tree
4) If C(T) <= C(T0) keep the pruned tree, otherwise reinstate the pruned node
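A minimal sketch of step 2's exhaustive split search for a single variable in a regression tree (the toy data are made up; a real implementation loops over all variables too):

```python
import numpy as np

def best_split(x, y):
    """For each candidate threshold, predict each leaf's mean target value
    and measure the squared error; return the threshold minimising it."""
    best_threshold, best_err = None, np.inf
    for threshold in np.unique(x):
        left, right = y[x <= threshold], y[x > threshold]
        if len(left) == 0 or len(right) == 0:
            continue                      # skip splits that leave a leaf empty
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best_err:
            best_threshold, best_err = threshold, err
    return best_threshold, best_err

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.0, 5.2, 4.8])
print(best_split(x, y))   # splits at x <= 3, separating the two groups
```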
Kernel function and example
Used to compute a dot product between two data points, e.g.
For two vectors x and x′, a possible kernel function is K(x,x′) = ϕ(x)·ϕ(x′), where ϕ is a mapping to a higher-dimensional space
Kernel trick
Predictions only require kernel function values, so we evaluate k(x1, x) directly without first computing ϕ(x1) and ϕ(x) and then taking their scalar product
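A small check of the trick for the quadratic kernel k(x, x′) = (x·x′)² in 2-D, whose explicit feature map ϕ is known (the example vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel in 2-D."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, xp):
    """Kernel trick: the same dot product without ever computing phi."""
    return (x @ xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(phi(x) @ phi(xp))   # map to feature space, then dot product: 16.0
print(k(x, xp))           # kernel evaluated directly on the inputs: 16.0
```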
Dual parameter
Indicates how much each training sample contributes to the decision boundary
Gram matrix
Defined to be K = ΦΦ^T, where Φ is the design matrix of feature vectors, so Knm = ϕ(xn)^T ϕ(xm), the similarity between the nth and mth datapoints
Margin
Distance from decision boundary (hyperplane) to closest training datapoint
Support vectors
Data points that lie closest to decision boundary (margin)
Soft margin
Allows points to lie on wrong side of hyperplane in exchange for a penalty which is controlled by a regularisation parameter
Slack variables
Measure how far points that violate the margin (including misclassified points) lie from their margin boundary
How can SVMs be extended to more than two classes? (for k classes)
One-versus-one = train k(k-1)/2 SVM classifiers to distinguish between each pair of classes and take a vote for the predicted class
One-versus-the-rest = train k SVM classifiers to distinguish between each class and all other classes
Bayesian network
Graphical model that represents a set of random variables and their conditional dependencies via a DAG
What is needed to construct a Bayesian network to represent a given joint distribution?
1) The DAG (structure of the BN)
2) Conditional probability distributions p(xk|pak) (parameters of the BN)
Collider
A node is a collider on a path if both arrows point to it on the path
Blocked path
A path is blocked by S if at least one of the following holds:
1) There is a collider on the path that is not in S and none of its descendants are in S
2) There is a non-collider on the path that is in S
How can you represent the joint probability distribution for a Bayesian network?
If the network has nodes X1, X2, …, Xn the joint probability distribution is: P(X1, X2, …, Xn) = ∏ (i=1 to n) P(Xi | Parents(Xi))
What is plate notation
A plate is drawn around a subset of variables, and the number of repetitions is indicated on the plate.
How would you translate a ML model to a BN?
Identify the random variables in the model and their dependencies, draw an edge into each variable from the variables it directly depends on, and attach the corresponding conditional distributions.
What does d-separated mean? What is its significance?
If all paths from node x to node y are blocked given nodes S then x and y are d-separated by S. This means x and y are conditionally independent given S in the DAG.
Prior distribution
Represents our beliefs about the parameters of a model before observing any data.
Posterior distribution
Combines the prior distribution and the likelihood to provide an updated belief about the parameters after observing the data.
What do we know about parameters and data in the Bayesian approach?
Parameters (unknown quantities we try to estimate) and latent variables (unobserved variables that influence the model) are treated as random variables
Data (observed variables) are also considered random but are known quantities
Ancestral sampling and example for: P(A,B,C,D,E) = p(A)p(B)p(C|A,B)p(D|C)p(E|B,C)
Generates samples from a joint probability distribution by sampling each node after its parents, from parents to children
e.g. We first sample values for A and B; suppose we get A=0, B=1. Then we sample C from the conditional distribution P(C|A=0, B=1), and so on
Rejection sampling example for P(A,B,C,D,E) = p(A)p(B)p(C|A,B)p(D|C)p(E|B,C)
If we wanted to sample from P(B,D|E=1) we would draw samples of (B,D,E) from the joint distribution and throw away those samples where E != 1
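A minimal sketch covering both this card and ancestral sampling above, for this network with made-up Bernoulli conditional distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample():
    """Ancestral sampling: each node is sampled after its parents."""
    a = rng.random() < 0.3                        # p(A)
    b = rng.random() < 0.6                        # p(B)
    c = rng.random() < (0.9 if a or b else 0.1)   # p(C|A,B)
    d = rng.random() < (0.7 if c else 0.2)        # p(D|C)
    e = rng.random() < (0.8 if b and c else 0.3)  # p(E|B,C)
    return a, b, c, d, e

# Rejection sampling for P(B,D|E=1): keep only the samples where E is true
kept = [(b, d) for a, b, c, d, e in (sample() for _ in range(10000)) if e]
print(len(kept), kept[:3])
```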
Markov chain
Series of random variables z1,…,zM such that the following holds for m ∈ {1,…,M-1}:
p(zm+1|z1,…,zm) = p(zm+1|zm)
Homogeneous Markov chain
If p(zm+1|zm) is the same for all m
Initial distribution in Markov chain
p(z1)
Transition distribution in Markov chain
It captures how likely it is to transition from the current state to any possible next state.
Markov chain Monte Carlo and significance of samples
Construct a sequence of distributions that eventually gets close to the desired distribution:
1) Draw a sample from each distribution in the sequence
2) Only keep the samples drawn once we are close enough to the desired distribution
*These samples are not independent
Target probability distribution
The distribution the Markov chain is constructed to converge to
For Bayesian ML this will be P(theta|D=d), the posterior distribution of the model parameters given the observed data
Metropolis-Hastings Algorithm
Define a single transition probability distribution for a homogeneous Markov chain.
Let current state be z^(T).
Generate a value z* by sampling from a proposal distribution q(z|z^(T))
Accept z* as the new state with a certain acceptance probability, in which case z^(T+1)=z*. If we don't accept z* then stay where you are, so z^(T+1)=z^(T)
Metropolis algorithm
Special case of the Metropolis-Hastings algorithm where the proposal distribution is symmetric (i.e., the probability of proposing a move to a state is the same as proposing a move back)
Proposal distribution
Distribution from which samples are drawn to generate candidate states in the Markov chain
Acceptance probability for MH and M algorithm
MH: α(x,y) = min(1, π(y)q(x|y) / (π(x)q(y|x)) )
where π is target distribution and q is proposal distribution
M: α(x,y) = min(1, π(y)/π(x))
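A minimal sketch of the Metropolis special case: with a symmetric Gaussian proposal the q terms cancel, leaving α = min(1, π(z*)/π(z)); the target here is an unnormalised standard Gaussian chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def target(z):
    return np.exp(-0.5 * z ** 2)   # unnormalised density is enough for MCMC

z = 0.0
samples = []
for _ in range(10000):
    z_star = z + rng.normal(0, 1)                 # symmetric proposal q(z|z^(T))
    alpha = min(1.0, target(z_star) / target(z))  # acceptance probability
    if rng.random() < alpha:
        z = z_star                                # accept: z^(T+1) = z*
    samples.append(z)                             # otherwise stay: z^(T+1) = z^(T)

print(np.mean(samples), np.std(samples))          # near 0 and 1 after burn-in
```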
Burn-in and what for
Initial period of MCMC simulation during which samples are discarded because early samples might not represent the target distribution well due to influence of initial state
Convergence in MCMC
Process by which distribution of samples generated by Markov chain approaches the target distribution
Why do we run several chains during MCMC?
Helps to diagnose convergence and explore parameter space
What can a sample from a distribution be used for
Can be used to approximate an expected value defined by that distribution
What is R hat used for?
Value computed from an MCMC run used to check for convergence.
If the run is successful, i.e. the chain has converged, it will be close to 1
Clustering
Unsupervised learning technique that groups similar data points into clusters
Soft clustering
Assigns data points to clusters probabilistically (e.g. via the responsibilities of a Gaussian mixture model fitted by MLE), rather than strictly assigning each point to a single cluster
Gaussian mixture model
Probabilistic model that assumes data is generated from a mixture of several Gaussian distributions (type of soft clustering model)
Mixing coefficient
Represents the proportion of the kth component in the overall distribution as a probability between 0 and 1
Responsibility in mixture model
The probability that a specific data point was generated by a particular cluster in the model
K-means algorithm
- Randomly initialise K cluster centroids
- Assign each data point to the closest cluster centroid based on Euclidean distance
- Recompute the centroids by calculating the mean of all data points assigned to each cluster
- Repeat the assignment and update steps until convergence, i.e. when the cluster assignments no longer change significantly
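A minimal numpy sketch of these steps (the toy data and k are illustrative; it assumes no cluster goes empty):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, n_iters=100):
    centroids = X[rng.choice(len(X), k, replace=False)]   # random initialisation
    for _ in range(n_iters):
        # Assignment step: closest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):                   # assignments stable
            break
        centroids = new
    return centroids, labels

X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
print(kmeans(X, 2)[0])   # centroids near (0, 0) and (5, 5)
```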
EM algorithm purpose and steps
MLE for Gaussian mixtures:
Initialise by choosing starting values for μ (means), Σ (covariances) and π (mixing coefficients)
Then compute iterations of:
E step: Compute values for the responsibilities given the current parameter values
M step: Re-estimate the parameters using the current responsibilities
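A minimal sketch of EM for a 1-D Gaussian mixture (the initialisation strategy and toy data are illustrative; a real implementation would also monitor the log-likelihood for convergence):

```python
import numpy as np

rng = np.random.default_rng(0)

def em_gmm(x, K, n_iters=100):
    mu = rng.choice(x, K, replace=False)   # initialise means from the data
    var = np.full(K, x.var())              # initialise variances
    pi = np.full(K, 1 / K)                 # initialise mixing coefficients
    for _ in range(n_iters):
        # E step: responsibilities gamma[n, k] given the current parameters
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
        dens = dens / np.sqrt(2 * np.pi * var)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate the parameters using the current responsibilities
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return mu, var, pi

x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm(x, 2))   # recovers means near -2 and 3
```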
Hindrance of EM algorithm
There is no guarantee that it will succeed in maximising the log-likelihood. It may converge to a local maximum which is not a global maximum of the log-likelihood function.
3 EM equations
(Log-likelihood of observed data X given parameters θ) ln p(X|θ) = L(q,θ) + KL(q||p)
(Lower bound on the log-likelihood) L(q,θ) = (sum over Z) q(Z) ln( p(X,Z|θ) / q(Z) )
q(Z) = distribution over latent variables Z
p(X,Z|θ) = joint probability of observed data X and latent variables Z given the parameters
(Kullback-Leibler divergence between q and the posterior p(Z|X,θ)) KL(q||p) = -(sum over Z) q(Z) ln( p(Z|X,θ) / q(Z) )
Key fact about Kullback-Leibler
KL(q||p) ≥ 0 for any choice of q, so L(q, θ) ≤ ln p(X|θ).
What is increased in E-step?
We increase L(q, θ) by updating q (and leaving θ fixed)
What is increased in M step?
We increase L(q, θ) by updating θ (and leaving q fixed)
Why does the EM algorithm work?
After the E-step we have L(q, θ) = ln p(X|θ) (and so KL(q||p) = 0), so that in the following M-step increasing L(q, θ) will also increase ln p(X|θ).
What is the ReLU function and where is it not differentiable (what do we do here instead)?
ReLU(x) = max(0, x): a straight line with slope 1 for positive values of x and a horizontal line at 0 for negative values of x.
It is not differentiable at x = 0, but in practice we just impose a gradient of 0 there.
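A minimal sketch of ReLU and the gradient-0-at-the-kink convention:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)   # gradient of 0 imposed at the kink x = 0

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), relu_grad(x))       # [0. 0. 3.] [0. 0. 1.]
```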