ML Flashcards

1
Q

Unsupervised Learning and how is the model trained

A

Only input data provided and models learn to extract patterns from the data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Supervised Learning and how is the model trained

A

Each input paired with an output (target) and model trained to minimise the error

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Regression

A

Finding the relationship between two variables where the model is linear

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Classification

A

Category labels e.g. dog or bagel

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Underfitting

A

A model that is too simple which does not fit the data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Overfitting

A

A model that fits minor variations or noise

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Model Selection and how does it work

A

Selecting correct model
Split dataset for training and validation (and testing)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Training dataset

A

Used to train/optimise the model

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Validation dataset

A

Used for validating the model

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Test dataset

A

Used to test model for general fitting quality

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Cross-validation

A

Split data into S groups so (S-1)/S data used for training

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

No free lunch theorem

A

All models are wrong, but some are useful

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Model parameters

A

Values learned from training data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Parametric model

A

of parameters stay the same as quantity of data increases

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Non-parametric model

A

of parameters increase/decrease as quantity of data increases

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Likelihood function

A

Probability of data given model parameters

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Maximum likelihood estimation

A

Method for estimating parameters of a probabilistic model

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Linear regression model formula

A

p(y|x,w) = w^T x + e

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What is the distribution of e in the linear regression model and what is the bias parameter in the linear regression model formula and what is it for

A

Gaussian distribution with mean 0: N(0, standard deviation squared)

Within the vector w by addition of dummy variable which always has value 1 and gives extra flexibility to fit the data 1

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

Linear regression to a feature vector

A

p(y|ϕ(x), w) = w^T ϕ(x) + e

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Least-squares problem for linear regression formula

A

Theta = (X^T X)^-1 X^T y

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

Two key points about least-squares solutions

A

The solution has a closed-form
The solution is also the maximum likelihood solution

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

What does the Linear Discriminant compute (formula) and what class is x assigned to according to y

A

y=w^T x
y>=0 means C1
Otherwise C2

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

What assumptions are made for parameters to be learnt by applying MLE?

A
  1. Data for each class have a Gaussian distribution
  2. These two Gaussian distributions have the same covariance matrix
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

Logistic sigmoid function

A

sigmoid(a) = 1/(1+e^-a)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

Logistic regression model for two classes

A

p(C1|x) = sigmoid(w^T x)
p(C2|x) = 1 - p(C1|x)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

How can the MLE parameters for logistic regression be found?

A

Using gradient descent

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

Weights

A

Parameters that transform input data within each neuron

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

Activation function

A

Determines whether a neuron should be activated and how to transform the input signal into an output signal

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

Input layer and how many neurons

A

Consists of neurons that receive input data and pass to next layer - # of neurons = # of features in input dataset

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

Hidden Layer

A

Layer of neurons between input and output which processes input data using weights and activation functions

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

Output layer

A

Final layer that produces result/prediction

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
32
Q

Cost/Loss function

A

Measures difference between predicted and true output

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
33
Q

Forward pass process

A

1) Calculate activations of hidden layer, h

2) Pass result of step 1 through a nonlinear function e.g. sigmoid

3) Use step 2 to calculate activations of output layer, o

4) Compute predictions using sigmoid of step 3

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
34
Q

Backpropagation / backward pass and formula and what do the symbols represent

A

Algorithm used to compute gradients of loss function with respect to each weight to update weights across multiple layers based on derivative of error wrt the weight

formula is δj = h′(aj )(Sum to k of:wkj δk)

δj = error signal for jth hidden unit
wkj = weight connecting hidden unit j to output unit k
h’(aj) = derivative of activation function
δ k = error signal at kth output unit

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
35
Q

Vanishing gradient problem

A

When gradients used to update the weights during backpropagation become too small which slows down/stops learning process

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
36
Q

Exploding gradient problem

A

When gradients grow too large during backpropagation, causing weights to become large and degrading model performance

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
37
Q

Gradient clipping and formula and what do the variables mean

A

Used to cap magnitude of gradient :
g’ = min(1, c/||g|| ) g
c = constant value to limit size of gradient

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
38
Q

Residual network and formula

A

Each layer has the form of a residual layer which is like a shortcut for gradients to flow directly from output to previous layer defined to be:
F1’ = F1(x)+x where F1(x) is the standard mapping (linear transformation and then actication)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
38
Q

Non-saturating activation function

A

Activation functions that don’t compress inputs into a limited range, avoiding vanishing gradients

39
Q

Parameter initialisation significance

A

Necessary to choose good initial parameters to avoid issues with gradient descent

40
Q

Early stopping

A

Process of creating a validation set and stopping training once the error on the validation step starts to increase

41
Q

Weight decay formula

A

Adds a term: (lambda)w^T w to the loss function where user chooses value of lambda>0 to penalise large weights

42
Q

Dropout and pro

A

When all outgoing connections from a unit with some given probability are turned off during training to ensure that each unit can perform well without depending on other units

This can drastically reduce overfitting

43
Q

Classification tree

A

Decision tree where each internal node represents decision based on a feature and each leaf represents a class label

44
Q

Regression tree

A

Decision tree where each internal node represents a decision based on a feature and each leaf node represents a predicted value

45
Q

Are trees parametric or non-parametric?

A

Non-parametric

46
Q

CART (Classification And Regression Trees) algorithm

A

Learning the tree structure greedy algorithm:
1) Start from root node
2) Run exhaustive search over each possible variable and threshold for a new node:
For each variable and threshold:
a. Compute average of target variable for each leaf of the proposed node
b. Compute error if we stop adding nodes here
3) Choose variable and threshold that minimise the error
4) Add a new node for the chosen variable and threshold
5) Repeat step 2 until there are only n data points associated with each leaf node
6) Prune back the tree to remove branches that do not reduce error by more than a small tolerance value, e

Pruning:
1) Start with a tree T0
2) Consider pruning each node in T0 by combining branches to obtain tree T
3) Compute C(T) = (Sum from T=1 to |T|) et(T) + lambda |T| where
- |T| is # of leaves
- et(T) is error associated with t’th leaf of tree
- lambda is a penalty added for the # of leaves in the tree
4) If C(T) <= C(T0) keep the pruned tree, otherwise reinstate the pruned node

47
Q

Kernel function and example

A

Used to compute dot product between two data points e.g.
For two vectors, x and x’, possible kernel function K(x,x’) = ϕ(x).ϕ(x’) where ϕ is mapping to a higher-dimensional space

48
Q

Kernel trick

A

We just need the kernel function to make a prediction so evaluate the kernel function values e.g. k(x1, x) without first computing ϕ(x1) and ϕ(x) and then compute their scalar product

49
Q

Dual parameter

A

Indicate how much each training sample contributes to the decision boundary

50
Q

Gram matrix

A

Defined to be K = ϕϕ^T so Knm = ϕ(xn)^T ϕ(xm) which is the similarity between the nth and mth datapoint

51
Q

Margin

A

Distance from decision boundary (hyperplane) to closest training datapoint

52
Q

Support vectors

A

Data points that lie closest to decision boundary (margin)

53
Q

Soft margin

A

Allows points to lie on wrong side of hyperplane in exchange for a penalty which is controlled by a regularisation parameter

54
Q

Slack variables

A

Measures distance of misclassified points from decision boundary

55
Q

How can SVMs be extended to more than two classes? (for k classes)

A

One-versus-one = train k(k-1)/2 SVM classifiers to distinguish between each pair of classes and take a vote for the predicted class

One-versus-the-rest = train k SVM classifiers to distinguish between each class and all other classes

56
Q

Bayesian network

A

Graphical model that represents a set of random variables and their conditional dependencies via a DAG

57
Q

What is needed to construct a Bayesian network to represent a given joint distribution?

A

1) The DAG (structure of the BN)
2) Conditional probability distributions p(xk|pak) (parameters of the BN)

58
Q

Collider

A

A node is a collider on a path if both arrows point to it on the path

59
Q

Blocked path

A

A path is blocked by S if at least one of the following:

1) There is a collider that is not in S and none of its desendants are in S

2) There is a non-collider on the path that is in S

60
Q

How can you represent the joint probability distribution for a Bayesian network?

A

If the network has nodes X1,X2,…,Xn the joint probability distribution is: P(X1,X2,..,Xn)=∏ i=1to n (P(Xi)|Parents(Xi))

61
Q

What is plate notation

A

A plate is drawn around a subset of variables, and the number of repetitions is indicated on the plate.

62
Q

How would you translate a ML model to a BN?

A

To translate a machine learning model into a Bayesian network, you identify the random variables in the model and their dependencies.

63
Q

What does d-separated mean? What is its significance?

A

If all paths from node x to node y are blocked given nodes S then x and y are d-separated by S. This means x and y are conditionally independent given S in the DAG.

64
Q

Prior distribution

A

Represents our beliefs about the parameters of a model before observing any data.

65
Q

Posterior distribution

A

Combines the prior distribution and the likelihood to provide an updated belief about the parameters after observing the data.

66
Q

What do we know about parameters and data in the Bayesian approach?

A

Parameters (unknown quantities we try to estimate) and latent variables (unobserved variables that influence the model) are treated as random variables

Data (observed variables) are also considered random but are known quantities

67
Q

Ancestral sampling and example for: P(A,B,C,D,E) = p(A)p(B)p(C|A,B)p(D|C)p(E|B,C)

A

Generates samples from a joint probability distribution from parents to children
e.g. We first sample values for A and B, suppose we get A=0, B=1. Then we sample from C from the conditional distribution P(C|A=0, B=1) and so on

68
Q

Rejection sampling example for P(A,B,C,D,E) = p(A)p(B)p(C|A,B)p(D|C)p(E|B,C)

A

If we wanted to sample from P(B,D|E=1) we would sample from the marginal distribution P(B,D,E) and throw away those samples where E!=1

69
Q

Markov chain

A

Series of random variables z1,…,zm such that the following holds for mE{1,…,M-1}:
p(zm+1|z1,…,zm) = p(zm+1|zm)

70
Q

Homogeneous Markov chain

A

If p(zm+1|zm) is the same for all m

71
Q

Initial distribution in Markov chain

A

p(z1)

72
Q

Transition distribution in Markov chain

A

It captures how likely it is to transition from the current state to any possible next state.

73
Q

Markov chain Monte Carlo and significance of samples

A

Sample a sequence of distributions which eventually get close to desired distribution:
1) Draw sample from each distribution in the sequence
2) Only keep samples when we get close enough to desired distribution

*These samples are not independent

74
Q

Target probability distribution

A

Used to construct a Markov chain
For Bayesian ML this will be P(theta|D=d), the posterior distribution of the model parameters given the observed data

75
Q

Metropolis-Hastings Algorithm

A

Define single transition probability distribution for a homogeneous Markov chain.
Let current state be z^(T).
Generate a value z* by sampling from a proposal distribution q(z|z^(T))
Accept z* as the new state with certain acceptance probability in which case z^(T+1)=z. If we don’t accept z then stay where you are so z^(T+1)=z^(T)

75
Q

Metropolis algorithm

A

Special case of the Metropolis-Hastings algorithm where the proposal distribution is symmetric (i.e., the probability of proposing a move to a state is the same as proposing a move back)

76
Q

Proposal distribution

A

Distribution from which samples are drawn to general potential states in Markov chain

77
Q

Acceptance probability for MH and M algorithm

A

MH: α(x,y) = min(1, π(y)q(y|x)/π(x)q(x|y) )
where π is target distribution and q is proposal distribution

M: α(x,y) = min(1, π(y)/π(x))

78
Q

Burn-in and what for

A

Initial period of MCMC simulation during which samples are discarded because early samples might not represent the target distribution well due to influence of initial state

79
Q

Convergence in MCMC

A

Process by which distribution of samples generated by Markov chain approaches the target distribution

80
Q

Why do we run several chains during MCMC?

A

Helps to diagnose convergence and explore parameter space

81
Q

What can a sample from a distribution be used for

A

Can be used to approximate an expected value defined by that distribution

82
Q

What is R hat used for?

A

Value computed from an MCMC run used to check for convergence.
If the run is successful ie. there’s been a convergence, it will be close to 1

83
Q

Clustering

A

Unsupervised learning technique to group similar data points into groups

84
Q

Soft clustering

A

Assigns data points to clusters probabilistically by applying MLE to a Gaussian mixture model, rather than strictly assigning each point to single cluster

85
Q

Gaussian mixture model

A

Probabilistic model that assumes data is generated from a mixture of several Gaussian distributions (type of soft clustering model)

86
Q

Mixing coefficient

A

Represents the proportion of the kth component in the overall distribution as a probability between 0 and 1

87
Q

Responsibility in mixture model

A

The probability that a specific data point was generated by a particular cluster in the model

88
Q

K-means algorithm

A
  • Randomly initialise K cluster centroids
  • Assign each data point to the closest cluster centroid based on Euclidean distance
  • Recompute the centroids by calculating the mean of all data points assigned to each cluster
  • Repeat the assignment and update steps until convergence ie. when the cluster assignments no longer change significantly
89
Q

EM algorithm purpose and steps

A

MLE for Gaussian mixtures:
Intialise by choosing starting variables for u (mean), covariance, pi

Then compute iterations of:
E step: Compute values for the responsibilities given the current parameter values

M step: Re-estimate the parameters using the current responsibilities

90
Q

Hinderance of EM algorithm

A

There is no guarantee that it will succeed in maximising the log-likelihood. It may converge to a local maximum which is not a global maximum of the log-likelihood function.

91
Q

3 EM equations

A

(log likelihood of observed data X given parameters theta) ln p(X∣θ)=L(q,θ)+KL(q∣∣p)

(Lower bound on log-likelihood) L(q,θ)= (sum to Z) q(Z) ln(p(X,Z∣θ)/q(Z))
q(Z) = distribution over latent variables Z
P(X, Z|θ) = joint probability of observed data X and latent variables Z given parameters

(Kullback-Leibler divergence between probability distributions p1 and p2) KL(q||p)= -(sum to Z) q(Z) ln(p(Z∣X,θ)/q(Z))

92
Q

Key fact about Kullback-Leibler

A

KL(q||p) ≥ 0 for any choice of q, so L(q, θ) ≤ ln p(X|θ).

93
Q

What is increased in E-step?

A

We increase L(q, θ) by updating q (and leaving θ fixed)

94
Q

What is increased in M step?

A

We increase L(q, θ) by updating θ (and leaving q fixed)

95
Q

Why does the EM algorithm work?

A

After the E-step we have L(q, θ) = ln p(X|θ) (and so KL(q||p) = 0), so that in the following M-step increasing L(q, θ) will also increase ln p(X|θ).

96
Q

What is the ReLU function and where is it not differentiable (what do we do here instead)?

A

Straight line with slope 1 for positive values of 𝑥 and a horizontal line at 0 for negative values of 𝑥.
It is not differentiable at 0 but we just impose a gradient of 0.