ML Flashcards
Unsupervised Learning and how is the model trained
Only input data is provided; the model learns to extract patterns from the data
Supervised Learning and how is the model trained
Each input is paired with an output (target), and the model is trained to minimise the error between its predictions and the targets
Regression
Predicting a continuous output from the input by modelling the relationship between the variables, e.g. predicting house price from floor area
Classification
Predicting a discrete category label for each input, e.g. dog or bagel
Underfitting
A model that is too simple to capture the underlying structure of the data
Overfitting
A model that fits minor variations or noise in the training data, harming generalisation to new data
Model Selection and how does it work
Selecting the best-performing model from a set of candidates
Split dataset for training and validation (and testing)
Training dataset
Used to train/optimise the model
Validation dataset
Used to compare candidate models and tune hyperparameters
Test dataset
Used for a final, unbiased assessment of how well the model generalises
Cross-validation
Split the data into S groups; in each of S runs, S-1 groups ((S-1)/S of the data) are used for training and the held-out group for validation, and performance is averaged over the runs
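The split above can be sketched in Python; `cross_validation_folds` is a hypothetical helper, not from the cards:

```python
# S-fold cross-validation index splitting: each fold holds out one of
# S groups for validation and trains on the remaining (S-1)/S of the data.
def cross_validation_folds(n_samples, S):
    indices = list(range(n_samples))
    fold_size = n_samples // S
    folds = []
    for s in range(S):
        val = indices[s * fold_size:(s + 1) * fold_size]
        train = indices[:s * fold_size] + indices[(s + 1) * fold_size:]
        folds.append((train, val))
    return folds

# With 10 samples and S=5: each run trains on 8 indices, validates on 2.
folds = cross_validation_folds(10, S=5)
```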
No free lunch theorem
No single model works best for every problem; averaged over all possible problems, every learning algorithm performs equally well
Model parameters
Values learned from training data
Parametric model
# of parameters stays the same as the quantity of data increases
Non-parametric model
# of parameters grows with the quantity of data
Likelihood function
Probability of data given model parameters
Maximum likelihood estimation
Method for estimating the parameters of a probabilistic model by choosing the values that maximise the likelihood of the observed data
Linear regression model formula
y = w^T x + e, i.e. p(y|x,w) = N(y | w^T x, σ^2)
What is the distribution of e in the linear regression model and what is the bias parameter in the linear regression model formula and what is it for
Gaussian distribution with mean 0 and variance σ^2 (standard deviation squared): e ~ N(0, σ^2)
The bias is absorbed into the vector w by adding a dummy input variable that always has the value 1; it shifts the fit, giving extra flexibility to fit the data
Linear regression to a feature vector
y = w^T ϕ(x) + e, i.e. p(y|ϕ(x),w) = N(y | w^T ϕ(x), σ^2)
Least-squares problem for linear regression formula
w = (X^T X)^-1 X^T y
Two key points about least-squares solutions
The solution has a closed-form
The solution is also the maximum likelihood solution (under the Gaussian noise assumption)
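A minimal NumPy sketch of the closed-form solution, using noiseless synthetic data (the data values are illustrative, not from the cards):

```python
import numpy as np

# First column of X is the dummy 1 input that absorbs the bias.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
true_w = np.array([0.5, 2.0])   # bias 0.5, slope 2.0
y = X @ true_w                  # noiseless targets for illustration

# Closed-form least-squares solution: w = (X^T X)^-1 X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y
```

In practice `np.linalg.lstsq` is numerically preferable to forming the inverse explicitly.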
What does the Linear Discriminant compute (formula) and what class is x assigned to according to y
y=w^T x
y>=0 means C1
Otherwise C2
What assumptions are made for parameters to be learnt by applying MLE?
- Data for each class have a Gaussian distribution
- These two Gaussian distributions have the same covariance matrix
Logistic sigmoid function
sigmoid(a) = 1/(1+e^-a)
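The formula above translates directly to Python:

```python
import math

def sigmoid(a):
    # Logistic sigmoid: squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-a))

# sigmoid(0) = 0.5; large positive inputs approach 1, large negative approach 0
```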
Logistic regression model for two classes
p(C1|x) = sigmoid(w^T x)
p(C2|x) = 1 - p(C1|x)
How can the MLE parameters for logistic regression be found?
There is no closed-form solution, so iterative optimisation such as gradient descent is used
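A toy sketch of fitting logistic regression by gradient descent; the data, learning rate, and iteration count are illustrative choices, not from the cards:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# 1-D inputs with a dummy 1 feature so w[0] acts as the bias.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
t = [0, 0, 1, 1]                 # class labels
w = [0.0, 0.0]
lr = 0.5                         # illustrative learning rate

for _ in range(200):
    # Gradient of the negative log-likelihood: sum_n (y_n - t_n) x_n
    grad = [0.0, 0.0]
    for x_n, t_n in zip(X, t):
        y_n = sigmoid(w[0] * x_n[0] + w[1] * x_n[1])
        grad[0] += (y_n - t_n) * x_n[0]
        grad[1] += (y_n - t_n) * x_n[1]
    w = [w_i - lr * g_i for w_i, g_i in zip(w, grad)]

# After training, p(C1|x) is high for positive x and low for negative x.
```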
Weights
Parameters that transform input data within each neuron
Activation function
Determines whether a neuron should be activated and how to transform the input signal into an output signal
Input layer and how many neurons
Consists of neurons that receive input data and pass to next layer - # of neurons = # of features in input dataset
Hidden Layer
Layer of neurons between input and output which processes input data using weights and activation functions
Output layer
Final layer that produces result/prediction
Cost/Loss function
Measures difference between predicted and true output
Forward pass process
1) Calculate activations of hidden layer, h
2) Pass result of step 1 through a nonlinear function e.g. sigmoid
3) Use step 2 to calculate activations of output layer, o
4) Compute predictions using sigmoid of step 3
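The four steps above can be sketched for a tiny network with two inputs, two hidden units, and one output; the weight values are illustrative assumptions:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

x = [1.0, 0.5]                    # one input example
W1 = [[0.2, -0.4], [0.7, 0.1]]    # hidden weights: 2 units x 2 inputs
W2 = [0.5, -0.3]                  # output weights: 1 unit x 2 hidden

# 1) Activations of the hidden layer: a_j = sum_i W1[j][i] * x[i]
a = [sum(W1[j][i] * x[i] for i in range(2)) for j in range(2)]
# 2) Pass through a nonlinear function, e.g. sigmoid
h = [sigmoid(a_j) for a_j in a]
# 3) Activation of the output layer
o = sum(W2[j] * h[j] for j in range(2))
# 4) Prediction via sigmoid of the output activation
y = sigmoid(o)
```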
Backpropagation / backward pass and formula and what do the symbols represent
Algorithm used to compute gradients of loss function with respect to each weight to update weights across multiple layers based on derivative of error wrt the weight
formula is δj = h′(aj) Σk wkj δk
δj = error signal for jth hidden unit
wkj = weight connecting hidden unit j to output unit k
h’(aj) = derivative of activation function
δk = error signal at kth output unit
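The backpropagation formula for one hidden unit can be evaluated numerically; the activation and error values below are illustrative, and sigmoid is assumed as the activation so h′(a) = sigmoid(a)(1 − sigmoid(a)):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

a_j = 0.75                # pre-activation of hidden unit j (illustrative)
w_kj = [0.5, -0.3]        # weights from unit j to the two output units
delta_k = [0.2, -0.1]     # error signals at the output units

# delta_j = h'(a_j) * sum_k w_kj * delta_k
h_prime = sigmoid(a_j) * (1.0 - sigmoid(a_j))
delta_j = h_prime * sum(w * d for w, d in zip(w_kj, delta_k))
```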
Vanishing gradient problem
When gradients used to update the weights during backpropagation become too small which slows down/stops learning process
Exploding gradient problem
When gradients grow too large during backpropagation, causing weights to become large and degrading model performance
Gradient clipping and formula and what do the variables mean
Used to cap the magnitude of the gradient:
g’ = min(1, c/||g||) g
g = gradient vector, ||g|| = its norm, c = constant value to limit the size of the gradient
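A minimal sketch of the clipping formula (assuming a nonzero gradient vector):

```python
import math

def clip_gradient(g, c):
    # g' = min(1, c/||g||) g: rescale g only when its norm exceeds c
    norm = math.sqrt(sum(g_i * g_i for g_i in g))
    scale = min(1.0, c / norm)
    return [scale * g_i for g_i in g]

clipped = clip_gradient([3.0, 4.0], c=1.0)   # ||g|| = 5, rescaled to norm 1
small = clip_gradient([0.3, 0.4], c=1.0)     # ||g|| = 0.5, left unchanged
```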
Residual network and formula
Each layer is a residual layer with a skip connection, a shortcut that lets gradients flow directly from the output back to earlier layers, defined as:
F1′(x) = F1(x) + x, where F1(x) is the standard mapping (a linear transformation followed by an activation)
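A scalar sketch of a residual layer, with sigmoid assumed as the activation and an illustrative weight and bias:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def residual_layer(x, w=0.8, b=0.1):
    # F1'(x) = F1(x) + x
    f1 = sigmoid(w * x + b)   # standard mapping: linear then activation
    return f1 + x             # skip connection adds the input back
```

Because the input is added back unchanged, the gradient of the output with respect to x always contains a direct +1 term, which is what lets gradients bypass the transformation.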