Terms Flashcards
Supervised Learning
In supervised learning, a learning algorithm (model) is trained with labeled data. The model learns from this labeled data in order to make predictions on unseen data.
Includes:
- Regression
- Classification
Linear Regression – objective
Given a labeled training set of m training examples, the objective is to find the model parameters that minimize the cost function.
We can find the values of the parameters with Gradient Descent.
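For concreteness, here is a minimal sketch of the squared-error cost for a linear model (the 1/(2m) scaling, the toy data, and the variable names are illustrative assumptions, not from these cards):

import numpy as np

def cost(theta, X, y):
    # Squared-error cost: J(theta) = 1/(2m) * sum((X @ theta - y)^2)
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)

# Toy data: y = 2*x, with a column of ones for the intercept term
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(cost(np.zeros(2), X, y))            # large cost for a poor guess of the parameters
print(cost(np.array([0.0, 2.0]), X, y))   # ~0 for the true parameters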
Gradient Descent:
Purpose
Algorithm components (2)
Convergence
Purpose: Gradient Descent is a general algorithm that is used to minimize different types of functions.
“Gradient” – the vector of partial derivatives of a differentiable function of multiple variables; it points in the direction of steepest increase.
Choose random input values to start, then move along the curve until the best spot is found – big steps when we’re far from the best spot, tiny steps as we get closer. We can tell how far we are from the slope of the Sum of Squared Residuals curve – the slope is close to 0 near the optimal value. The size of each step is the slope multiplied by a learning rate (alpha).
- -> here, the sum of squared residuals is the “Loss Function”
- -> Loss function depends on slope and intercept – get partial derivatives of loss function with respect to both slope and intercept
Whichever loss function you use, gradient descent process doesn’t change.
Gradient descent stops when either:
(1) the step size is very close to zero. (Versus least squares, which chooses point where slope is zero.)
(2) the pre-determined maximum number of steps has been taken (which might cut the process off before we reach the optimal point)
Steps:
- Take the gradient of the loss function (i.e., the partial derivative with respect to each parameter).
- Pick random values for the parameters
- Plug parameters into the gradient (the derivative)
- Calculate step sizes: step size = slope*learning rate
- Calculate new parameters: new = old - step size
Repeat steps 3, 4, and 5 until the step size is very small or the maximum number of steps has been reached.
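A minimal sketch of these steps for fitting a straight line with the sum of squared residuals as the loss (the data, the learning rate, and the stopping thresholds are made-up values):

import numpy as np

x = np.array([0.5, 2.3, 2.9])
y = np.array([1.4, 1.9, 3.2])

intercept, slope = 0.0, 1.0          # step 2: pick starting values for the parameters
alpha = 0.01                         # learning rate
for step in range(1000):             # cap on the maximum number of steps
    residuals = y - (intercept + slope * x)
    # step 3: plug the current parameters into the partial derivatives of the SSR
    d_intercept = -2 * np.sum(residuals)
    d_slope = -2 * np.sum(residuals * x)
    # step 4: step size = slope of the loss curve * learning rate
    step_int, step_slope = alpha * d_intercept, alpha * d_slope
    # step 5: new parameter = old parameter - step size
    intercept, slope = intercept - step_int, slope - step_slope
    if max(abs(step_int), abs(step_slope)) < 1e-6:   # stop when the step size is tiny
        break

print(intercept, slope)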
Stochastic gradient descent:
- For when there are a large number of data points
- For each step, uses a random sample of the data points
- Reduces time spent calculating the derivatives of the loss function
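A hedged sketch of that idea, using one randomly chosen data point per step (the synthetic data, learning rate, and number of steps are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=10_000)
y = 3 * x + 1 + rng.normal(scale=0.5, size=10_000)

intercept, slope, alpha = 0.0, 0.0, 0.001
for step in range(20_000):
    i = rng.integers(len(x))                  # a random sample (here: one point) per step
    residual = y[i] - (intercept + slope * x[i])
    # the derivative of a single squared residual is cheap: one term instead of 10,000
    intercept += alpha * 2 * residual
    slope += alpha * 2 * residual * x[i]

print(intercept, slope)   # should end up roughly near (1, 3)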
Two components of the algorithm:
- Learning rate (alpha): determines size of step when updating model parameters
- -> If too small, then it is too slow/takes a long time to converge.
- -> If too large, we might overshoot and even diverge.
- Partial derivatives of the cost function (the gradient): give the direction of the steepest slope; we step in the opposite direction to descend.
When does convergence occur:
Plot the cost function against the number of iterations performed by the algorithm; convergence has occurred when the curve flattens out, i.e. the cost stops decreasing meaningfully from one iteration to the next.
Feature scaling
Techniques for adjusting the ranges of the features so they are closer to each other.
This helps the gradient descent algorithm converge more quickly, with fewer iterations.
If range of values is too small: scale up.
If range of values is too large: scale down.
One technique for feature scaling is Mean Normalization.
Mean Normalization
A feature scaling technique.
x_scaled = (x - μ) / (x_max - x_min), where μ is the mean value of the feature.
(Using x_min instead of μ in the numerator gives the closely related min-max scaling.)
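A minimal sketch of this, applied column by column to a feature matrix (the toy house-size/bedroom numbers are made up):

import numpy as np

def mean_normalize(X):
    # (x - mean) / (max - min) per feature (column); results land roughly in [-1, 1]
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Two features with very different ranges: house size (sq ft) and number of bedrooms
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
print(mean_normalize(X))   # both columns now share a similar, small range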
Logistic Regression:
- what type of algorithm?
- what activation function?
- cost function?
Type of algorithm: classification
Activation functions:
Sigmoid (aka Logistic)
Cost function:
- Different from the linear regression cost function (plugging the sigmoid into the squared-error cost would give a wavy, non-convex function with many local optima)
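For reference, the usual logistic-regression cost is the log loss (cross-entropy), which is convex and so has a single global optimum; a minimal sketch (the labels and predicted probabilities below are arbitrary):

import numpy as np

def log_loss(y, p):
    # J = -1/m * sum( y*log(p) + (1 - y)*log(1 - p) )
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0, 0.0])    # true labels
p = np.array([0.9, 0.2, 0.7, 0.1])    # predicted probabilities from the sigmoid
print(log_loss(y, p))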
Sigmoid function/Logistic Function
An activation function in neural network training.
Used for binary classification – converts our output to values between 0 and 1.
f(X) = 1/(1 + e^(-X))
Large positive inputs map toward 1; large negative inputs map toward 0.
Often interpreted as a probability of positive classification.
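A minimal sketch of the function itself (the sample inputs are arbitrary):

import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)); the output always lies strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # approximately [0.007, 0.5, 0.993]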
Decision boundaries (binary, multi-class, nonlinear)
In binary classification – the line that separates positive and negative examples in the dataset
In multi-class classification – this may be a set of lines dividing the categories
Non-linear decision boundaries might help obtain better models
multi-class classification
A “one-vs-all” (aka “one-vs-rest”) technique – trains n binary logistic classifiers for the n different classes in the dataset.
Sets the labels of a single class to positive and the labels of all others to negative.
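A minimal sketch of the relabeling idea, with a tiny logistic-regression fit written inline (the toy 2-D data, learning rate, and iteration count are arbitrary assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, alpha=0.1, iters=5000):
    # Plain gradient descent on the logistic-regression cost for one binary problem
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= alpha * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

# Toy 3-class data: a bias column of ones plus two features (three separated clusters)
X = np.array([[1.0, 0.0, 0.0], [1.0, 0.3, 0.2],
              [1.0, 3.0, 0.0], [1.0, 2.8, 0.3],
              [1.0, 1.5, 2.5], [1.0, 1.6, 2.3]])
y = np.array([0, 0, 1, 1, 2, 2])

# One-vs-all: one binary classifier per class ("this class" = 1, everything else = 0)
weights = [fit_binary(X, (y == k).astype(float)) for k in range(3)]

# Predict by taking the class whose classifier outputs the highest probability
scores = np.column_stack([sigmoid(X @ w) for w in weights])
print(scores.argmax(axis=1))   # should recover [0 0 1 1 2 2]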
Overfitting:
- definition, other names, when it occurs
- solutions
aka “high variance” – when the model fits the training data too closely (often because there are too many features relative to the amount of training data) and generalizes poorly to unseen data.
Solutions:
- Reduce number of features. Two ways:
- -> Manually select which to keep
- -> Use a model-selection algorithm
- Regularization – keep all the features but shrink the parameter values; helpful when we have a lot of slightly-useful features.
Underfitting:
- definition/other names
- what is its opposite?
aka “high bias” – when the model is too simple (e.g., too few features) to capture the underlying structure of the data, so it fits even the training data poorly.
The opposite of overfitting/“high variance”, where the model is too complex and fits the training data too closely.
Neural networks
- description
- basic structure
- activation functions
- applications
Description:
A supervised learning algorithm. A good option when a linear classifier doesn’t work.
Basic structure:
- input layer
- hidden layer
- output layer
Activation functions:
- computed in the hidden layer
- two most common:
1. Sigmoid
2. Relu (rectified linear unit)
Applications:
- binary classification
- multiclass classification
- solving regression problems
- character recognition, image compression, prediction problems
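A minimal sketch of that structure as one forward pass through a single hidden layer (the layer sizes and random weights are illustrative, not from these cards):

import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])                   # input layer: 3 features

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # hidden layer: 4 units, ReLU activation
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # output layer: 1 unit, sigmoid activation

hidden = relu(W1 @ x + b1)                       # activations computed in the hidden layer
output = sigmoid(W2 @ hidden + b2)               # value between 0 and 1: binary classification score
print(output)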
Training a neural network:
- basic description, aim
- training phase algorithms & functions
- steps
Description/aim:
- finding the best model parameters (weights & biases) that minimize error
- Iterative process
- Computationally expensive
Training phase mechanism:
The gradient descent algorithm uses (1) back-propagation and (2) the cost function to find the optimum model parameters.
Steps:
- randomly initialize weights
- forward propagation – in order to obtain the output value for each training example
- compute cost function
- back propagation – in order to compute partial derivatives
- gradient checking – method to verify implementation of back propagation is working properly
- use either (a) gradient descent or (b) another built-in optimization function to minimize cost function by iteratively updating weights and biases
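A minimal sketch of this loop for a one-hidden-layer network on a tiny binary problem (the data, layer size, learning rate, and iteration count are made-up values; the gradient-checking step is omitted):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])   # 6 examples, 1 feature
y = np.array([[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]])

# randomly initialize weights (one hidden layer with 3 sigmoid units)
W1, b1 = rng.normal(size=(1, 3)), np.zeros((1, 3))
W2, b2 = rng.normal(size=(3, 1)), np.zeros((1, 1))
alpha = 0.5

for _ in range(5000):
    # forward propagation: obtain the output value for each training example
    hidden = sigmoid(X @ W1 + b1)
    out = sigmoid(hidden @ W2 + b2)
    # compute the cost function (binary cross-entropy)
    cost = -np.mean(y * np.log(out) + (1 - y) * np.log(1 - out))
    # back propagation: partial derivatives of the cost w.r.t. every weight and bias
    d_out = (out - y) / len(X)
    dW2, db2 = hidden.T @ d_out, d_out.sum(axis=0, keepdims=True)
    d_hidden = (d_out @ W2.T) * hidden * (1 - hidden)
    dW1, db1 = X.T @ d_hidden, d_hidden.sum(axis=0, keepdims=True)
    # gradient descent: iteratively update the weights and biases
    W1, b1 = W1 - alpha * dW1, b1 - alpha * db1
    W2, b2 = W2 - alpha * dW2, b2 - alpha * db2

print(np.round(out.ravel(), 2))   # should move toward [0, 0, 0, 1, 1, 1]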
Back-propagation
- Description, aim, use
- How it works
Description/aim/use:
Algorithm used to find the optimal model parameters (i.e., the weights and biases of a neural network) based on the training data, in an iterative manner.
How it works:
Computes partial derivatives of the cost function with respect to the weight and bias values.
MORE INTUITIVE:
When a parameter is unknown, for instance a bias term b_i:
1. Use the chain rule to get the derivative of the loss function (e.g., the sum of squared residuals) with respect to that parameter (i.e., d SSR / d b_i)
2. Initialize the unknown parameter to some value
3. Use gradient descent to optimize the unknown parameter
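A small sketch of step 1 for a single bias b, checking the chain-rule derivative of the SSR against a numerical estimate (the toy data and the fixed weight w are made up):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.2, 1.9, 3.3])
w, b = 0.8, 0.0                 # a one-parameter "network": prediction = w * x + b

def ssr(b):
    return np.sum((y - (w * x + b)) ** 2)

# Chain rule: d SSR / d b = sum( 2 * (y - pred) * (-1) ) = -2 * sum(y - pred)
chain_rule = np.sum(-2 * (y - (w * x + b)))

# Numerical estimate (the gradient-checking idea): (SSR(b + h) - SSR(b - h)) / (2h)
h = 1e-6
numerical = (ssr(b + h) - ssr(b - h)) / (2 * h)

print(chain_rule, numerical)    # the two values should agree closely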
(Neural) network architecture
- What is meant by the term
- Options & benefits/drawbacks
Meaning:
the number of hidden layers (and units per layer) to use – i.e., the “connectivity pattern” between the neurons
Options:
More hidden layers can improve model performance, but at the cost of increased computational complexity.