Terms Flashcards

1
Q

Supervised Learning

A

In supervised learning, a learning algorithm (model) is trained on labeled data. The algorithm learns from this labeled data in order to make predictions on unseen data.

Includes:

  • Regression
  • Classification
2
Q

Linear Regression – objective

A

Given a labeled training set of m training examples, the objective is to find the model parameters that minimize the cost function.

We can find the values of the parameters with gradient descent (see the cost-function sketch below).
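For intuition, a minimal numpy sketch of the usual squared-error cost over m training examples (the card doesn't write the cost out; the variable names here are illustrative):

    import numpy as np

    def cost(theta0, theta1, X, y):
        """Squared-error cost J over m training examples for a simple
        linear model h(x) = theta0 + theta1 * x."""
        m = len(X)
        predictions = theta0 + theta1 * X
        return (1 / (2 * m)) * np.sum((predictions - y) ** 2)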

3
Q

Gradient Descent:

Purpose
Algorithm components (2)
Convergence

A

Purpose: Gradient descent is a general algorithm used to minimize many different types of functions.

“Gradient” – the vector of partial derivatives of a differentiable function of multiple variables

Choose random input values to start, then move along the curve toward the best spot – taking big steps when we’re far from it and tiny steps as we get closer. We can tell how far away we are by looking at the slope of the Sum of Squared Residuals curve – we are close to the optimal value when the slope is close to 0. So we determine the size of each step by multiplying the slope by a learning rate (alpha).

  • -> here, the sum of squared residuals is the “Loss Function”
  • -> Loss function depends on slope and intercept – get partial derivatives of loss function with respect to both slope and intercept

Whichever loss function you use, the gradient descent process doesn’t change.

Gradient descent stops when either:

(1) the step size is very close to zero. (Versus least squares, which chooses point where slope is zero.)
(2) the pre-determined maximum number of steps has been taken (which might cut the process off before we reach the optimal point)

Steps:

  1. Take the gradient of the loss function (i.e., the partial derivative with respect to each parameter).
  2. Pick random values for the parameters
  3. Plug parameters into the gradient (the derivative)
  4. Calculate step sizes: step size = slope*learning rate
  5. Calculate new parameters: new = old - step size

Repeat steps 3, 4, 5 until the step sizes are small or the maximum number of steps has been reached. (A sketch of these steps follows.)
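A minimal numpy sketch of these steps for the slope/intercept example, assuming the sum of squared residuals as the loss; the learning rate and stopping thresholds are illustrative:

    import numpy as np

    def gradient_descent(X, y, alpha=0.001, max_steps=1000, tol=1e-6):
        intercept, slope = np.random.randn(2)        # step 2: random start
        for _ in range(max_steps):                   # stop condition (2)
            residuals = (intercept + slope * X) - y
            # steps 1 & 3: plug the current parameters into the partial
            # derivatives of the sum of squared residuals
            d_intercept = 2 * np.sum(residuals)
            d_slope = 2 * np.sum(residuals * X)
            # step 4: step size = slope of the loss curve * learning rate
            step_i, step_s = alpha * d_intercept, alpha * d_slope
            intercept -= step_i                      # step 5: new = old - step size
            slope -= step_s
            if max(abs(step_i), abs(step_s)) < tol:  # stop condition (1)
                break
        return intercept, slope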

Stochastic gradient descent:

    • For when there are a large number of data points
    • For each step, uses a random sample of the data points
    • Reduces time spent calculating the derivatives of the loss function

Two components of the algorithm:

  • Learning rate (alpha): determines size of step when updating model parameters
  • -> If too small, then it is too slow/takes a long time to converge.
  • -> If too large, we might overshoot and even diverge.
  • Partial derivatives of cost function: gives us the direction of the steepest slope.

When does convergence occur:
Plot the cost function against the number of iterations performed by the algorithm – convergence is reached when the curve flattens out and the cost stops decreasing.

4
Q

Feature scaling

A

Techniques that adjust the ranges of the features so they are closer to each other.

This helps the gradient descent algorithm converge more quickly, with fewer iterations.

If range of values is too small: scale up.

If range of values is too large: scale down.

One technique for feature scaling is Mean Normalization.

5
Q

Mean Normalization

A

A feature scaling technique: subtract the feature’s mean, then divide by its range.

(X - mean(X)) / (X_max - X_min)
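A one-line numpy sketch (feature columns assumed in X):

    import numpy as np

    def mean_normalize(X):
        """Center each feature on 0 and divide by its range,
        giving values roughly in [-1, 1]."""
        return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))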

6
Q

Logistic Regression:

  • what type of algorithm?
  • what activation function?
  • cost function?
A

Type of algorithm: classification

Activation function:
Sigmoid (aka Logistic)

Cost function:
- Different from the linear regression cost function (using that one here would give a wavy, non-convex form with many local optima); see the sketch below.
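The standard convex choice is the log loss (cross-entropy); a minimal numpy sketch:

    import numpy as np

    def log_loss(y, y_hat, eps=1e-12):
        """Cross-entropy cost: convex in the parameters, so gradient
        descent has a single optimum instead of many local ones."""
        y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
        return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))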

7
Q

Sigmoid function/Logistic Function

A

An activation function in neural network training.

Used for binary classification – converts our output to values between 0 and 1.

f(X) = 1/(1 + e^(-X))

Large positive inputs map toward 1; large negative inputs map toward 0.

Often interpreted as the probability of a positive classification.
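As a quick numpy sketch:

    import numpy as np

    def sigmoid(x):
        """f(x) = 1 / (1 + e^(-x)) squashes any real input into (0, 1)."""
        return 1 / (1 + np.exp(-x))

    sigmoid(np.array([-5.0, 0.0, 5.0]))   # ~[0.007, 0.5, 0.993]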

8
Q
Decision boundaries
(binary, multi-class, nonlinear)
A

In binary classification – the line that separates the positive and negative examples in the dataset

In multi-class classification – this may be a set of lines dividing the categories

Non-linear decision boundaries might help obtain better models

9
Q

multi-class classification

A

A “one-vs-all” (aka “one-vs-rest”) technique – trains n binary logistic classifiers for the n different classes in the dataset.

Sets the labels of a single class to positive and the labels of all others to negative.
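A minimal scikit-learn sketch (the dataset here is illustrative):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    X, y = load_iris(return_X_y=True)   # 3 classes -> 3 binary classifiers
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
    clf.predict(X[:5])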

10
Q

Overfitting:

  • definition, other names, when it occurs
  • solutions
A

aka “high variance” – when the model fits the training data too closely and fails to generalize to new data; often happens when we have too many features in our dataset

Solutions:

  1. Reduce number of features. Two ways:
    - -> Manually select which to keep
    - -> Use a model-selection algorithm
  2. Regularization – helpful when we have a lot of slightly-useful features.
11
Q

Underfitting:

  • definition/other names
  • what is its opposite?
A

aka “high bias” – when the model is too simple to capture the pattern in the data (e.g. too few features).

The opposite of overfitting/”high variance”, when the model is too complex or has too many features.

12
Q

Neural networks

    • description
    • basic structure
    • activation functions
    • applications
A

Description:
A supervised learning algorithm. A good option when a linear classifier doesn’t work.

Basic structure:

  1. input layer
  2. hidden layer
  3. output layer

Activation functions:

    • computed in the hidden layer
    • two most common:
      1. Sigmoid
      2. ReLU (rectified linear unit)

Applications:

    • binary classification
    • multiclass classification
    • solving regression problems
    • character recognition, image compression, prediction problems
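A minimal numpy sketch of the three-layer structure with a ReLU hidden layer (shapes and names are illustrative):

    import numpy as np

    def relu(z):
        return np.maximum(0, z)

    def forward(x, W1, b1, W2, b2):
        """Input layer -> hidden layer (activation) -> output layer."""
        hidden = relu(W1 @ x + b1)   # activation computed in the hidden layer
        return W2 @ hidden + b2      # add a sigmoid here for binary classification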
13
Q

Training a neural network:

    • basic description, aim
    • training phase algorithms & functions
    • steps
A

Description/aim:

    • finding the best model parameters (weights & biases) that minimize error
    • Iterative process
    • Computationally expensive

Training phase mechanism:
The gradient descent algorithm uses (1) back-propagation and (2) the cost function to find the optimal model parameters.

Steps:

  1. randomly initialize weights
  2. forward propagation – in order to obtain output value for each training example
  3. compute cost function
  4. back propagation – in order to compute partial derivatives
  5. gradient checking – method to verify implementation of back propagation is working properly
  6. use either (a) gradient descent or (b) another built-in optimization function to minimize cost function by iteratively updating weights and biases
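A compact numpy sketch of steps 1–4 and 6 for a one-hidden-layer network; the data, architecture and learning rate are illustrative, and step 5 (gradient checking) is omitted for brevity:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))                  # toy inputs
    y = (X[:, 0] * X[:, 1] > 0).astype(float)      # toy binary labels
    W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)  # step 1: random init
    w2, b2 = rng.normal(size=4), 0.0
    alpha = 0.5

    for _ in range(2000):
        H = sigmoid(X @ W1 + b1)                   # step 2: forward propagation
        y_hat = sigmoid(H @ w2 + b2)
        cost = np.mean((y_hat - y) ** 2)           # step 3: cost (MSE here)
        # step 4: back-propagation -> partial derivatives via the chain rule
        d_z2 = 2 * (y_hat - y) * y_hat * (1 - y_hat) / len(y)
        d_w2, d_b2 = H.T @ d_z2, d_z2.sum()
        d_z1 = np.outer(d_z2, w2) * H * (1 - H)
        d_W1, d_b1 = X.T @ d_z1, d_z1.sum(axis=0)
        # step 6: gradient descent update of weights and biases
        W1 -= alpha * d_W1; b1 -= alpha * d_b1
        w2 -= alpha * d_w2; b2 -= alpha * d_b2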
14
Q

Back-propagation

    • Description, aim, use
    • How it works
A

Description/aim/use:
An algorithm used to iteratively find the optimal model parameters (i.e., the weights and biases) of a neural network based on the training data.

How it works:
Computes partial derivatives of the cost function with respect to the weight and bias values.

MORE INTUITIVE:
When a parameter is unknown, for instance a bias term b_i:
1. Use the chain rule to get the derivative of the loss function (e.g. sum of squared residuals) with respect to that parameter (i.e. d SSR / d b_i)
2. Initialize the unknown parameter to some value
3. Use gradient descent to optimize it
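A tiny worked example of step 1, assuming the simple model y_hat = w*x + b (with a finite-difference check of the derivative, as in gradient checking):

    import numpy as np

    x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
    w, b = 0.5, 0.0

    # chain rule: d SSR / d b = d SSR / d y_hat * d y_hat / d b
    #                         = sum(2 * (y_hat - y)) * 1
    y_hat = w * x + b
    d_ssr_d_b = np.sum(2 * (y_hat - y))            # -18.0

    # numerical check of the same derivative
    ssr = lambda b: np.sum((w * x + b - y) ** 2)
    eps = 1e-6
    approx = (ssr(b + eps) - ssr(b - eps)) / (2 * eps)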

15
Q

(Neural) network architecture

    • What is meant by the term
    • Options & benefits/drawbacks
A

Meaning:
how many hidden layers (and units) to use – i.e. the “connectivity pattern” between the neurons

Options:
More hidden layers can mean better model performance, but at the cost of increased computational complexity.

16
Q

Improving Machine Learning model/algorithm (aka “debugging”)

– options/steps for high-variance and high-bias

A

high variance:

    • more training samples
    • fewer features
    • increasing lambda (regularization hyperparameter)

high bias:

    • more features
    • including polynomial features
    • decreasing lambda (regularization hyperparameter)
17
Q

Evaluating a hypothesis/model performance

A

Split the data into train / cross-validation / test sets: fit the model parameters on the training set, use the cross-validation set to compare models and choose hyperparameters, and report final performance on the held-out test set.

18
Q

Poor performance

    • 2 types of model illnesses
    • how to diagnose
    • effect of using more training examples on each of these illnesses
A

Illnesses:

    • High bias/under-fitting
    • High variance/over-fitting

Diagnose:

    • check error on both train and cross-validation sets
    • If both high –> under-fitting/high bias
    • If training error is low but validation error is high –> over-fitting/high variance

LEARNING CURVES (plots of training and cross-validation error versus training-set size) can help diagnose the problem – see the sketch after this card.

Solutions:

    • High bias: cannot be solved with more training examples
    • High variance: can be improved with more training examples – this will decrease the cross-validation error
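A minimal scikit-learn sketch of a learning curve (the model and dataset are illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = load_breast_cancer(return_X_y=True)
    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=5000), X, y, cv=5)
    # High bias: train and validation scores converge at a low value.
    # High variance: a large, persistent gap between the two curves.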
19
Q

Evaluation metrics for classification models:

Precision
Recall
F1

    • intuition
    • formulas
    • why F1 and not just the mean of the two measures?
A

Precision:
How many of the “positive” labels are correct. E.g. when you say a drink is Coke and not Pepsi, you’re usually right.
TP/(TP+FP)

Recall:
Percentage of actual positives that you correctly identify. I.e. in a group of sodas, the number of Cokes you correctly identify.
TP/(TP+FN)

F1:
Harmonic mean of the two (balances both concerns).
(2PR)/(P+R)

Why F1 instead of just the regular average?
Averaging can sometimes result in a high value even when one of the measures is low. With the harmonic mean, if either is low, F1 will be low.
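A quick sketch straight from the formulas:

    def precision_recall_f1(tp, fp, fn):
        precision = tp / (tp + fp)   # of all predicted positives, fraction correct
        recall = tp / (tp + fn)      # of all actual positives, fraction found
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    precision_recall_f1(tp=8, fp=2, fn=4)   # (0.8, 0.667, 0.727)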

20
Q

Support Vector Machines

    • description
    • objective
    • hyperparameter
    • output
    • versus logistic regression
A

Description:

    • Classifier
    • cost function similar to logistic regression
    • Convex cost function: no problem with local minima/maxima

Objective:
Compute best parameters (thetas) that minimize cost function

Hyperparameter:

    • In the cost function
    • C = 1/lambda
    • large C/small lambda:
  • —> model parameters penalized less
  • —> high complexity
  • —> overfitting
    • small C/large lambda:
  • —> high penalization of model parameters
  • —> decreased model complexity
  • —> underfitting

Output:
– 1 for positive class, 0 for negative class

Versus logistic regression:
– logistic regression ranges from 0 to 1 but is not binary – can take any value in between, representing probability
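A minimal scikit-learn sketch of the C hyperparameter (the dataset and values are illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    # large C (small lambda): weaker penalization -> risks overfitting
    # small C (large lambda): stronger penalization -> risks underfitting
    loose = SVC(C=100.0).fit(X, y)
    tight = SVC(C=0.01).fit(X, y)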

21
Q

Unsupervised learning

    • definition
    • applications
A

Definition:
The data isn’t labeled.
We’re not trying to predict a variable, but rather to discover patterns within the data – e.g. clusters

Applications:
Marketing – group users according to multiple characteristics

22
Q

DBSCAN

    • What type of algorithm? What type of ML?
    • Description
    • Downsides
A

A clustering (unsupervised learning) algorithm.

Select a radius and a point – all points within the radius’s distance are added to the cluster. Repeat for each new point added to the cluster.

Downsides: effectiveness relies heavily on radius choice, doesn’t deal with certain distributions well.
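A minimal scikit-learn sketch (the radius eps and the toy data are illustrative):

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
    labels = DBSCAN(eps=0.3).fit_predict(X)   # eps is the radius; -1 marks noise points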

23
Q

k-means clustering

    • significance
    • description
    • downsides
A

Significance: the most widespread clustering algorithm

Description:

    • Pre-define the number of clusters we want (k)
    • Randomly place k starting points in dataset – called “centroids”
    • Assign each remaining data point to the cluster closest to it
    • Calculate midpoint of each cluster, then redefine the centroid as that midpoint. Reassign each data point to a centroid based on this new setup.
    • Continue for a pre-determined number of iterations (300 is a common default). By the end, centroid movement should be minimal.
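A minimal scikit-learn sketch (k and the toy data are illustrative):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    km = KMeans(n_clusters=3, max_iter=300, n_init=10).fit(X)
    km.cluster_centers_   # final centroids after the assign/recompute loop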
24
Q

Hierarchical clustering

    • Basic description
    • Two types
    • Process of the more common type
A

Description:
A clustering method where clusters are assigned based on hierarchical relationships between data points

Types:

    • Bottom-up (agglomerative): more common – easier, mathematically
    • Top-down (divisive)

Bottom-up process:

  1. Assign each data point to its own cluster, so the number of initial clusters (k) equals the number of data points (n)
  2. Compute distances between clusters
  3. Merge two closest clusters
  4. Continue computing distances/merging until all points are in a single cluster
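A minimal scipy sketch of the bottom-up process (toy data; Ward linkage is one common merge criterion):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 2))
    Z = linkage(X, method="ward")   # repeatedly merges the two closest clusters
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters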