Terms Flashcards

1
Q

Supervised Learning

A

In supervised learning, a learning algorithm (model) is trained on labeled data. The algorithm learns from this labeled data in order to make predictions on unseen data.

Includes:

  • Regression
  • Classification
2
Q

Linear Regression – objective

A

Given a labeled training set of m training examples, the objective is to find the model parameters that minimize the cost function.

We can find the values of the parameters with gradient descent (see the cost-function sketch below).
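For intuition, a minimal numpy sketch of the usual squared-error cost over m training examples (the card doesn't write the cost out; the variable names here are illustrative):

    import numpy as np

    def cost(theta0, theta1, X, y):
        """Squared-error cost J over m training examples for a simple
        linear model h(x) = theta0 + theta1 * x."""
        m = len(X)
        predictions = theta0 + theta1 * X
        return (1 / (2 * m)) * np.sum((predictions - y) ** 2)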

3
Q

Gradient Descent:

Purpose
Algorithm components (2)
Convergence

A

Purpose: Gradient descent is a general algorithm used to minimize many different types of functions.

“Gradient” – the vector of partial derivatives of a differentiable function of multiple variables

Choose random input values to start, then move along the curve toward the best spot – taking big steps when we’re far from it and tiny steps as we get closer. We can tell how far away we are by looking at the slope of the Sum of Squared Residuals curve – we are close to the optimal value when the slope is close to 0. So we determine the size of each step by multiplying the slope by a learning rate (alpha).

  • -> here, the sum of squared residuals is the “Loss Function”
  • -> Loss function depends on slope and intercept – get partial derivatives of loss function with respect to both slope and intercept

Whichever loss function you use, the gradient descent process doesn’t change.

Gradient descent stops when either:

(1) the step size is very close to zero. (Versus least squares, which chooses point where slope is zero.)
(2) the pre-determined maximum number of steps has been taken (which might cut the process off before we reach the optimal point)

Steps:

  1. Take the gradient of the loss function (i.e., the partial derivative with respect to each parameter).
  2. Pick random values for the parameters
  3. Plug parameters into the gradient (the derivative)
  4. Calculate step sizes: step size = slope*learning rate
  5. Calculate new parameters: new = old - step size

Repeat steps 3, 4, 5 until the step sizes are small or the maximum number of steps has been reached. (A sketch of these steps follows.)
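A minimal numpy sketch of these steps for the slope/intercept example, assuming the sum of squared residuals as the loss; the learning rate and stopping thresholds are illustrative:

    import numpy as np

    def gradient_descent(X, y, alpha=0.001, max_steps=1000, tol=1e-6):
        intercept, slope = np.random.randn(2)        # step 2: random start
        for _ in range(max_steps):                   # stop condition (2)
            residuals = (intercept + slope * X) - y
            # steps 1 & 3: plug the current parameters into the partial
            # derivatives of the sum of squared residuals
            d_intercept = 2 * np.sum(residuals)
            d_slope = 2 * np.sum(residuals * X)
            # step 4: step size = slope of the loss curve * learning rate
            step_i, step_s = alpha * d_intercept, alpha * d_slope
            intercept -= step_i                      # step 5: new = old - step size
            slope -= step_s
            if max(abs(step_i), abs(step_s)) < tol:  # stop condition (1)
                break
        return intercept, slope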

Stochastic gradient descent:

    • For when there are a large number of data points
    • For each step, uses a random sample of the data points
    • Reduces time spent calculating the derivatives of the loss function

Two components of the algorithm:

  • Learning rate (alpha): determines size of step when updating model parameters
  • -> If too small, then it is too slow/takes a long time to converge.
  • -> If too large, we might overshoot and even diverge.
  • Partial derivatives of cost function: gives us the direction of the steepest slope.

When does convergence occur:
Plot the cost function against the number of iterations performed by the algorithm – convergence is reached when the curve flattens out and the cost stops decreasing.

4
Q

Feature scaling

A

Techniques that adjust the ranges of the features so they are closer to each other.

This helps the gradient descent algorithm converge more quickly, with fewer iterations.

If range of values is too small: scale up.

If range of values is too large: scale down.

One technique for feature scaling is Mean Normalization.

5
Q

Mean Normalization

A

A feature scaling technique: subtract the feature’s mean, then divide by its range.

(X - mean(X)) / (X_max - X_min)
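A one-line numpy sketch (feature columns assumed in X):

    import numpy as np

    def mean_normalize(X):
        """Center each feature on 0 and divide by its range,
        giving values roughly in [-1, 1]."""
        return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))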

6
Q

Logistic Regression:

  • what type of algorithm?
  • what activation function?
  • cost function?
A

Type of algorithm: classification

Activation function:
Sigmoid (aka Logistic)

Cost function:
- Different from the linear regression cost function (using that one here would give a wavy, non-convex form with many local optima); see the sketch below.
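The standard convex choice is the log loss (cross-entropy); a minimal numpy sketch:

    import numpy as np

    def log_loss(y, y_hat, eps=1e-12):
        """Cross-entropy cost: convex in the parameters, so gradient
        descent has a single optimum instead of many local ones."""
        y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
        return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))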

7
Q

Sigmoid function/Logistic Function

A

An activation function in neural network training.

Used for binary classification – converts our output to values between 0 and 1.

f(X) = 1/(1 + e^(-X))

Large positive inputs map toward 1; large negative inputs map toward 0.

Often interpreted as the probability of a positive classification.
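As a quick numpy sketch:

    import numpy as np

    def sigmoid(x):
        """f(x) = 1 / (1 + e^(-x)) squashes any real input into (0, 1)."""
        return 1 / (1 + np.exp(-x))

    sigmoid(np.array([-5.0, 0.0, 5.0]))   # ~[0.007, 0.5, 0.993]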

8
Q
Decision boundaries
(binary, multi-class, nonlinear)
A

In binary classification – the line that separates the positive and negative examples in the dataset

In multi-class classification – this may be a set of lines dividing the categories

Non-linear decision boundaries might help obtain better models

9
Q

multi-class classification

A

A “one-vs-all” (aka “one-vs-rest”) technique – trains n binary logistic classifiers for the n different classes in the dataset.

Sets the labels of a single class to positive and the labels of all others to negative.
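A minimal scikit-learn sketch (the dataset here is illustrative):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    X, y = load_iris(return_X_y=True)   # 3 classes -> 3 binary classifiers
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
    clf.predict(X[:5])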

10
Q

Overfitting:

  • definition, other names, when it occurs
  • solutions
A

aka “high variance” – when the model fits the training data too closely and fails to generalize to new data; often happens when we have too many features in our dataset

Solutions:

  1. Reduce number of features. Two ways:
    - -> Manually select which to keep
    - -> Use a model-selection algorithm
  2. Regularization – helpful when we have a lot of slightly-useful features.
11
Q

Underfitting:

  • definition/other names
  • what is its opposite?
A

aka “high bias” – when the model is too simple to capture the pattern in the data (e.g. too few features).

The opposite of overfitting/”high variance”, when the model is too complex or has too many features.

12
Q

Neural networks

    • description
    • basic structure
    • activation functions
    • applications
A

Description:
A supervised learning algorithm. A good option when a linear classifier doesn’t work.

Basic structure:

  1. input layer
  2. hidden layer
  3. output layer

Activation functions:

    • computed in the hidden layer
    • two most common:
      1. Sigmoid
      2. ReLU (rectified linear unit)

Applications:

    • binary classification
    • multiclass classification
    • solving regression problems
    • character recognition, image compression, prediction problems
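A minimal numpy sketch of the three-layer structure with a ReLU hidden layer (shapes and names are illustrative):

    import numpy as np

    def relu(z):
        return np.maximum(0, z)

    def forward(x, W1, b1, W2, b2):
        """Input layer -> hidden layer (activation) -> output layer."""
        hidden = relu(W1 @ x + b1)   # activation computed in the hidden layer
        return W2 @ hidden + b2      # add a sigmoid here for binary classification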
13
Q

Training a neural network:

    • basic description, aim
    • training phase algorithms & functions
    • steps
A

Description/aim:

    • finding the best model parameters (weights & biases) that minimize error
    • Iterative process
    • Computationally expensive

Training phase mechanism:
The gradient descent algorithm uses (1) back-propagation and (2) the cost function to find the optimal model parameters.

Steps:

  1. randomly initialize weights
  2. forward propagation – in order to obtain output value for each training example
  3. compute cost function
  4. back propagation – in order to compute partial derivatives
  5. gradient checking – method to verify implementation of back propagation is working properly
  6. use either (a) gradient descent or (b) another built-in optimization function to minimize cost function by iteratively updating weights and biases
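A compact numpy sketch of steps 1–4 and 6 for a one-hidden-layer network; the data, architecture and learning rate are illustrative, and step 5 (gradient checking) is omitted for brevity:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))                  # toy inputs
    y = (X[:, 0] * X[:, 1] > 0).astype(float)      # toy binary labels
    W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)  # step 1: random init
    w2, b2 = rng.normal(size=4), 0.0
    alpha = 0.5

    for _ in range(2000):
        H = sigmoid(X @ W1 + b1)                   # step 2: forward propagation
        y_hat = sigmoid(H @ w2 + b2)
        cost = np.mean((y_hat - y) ** 2)           # step 3: cost (MSE here)
        # step 4: back-propagation -> partial derivatives via the chain rule
        d_z2 = 2 * (y_hat - y) * y_hat * (1 - y_hat) / len(y)
        d_w2, d_b2 = H.T @ d_z2, d_z2.sum()
        d_z1 = np.outer(d_z2, w2) * H * (1 - H)
        d_W1, d_b1 = X.T @ d_z1, d_z1.sum(axis=0)
        # step 6: gradient descent update of weights and biases
        W1 -= alpha * d_W1; b1 -= alpha * d_b1
        w2 -= alpha * d_w2; b2 -= alpha * d_b2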
14
Q

Back-propagation

    • Description, aim, use
    • How it works
A

Description/aim/use:
An algorithm used to iteratively find the optimal model parameters (i.e., the weights and biases) of a neural network based on the training data.

How it works:
Computes partial derivatives of the cost function with respect to the weight and bias values.

MORE INTUITIVE:
When a parameter is unknown, for instance a bias term b_i:
1. Use the chain rule to get the derivative of the loss function (e.g. sum of squared residuals) with respect to that parameter (i.e. d SSR / d b_i)
2. Initialize the unknown parameter to some value
3. Use gradient descent to optimize it
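A tiny worked example of step 1, assuming the simple model y_hat = w*x + b (with a finite-difference check of the derivative, as in gradient checking):

    import numpy as np

    x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
    w, b = 0.5, 0.0

    # chain rule: d SSR / d b = d SSR / d y_hat * d y_hat / d b
    #                         = sum(2 * (y_hat - y)) * 1
    y_hat = w * x + b
    d_ssr_d_b = np.sum(2 * (y_hat - y))            # -18.0

    # numerical check of the same derivative
    ssr = lambda b: np.sum((w * x + b - y) ** 2)
    eps = 1e-6
    approx = (ssr(b + eps) - ssr(b - eps)) / (2 * eps)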

15
Q

(Neural) network architecture

    • What is meant by the term
    • Options & benefits/drawbacks
A

Meaning:
how many hidden layers (and units) to use – i.e. the “connectivity pattern” between the neurons

Options:
More hidden layers can mean better model performance, but at the cost of increased computational complexity.

16
Q

Improving Machine Learning model/algorithm (aka “debugging”)

– options/steps for high-variance and high-bias

A

high variance:

    • more training samples
    • fewer features
    • increasing lambda (regularization hyperparameter)

high bias:

    • more features
    • including polynomial features
    • decreasing lambda (regularization hyperparameter)
17
Q

Evaluating a hypothesis/model performance

A

Split the data into train / cross-validation / test sets: fit the model parameters on the training set, use the cross-validation set to compare models and choose hyperparameters, and report final performance on the held-out test set.

18
Q

Poor performance

    • 2 types of model illnesses
    • how to diagnose
    • effect of using more training examples on each of these illnesses
A

Illnesses:

    • High bias/under-fitting
    • High variance/over-fitting

Diagnose:

    • check error on both train and cross-validation sets
    • If both high –> under-fitting/high bias
    • If training error is low but validation error is high –> over-fitting/high variance

LEARNING CURVES (plots of training and cross-validation error versus training-set size) can help diagnose the problem – see the sketch after this card.

Solutions:

    • High bias: cannot be solved with more training examples
    • High variance: can be improved with more training examples – this will decrease the cross-validation error
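A minimal scikit-learn sketch of a learning curve (the model and dataset are illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = load_breast_cancer(return_X_y=True)
    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=5000), X, y, cv=5)
    # High bias: train and validation scores converge at a low value.
    # High variance: a large, persistent gap between the two curves.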
19
Q

Evaluation metrics for classification models:

Precision
Recall
F1

    • intuition
    • formulas
    • why F1 and not just the mean of the two measures?
A

Precision:
How many of the “positive” labels are correct. E.g. when you say a drink is Coke and not Pepsi, you’re usually right.
TP/(TP+FP)

Recall:
Percentage of actual positives that you correctly identify. I.e. in a group of sodas, the number of Cokes you correctly identify.
TP/(TP+FN)

F1:
Harmonic mean of the two (balances both concerns).
(2PR)/(P+R)

Why F1 instead of just the regular average?
Averaging can sometimes result in a high value even when one of the measures is low. With the harmonic mean, if either is low, F1 will be low.
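A quick sketch straight from the formulas:

    def precision_recall_f1(tp, fp, fn):
        precision = tp / (tp + fp)   # of all predicted positives, fraction correct
        recall = tp / (tp + fn)      # of all actual positives, fraction found
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    precision_recall_f1(tp=8, fp=2, fn=4)   # (0.8, 0.667, 0.727)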

20
Q

Support Vector Machines

    • description
    • objective
    • hyperparameter
    • output
    • versus logistic regression
A

Description:

    • Classifier
    • cost function similar to logistic regression
    • Convex cost function: no problem with local minima/maxima

Objective:
Compute best parameters (thetas) that minimize cost function

Hyperparameter:

    • In the cost function
    • C = 1/lambda
    • large C/small lambda:
  • —> model parameters penalized less
  • —> high complexity
  • —> overfitting
    • small C/large lambda:
  • —> high penalization of model parameters
  • —> decreased model complexity
  • —> underfitting

Output:
– 1 for positive class, 0 for negative class

Versus logistic regression:
– logistic regression ranges from 0 to 1 but is not binary – can take any value in between, representing probability
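A minimal scikit-learn sketch of the C hyperparameter (the dataset and values are illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    # large C (small lambda): weaker penalization -> risks overfitting
    # small C (large lambda): stronger penalization -> risks underfitting
    loose = SVC(C=100.0).fit(X, y)
    tight = SVC(C=0.01).fit(X, y)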

21
Q

Unsupervised learning

    • definition
    • applications
A

Definition:
The data isn’t labeled.
We’re not trying to predict a variable, but rather to discover patterns within the data – e.g. clusters

Applications:
Marketing – group users according to multiple characteristics

22
Q

DBSCAN

    • What type of algorithm? What type of ML?
    • Description
    • Downsides
A

A clustering (unsupervised learning) algorithm.

Select a radius and a point – all points within the radius’s distance are added to the cluster. Repeat for each new point added to the cluster.

Downsides: effectiveness relies heavily on radius choice, doesn’t deal with certain distributions well.
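A minimal scikit-learn sketch (the radius eps and the toy data are illustrative):

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
    labels = DBSCAN(eps=0.3).fit_predict(X)   # eps is the radius; -1 marks noise points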

23
Q

k-means clustering

    • significance
    • description
    • downsides
A

Significance: the most widespread clustering algorithm

Description:

    • Pre-define the number of clusters we want (k)
    • Randomly place k starting points in dataset – called “centroids”
    • Assign each remaining data point to the cluster closest to it
    • Calculate midpoint of each cluster, then redefine the centroid as that midpoint. Reassign each data point to a centroid based on this new setup.
    • Continue for a pre-determined number of iterations (300 is a common default). By the end, centroid movement should be minimal.
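A minimal scikit-learn sketch (k and the toy data are illustrative):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    km = KMeans(n_clusters=3, max_iter=300, n_init=10).fit(X)
    km.cluster_centers_   # final centroids after the assign/recompute loop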
24
Q

Hierarchical clustering

    • Basic description
    • Two types
    • Process of the more common type
A

Description:
A clustering method where clusters are assigned based on hierarchical relationships between data points

Types:

    • Bottom-up (agglomerative): more common – easier, mathematically
    • Top-down (divisive)

Bottom-up process:

  1. Assign each data point to its own cluster, so the number of initial clusters (k) equals the number of data points (n)
  2. Compute distances between clusters
  3. Merge two closest clusters
  4. Continue computing distances/merging until all points are in a single cluster
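A minimal scipy sketch of the bottom-up process (toy data; Ward linkage is one common merge criterion):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 2))
    Z = linkage(X, method="ward")   # repeatedly merges the two closest clusters
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters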