01-Introduction-Terminology Flashcards

1
Q

What is (supervised) machine learning?

A

ML systems learn how to combine input to produce useful predictions on never-before-seen data.

2
Q

Label

A

A label is the thing we're predicting—the y variable in simple linear regression. In a medical example, the label might be the diagnostic category.

3
Q

Feature

A

A feature is an input variable—the x variable in simple linear regression. A simple machine learning project might use a single feature, while a more sophisticated machine learning project could use millions of features, specified as:

{x_1, x_2, ..., x_n}

In the spam detector example, the features could include the following:

  1. words in the email text
  2. sender's address
  3. time of day the email was sent
  4. email contains the phrase "one weird trick"

4
Q

Example

A

An example is a particular instance of data, x. (We put x in boldface to indicate that it is a vector.) We break examples into two categories:

  1. labeled examples
  2. unlabeled examples
5
Q

Model

A

A model defines the relationship between features and label.

For example, a spam detection model might associate certain features strongly with “spam”.

6
Q

Model Lifecycle

A
  1. Training means creating or learning the model. That is, you show the model labeled examples and enable the model to gradually learn the relationships between features and label.
  2. Inference means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions (y’). For example, during inference, you can predict medianHouseValue for new unlabeled examples.
7
Q

Regression vs. classification

A

A regression model predicts continuous values. For example, regression models make predictions that answer questions like the following:

  1. What is the value of a house in California?
  2. What is the probability that a user will click on this ad?

A classification model predicts discrete values. For example, classification models make predictions that answer questions like the following:

  1. Is a given email message spam or not spam?
  2. Is this an image of a dog, a cat, or a hamster?
8
Q

Terminology

A
  1. label
  2. feature
  3. example
  4. training
  5. model
  6. classification model
  7. inference
  8. regression model
9
Q

Regression model: equation

A

y’ = b+Σwi*xi

10
Q

Linear regression: terminology

A
  1. bias
  2. inference
  3. linear regression
  4. weight
  5. L2 loss
11
Q

L2 Loss

A

The L2 loss function minimizes the error, defined as the sum of all the squared differences between the true values and the predicted values:

L_2 = Σ(y - y')^2

MeanSquaredError = L_2 / n, where n is the number of examples.
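A short sketch of both quantities in Python (the labels and predictions are made-up values):

# L2 loss and mean squared error over a batch of examples.
y_true = [3.0, -0.5, 2.0, 7.0]   # hypothetical true labels
y_pred = [2.5,  0.0, 2.0, 8.0]   # hypothetical predictions

l2_loss = sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred))
mse = l2_loss / len(y_true)
print(l2_loss, mse)  # 1.5 0.375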

12
Q

Empirical risk minimization

A

The process by which a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss.

13
Q

Loss: terminology

A
  1. empirical risk minimization
  2. loss
  3. mean squared error
  4. squared loss
  5. training
14
Q

Gradient descent

A

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.
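A minimal sketch of the algorithm in Python (the function, starting point, and learning rate are arbitrary choices for illustration):

# Gradient descent on f(w) = (w - 3)^2, whose gradient is f'(w) = 2(w - 3).
learning_rate = 0.1
w = 0.0  # arbitrary starting point

for _ in range(50):
    gradient = 2 * (w - 3)
    w = w - learning_rate * gradient  # step proportional to the negative gradient

print(w)  # converges toward the minimum at w = 3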

15
Q

Initial value of w1

A

The first stage in gradient descent is to pick a starting value (a starting point) for w1. The starting point doesn’t matter much; therefore, many algorithms simply set w1 to 0 or pick a random value.

16
Q

Step

A

The learning rate is a scalar used to train a model via gradient descent. During each iteration, the gradient descent algorithm multiplies the learning rate by the gradient. The resulting product is called the gradient step.
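In code form (a sketch; all values are hypothetical):

# One gradient step: the learning rate scales the gradient.
learning_rate = 0.01   # hyperparameter
gradient = 4.0         # hypothetical gradient at the current point
w = 2.0                # current weight
gradient_step = learning_rate * gradient
w = w - gradient_step  # move against the gradient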

17
Q

Gradient

A

The vector of partial derivatives with respect to all of the independent variables. In machine learning, the gradient is the vector of partial derivatives of the model function. The gradient points in the direction of steepest ascent.

18
Q

Partial derivative

A

A derivative in which all but one of the variables is considered a constant. For example, the partial derivative of f(x, y) with respect to x is the derivative of f considered as a function of x alone (that is, keeping y constant). The partial derivative of f with respect to x focuses only on how x is changing and ignores all other variables in the equation.

The partial derivative of f with respect to x, denoted ∂f/∂x, is the derivative of f considered as a function of x alone.

To find ∂f/∂x for a function such as f(x, y) = e^(2y) · sin(x), you must hold y constant (so f is now a function of the one variable x) and take the regular derivative of f with respect to x. For example, when y is fixed at 1, the preceding function becomes:

f(x) = e^2 · sin(x)
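A quick numerical check of this idea (a sketch using a finite-difference approximation; not from the source):

import math

# f(x, y) = e^(2y) * sin(x)
def f(x, y):
    return math.exp(2 * y) * math.sin(x)

# Analytic partial derivative with respect to x, holding y constant.
def df_dx(x, y):
    return math.exp(2 * y) * math.cos(x)

# Finite-difference approximation of df/dx at (x, y) = (0.5, 1.0).
h = 1e-6
approx = (f(0.5 + h, 1.0) - f(0.5, 1.0)) / h
print(approx, df_dx(0.5, 1.0))  # the two values agree closely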

19
Q

Hyperparameter

A

The “knobs” that you tweak during successive runs of training a model. For example, learning rate is a hyperparameter.

Contrast with parameter.

20
Q

Learning rate too small

A

If you pick a learning rate that is too small, learning will take too long.

21
Q

Learning rate too large

A

Conversely, if you specify a learning rate that is too large, the next point will perpetually bounce haphazardly across the bottom of the well like a quantum mechanics experiment gone horribly wrong:

[Figure: loss vs. value of weight w_i; each step overshoots the minimum from the starting point.]

Figure 7. Learning rate is too large.
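A small sketch showing both failure modes on f(w) = w^2 (the learning rates are illustrative):

# Gradient descent on f(w) = w^2, whose gradient is 2w.
def descend(learning_rate, steps=20):
    w = 5.0  # starting point
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

print(descend(0.001))  # too small: w barely moves toward the minimum at 0
print(descend(0.1))    # reasonable: w approaches 0
print(descend(1.1))    # too large: w overshoots and diverges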

22
Q

Batch, Stochastic Gradient, Mini-batch, Epoch

A

In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration.

A large data set with randomly sampled examples probably contains redundant data. In fact, redundancy becomes more likely as the batch size grows.

By choosing examples at random, we can noisily estimate a big average from a much smaller one. Stochastic gradient descent (SGD) takes this idea to the extreme: it uses only a single random example per iteration. "Stochastic" indicates that the one example comprising each batch is chosen at random.

Mini-batch SGD is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 random examples. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch.

Epoch: A full training pass over the entire data set such that each example has been seen once. Thus, an epoch represents N/batch_size training iterations, where N is the total number of examples.
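For example (a sketch with made-up sizes), the iteration counts follow directly from the definitions:

# Training iterations per epoch for each strategy (hypothetical sizes).
N = 100000  # total number of examples

for batch_size in (N, 1, 100):  # full-batch, SGD, mini-batch SGD
    print(batch_size, N // batch_size)
# full-batch: 1 iteration per epoch
# SGD: 100,000 iterations per epoch
# mini-batch of 100: 1,000 iterations per epoch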

23
Q

TensorFlow pseudocode for a linear classifier

A
import tensorflow as tf

# Set up a linear classifier.
classifier = tf.estimator.LinearClassifier(feature_columns)

# Train the model on some example data.
classifier.train(input_fn=train_input_fn, steps=2000)

# Use it to predict.
predictions = classifier.predict(input_fn=predict_input_fn)
24
Q

TensorFlow Basics

A
  1. Estimator: An instance of the tf.Estimator class, which encapsulates logic that builds a TensorFlow graph and runs a TensorFlow session.
  2. graph: In TensorFlow, a computation specification. Nodes in the graph represent operations. Edges are directed and represent passing the result of an operation (a Tensor) as an operand to another operation.
  3. tensor: The primary data structure in TensorFlow programs. Tensors are N-dimensional (where N could be very large) data structures, most commonly scalars, vectors, or matrices. The elements of a Tensor can hold integer, floating-point, or string values.
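A minimal sketch of tensors of different ranks (tf.constant is a standard TensorFlow call; the values are arbitrary):

import tensorflow as tf

scalar = tf.constant(3.0)                # 0-D tensor
vector = tf.constant([1.0, 2.0, 3.0])    # 1-D tensor
matrix = tf.constant([[1.0, 2.0],
                      [3.0, 4.0]])       # 2-D tensor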
25
Q

TensorFlow hierarchy

A

  1. TensorFlow Estimators: High-level, object-oriented API.
  2. tf.layers, tf.losses, tf.metrics: Reusable libraries for common model components.
  3. Python TensorFlow: Provides Ops, which wrap C++ kernels.
  4. C++ TensorFlow: Kernels.
  5. CPU, GPU, TPU: Kernels work on one or more platforms.

27
Q

Bias and weight

A

y’ = b + Σwixi

y’ is the predicted label (a desired output).

b is the bias (the y-intercept), sometimes referred to as w0.

w1 is the weight of feature 1. Weight is the same concept as the “slope” m in the traditional equation of a line.

x1 is a feature (a known input).

28
Q

Empirical risk minimization vs structural risk minimization

A

ERM: Choosing the function that minimizes loss on the training set

SRM:

An algorithm that balances two goals:

  1. The desire to build the most predictive model (for example, lowest loss).
  2. The desire to keep the model as simple as possible (for example, strong regularization).

For example, a function that minimizes loss + regularization on the training set is a structural risk minimization algorithm.
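In sketch form (illustrative Python; the function names and the L2 penalty with coefficient lam are assumptions, not from the source), the two objectives differ only by the regularization term:

# ERM: minimize training loss alone.
def erm_objective(loss):
    return loss

# SRM: minimize training loss plus a penalty on model complexity.
def srm_objective(loss, weights, lam=0.01):
    l2_penalty = sum(w ** 2 for w in weights)
    return loss + lam * l2_penalty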