01-Introduction-Terminology Flashcards

1
Q

What is (supervised) machine learning?

A

ML systems learn how to combine input to produce useful predictions on never-before-seen data.

2
Q

Label

A

A label is the thing we're predicting—the y variable in simple linear regression. In a medical example, the label might be the diagnostic category.

3
Q

Feature

A

A feature is an input variable—the x variable in simple linear regression. A simple machine learning project might use a single feature, while a more sophisticated machine learning project could use millions of features, specified as:

{x_1, x_2, ..., x_n}

In the spam detector example, the features could include the following:

  1. words in the email text
  2. sender's address
  3. time of day the email was sent
  4. email contains the phrase "one weird trick"

4
Q

Example

A

An example is a particular instance of data, x. (We put x in boldface to indicate that it is a vector.) We break examples into two categories:

  1. labeled examples
  2. unlabeled examples
5
Q

Model

A

A model defines the relationship between features and label.

For example, a spam detection model might associate certain features strongly with “spam”.

6
Q

Model Lifecycle

A
  1. Training means creating or learning the model. That is, you show the model labeled examples and enable the model to gradually learn the relationships between features and label.
  2. Inference means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions (y’). For example, during inference, you can predict medianHouseValue for new unlabeled examples.
7
Q

Regression vs. classification

A

A regression model predicts continuous values. For example, regression models make predictions that answer questions like the following:

  1. What is the value of a house in California?
  2. What is the probability that a user will click on this ad?

A classification model predicts discrete values. For example, classification models make predictions that answer questions like the following:

  1. Is a given email message spam or not spam?
  2. Is this an image of a dog, a cat, or a hamster?
8
Q

Terminology

A
  1. label
  2. feature
  3. example
  4. training
  5. model
  6. classification model
  7. inference
  8. regression model
9
Q

Regression model: equation

A

y’ = b+Σwi*xi

10
Q

Linear regression: terminology

A
  1. bias
  2. inference
  3. linear regression
  4. weight
  5. L2 loss
11
Q

L2 Loss

A

The L2 loss function minimizes the error, defined as the sum of all the squared differences between the true values and the predicted values:

L_2 = Σ(y - y')^2

MeanSquaredError = L_2 / n, where n is the number of examples.
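A short sketch of both quantities in Python (the labels and predictions are made-up values):

# L2 loss and mean squared error over a batch of examples.
y_true = [3.0, -0.5, 2.0, 7.0]   # hypothetical true labels
y_pred = [2.5,  0.0, 2.0, 8.0]   # hypothetical predictions

l2_loss = sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred))
mse = l2_loss / len(y_true)
print(l2_loss, mse)  # 1.5 0.375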

12
Q

Empirical risk minimization

A

The process by which a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss.

13
Q

Loss: terminology

A
  1. empirical risk minimization
  2. loss
  3. mean squared error
  4. squared loss
  5. training
14
Q

Gradient descent

A

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.
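A minimal sketch of the algorithm in Python (the function, starting point, and learning rate are arbitrary choices for illustration):

# Gradient descent on f(w) = (w - 3)^2, whose gradient is f'(w) = 2(w - 3).
learning_rate = 0.1
w = 0.0  # arbitrary starting point

for _ in range(50):
    gradient = 2 * (w - 3)
    w = w - learning_rate * gradient  # step proportional to the negative gradient

print(w)  # converges toward the minimum at w = 3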

15
Q

Initial value of w1

A

The first stage in gradient descent is to pick a starting value (a starting point) for w1. The starting point doesn’t matter much; therefore, many algorithms simply set w1 to 0 or pick a random value.

16
Q

Step

A

The learning rate is a scalar used to train a model via gradient descent. During each iteration, the gradient descent algorithm multiplies the learning rate by the gradient. The resulting product is called the gradient step.
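In code form (a sketch; all values are hypothetical):

# One gradient step: the learning rate scales the gradient.
learning_rate = 0.01   # hyperparameter
gradient = 4.0         # hypothetical gradient at the current point
w = 2.0                # current weight
gradient_step = learning_rate * gradient
w = w - gradient_step  # move against the gradient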

17
Q

Gradient

A

The vector of partial derivatives with respect to all of the independent variables. In machine learning, the gradient is the vector of partial derivatives of the model function. The gradient points in the direction of steepest ascent.

18
Q

Partial derivative

A

A derivative in which all but one of the variables is considered a constant. For example, the partial derivative of f(x, y) with respect to x is the derivative of f considered as a function of x alone (that is, keeping y constant). The partial derivative of f with respect to x focuses only on how x is changing and ignores all other variables in the equation.

The partial derivative of f with respect to x, denoted ∂f/∂x, is the derivative of f considered as a function of x alone.

To find ∂f/∂x for a function such as f(x, y) = e^(2y) · sin(x), you must hold y constant (so f is now a function of the one variable x) and take the regular derivative of f with respect to x. For example, when y is fixed at 1, the preceding function becomes:

f(x) = e^2 · sin(x)
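A quick numerical check of this idea (a sketch using a finite-difference approximation; not from the source):

import math

# f(x, y) = e^(2y) * sin(x)
def f(x, y):
    return math.exp(2 * y) * math.sin(x)

# Analytic partial derivative with respect to x, holding y constant.
def df_dx(x, y):
    return math.exp(2 * y) * math.cos(x)

# Finite-difference approximation of df/dx at (x, y) = (0.5, 1.0).
h = 1e-6
approx = (f(0.5 + h, 1.0) - f(0.5, 1.0)) / h
print(approx, df_dx(0.5, 1.0))  # the two values agree closely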

19
Q

Hyperparameter

A

The “knobs” that you tweak during successive runs of training a model. For example, learning rate is a hyperparameter.

Contrast with parameter.

20
Q

Learning rate too small

A

If you pick a learning rate that is too small, learning will take too long.

21
Q

Learning rate too large

A

Conversely, if you specify a learning rate that is too large, the next point will perpetually bounce haphazardly across the bottom of the well like a quantum mechanics experiment gone horribly wrong:

[Figure: loss vs. value of weight w_i; each step overshoots the minimum from the starting point.]

Figure 7. Learning rate is too large.
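A small sketch showing both failure modes on f(w) = w^2 (the learning rates are illustrative):

# Gradient descent on f(w) = w^2, whose gradient is 2w.
def descend(learning_rate, steps=20):
    w = 5.0  # starting point
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

print(descend(0.001))  # too small: w barely moves toward the minimum at 0
print(descend(0.1))    # reasonable: w approaches 0
print(descend(1.1))    # too large: w overshoots and diverges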

22
Q

Batch, Stochastic Gradient, Mini-batch, Epoch

A

In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration.

A large data set with randomly sampled examples probably contains redundant data. In fact, redundancy becomes more likely as the batch size grows.

By choosing examples at random, we can noisily estimate a big average from a much smaller one. Stochastic gradient descent (SGD) takes this idea to the extreme: it uses only a single random example per iteration. "Stochastic" indicates that the one example comprising each batch is chosen at random.

Mini-batch SGD is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 random examples. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch.

Epoch: A full training pass over the entire data set such that each example has been seen once. Thus, an epoch represents N/batch_size training iterations, where N is the total number of examples.
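For example (a sketch with made-up sizes), the iteration counts follow directly from the definitions:

# Training iterations per epoch for each strategy (hypothetical sizes).
N = 100000  # total number of examples

for batch_size in (N, 1, 100):  # full-batch, SGD, mini-batch SGD
    print(batch_size, N // batch_size)
# full-batch: 1 iteration per epoch
# SGD: 100,000 iterations per epoch
# mini-batch of 100: 1,000 iterations per epoch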

23
Q

TensorFlow pseudocode for a linear classifier

A
import tensorflow as tf

# Set up a linear classifier.
classifier = tf.estimator.LinearClassifier(feature_columns)

# Train the model on some example data.
classifier.train(input_fn=train_input_fn, steps=2000)

# Use it to predict.
predictions = classifier.predict(input_fn=predict_input_fn)
24
Q

TensorFlow Basics

A
  1. Estimator: An instance of the tf.Estimator class, which encapsulates logic that builds a TensorFlow graph and runs a TensorFlow session.
  2. graph: In TensorFlow, a computation specification. Nodes in the graph represent operations. Edges are directed and represent passing the result of an operation (a Tensor) as an operand to another operation.
  3. tensor: The primary data structure in TensorFlow programs. Tensors are N-dimensional (where N could be very large) data structures, most commonly scalars, vectors, or matrices. The elements of a Tensor can hold integer, floating-point, or string values.
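A minimal sketch of tensors of different ranks (tf.constant is a standard TensorFlow call; the values are arbitrary):

import tensorflow as tf

scalar = tf.constant(3.0)                # 0-D tensor
vector = tf.constant([1.0, 2.0, 3.0])    # 1-D tensor
matrix = tf.constant([[1.0, 2.0],
                      [3.0, 4.0]])       # 2-D tensor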
25
Q

TensorFlow hierarchy

A

  1. TensorFlow Estimators: High-level, object-oriented API.
  2. tf.layers, tf.losses, tf.metrics: Reusable libraries for common model components.
  3. Python TensorFlow: Provides Ops, which wrap C++ kernels.
  4. C++ TensorFlow: Kernels.
  5. CPU, GPU, TPU: Kernels work on one or more platforms.

27
Q

Bias and weight

A

y’ = b + Σwixi

y’ is the predicted label (a desired output).

b is the bias (the y-intercept), sometimes referred to as w0.

w1 is the weight of feature 1. Weight is the same concept as the “slope” m in the traditional equation of a line.

x1 is a feature (a known input).

28
Q

Empirical risk minimization vs structural risk minimization

A

ERM: Choosing the function that minimizes loss on the training set

SRM:

An algorithm that balances two goals:

  1. The desire to build the most predictive model (for example, lowest loss).
  2. The desire to keep the model as simple as possible (for example, strong regularization).

For example, a function that minimizes loss + regularization on the training set is a structural risk minimization algorithm.
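In sketch form (illustrative Python; the function names and the L2 penalty with coefficient lam are assumptions, not from the source), the two objectives differ only by the regularization term:

# ERM: minimize training loss alone.
def erm_objective(loss):
    return loss

# SRM: minimize training loss plus a penalty on model complexity.
def srm_objective(loss, weights, lam=0.01):
    l2_penalty = sum(w ** 2 for w in weights)
    return loss + lam * l2_penalty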