Machine Learning Basics Flashcards

1
Q

What is a machine learning algorithm?

A

An algorithm that is able to learn from data.

2
Q

Name the most common machine learning tasks.

A

The common tasks are classification, classification with missing inputs, regression, transcription, machine translation, structured output, anomaly detection, synthesis and sampling, imputation of missing values, denoising, and density estimation or probability mass function estimation.

3
Q

What is the classification machine learning task?

A

A task that categorizes inputs into k separate categories. The algorithm produces a function that takes in a vector of n real values and outputs which of the k categories the input belongs to. Alternatively, the output of the function may be a probability distribution over the categories.
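Compactly, the learned function maps an n-dimensional real-valued input to one of the k categories:

f : \mathbb{R}^n \to \{1, \dots, k\}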

4
Q

What is the classification with missing inputs machine learning task?

A

This is a classification task where not every input value is guaranteed to be present in every example. A naive approach is to learn a set of functions, one for each possible subset of missing inputs, which requires 2^n functions for n inputs. A more sophisticated approach is to learn the joint probability distribution over all inputs and marginalize out the missing ones, so only one joint distribution must be learned.
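One way to write the marginalization step (splitting the input into observed and missing parts, x_obs and x_mis, is my notation):

p(y, x_{\text{obs}}) = \sum_{x_{\text{mis}}} p(y, x_{\text{obs}}, x_{\text{mis}})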

5
Q

What is the regression machine learning task?

A

This machine learning task takes a vector of real numbers and produces a single real number. It predicts a numerical value given some input.
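Compactly, the learned function has the form:

f : \mathbb{R}^n \to \mathbb{R}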

6
Q

What is the transcription machine learning task?

A

This machine learning task takes in relatively unstructured signal data, such as audio or images, and produces textual information. For example, audio waveforms turned into words, or images of addresses turned into text.

7
Q

What is the machine translation machine learning task?

A

This machine learning task takes a sequence of symbols in one language and translates it into a sequence of symbols in another language. Think of natural language translation.

8
Q

What is the structured output machine learning task?

A

This machine learning task covers all tasks where the output is a vector, or another collection of data, in which the relationships between elements are important. It includes the transcription and translation tasks. Think of an image captioning system, where the words must form a valid sentence.

9
Q

What is the Anomaly detection machine learning task?

A

This machine learning task sifts through a set of objects or events and identifies which ones are anomalous. Think of credit card fraud detection.

10
Q

What is the synthesis and sampling machine learning task?

A

This is a machine learning task where the algorithm generates new examples that are similar to those in the provided data set.

11
Q

What is the imputation of missing values machine learning task?

A

In this machine learning task, the algorithm is given a new example with some entries missing and must fill in (impute) the missing values.

12
Q

What is the denoising machine learning task?

A

This machine learning task takes in a corrupted (noisy) signal and predicts its corresponding clean signal.

13
Q

What is the density estimation or probability mass function estimation machine learning task?

A

This machine learning task learns a probability density function (for continuous inputs) or a probability mass function (for discrete inputs) over the space of all inputs.

14
Q

What are unsupervised learning algorithms?

A

An algorithm that experiences the entire data set, without labels, and identifies useful properties and structure. Typically the goal is the entire probability distribution that generated the data set, though the distribution may be learned only implicitly, as in the denoising task. Other times, the algorithm clusters similar data.

15
Q

What are supervised machine learning algorithms?

A

An algorithm that experiences a data set in which every example is paired with a label or target. Classification is a common example of supervised machine learning.

16
Q

Describe supervised and unsupervised algorithms in terms of the corresponding probability function they are attempting to predict.

A

Supervised algorithms predict the output given some input, i.e., a conditional probability. Unsupervised algorithms predict the probability distribution governing the underlying data.
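In symbols, for input vector x and output y, the two settings estimate:

p(y \mid \mathbf{x}) \ \text{(supervised)} \qquad p(\mathbf{x}) \ \text{(unsupervised)}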

17
Q

What is the mean squared error of a predictor?

A

A function that evaluates the error between a predicted vector and an actual vector. It compares corresponding components of the two vectors, summing the squares of their differences and dividing by the number of components.

If we have two vectors y and x, the mean squared error is the squared Euclidean distance of y - x divided by the number of elements m. The squared Euclidean distance is just the squared L2 norm, i.e., the sum over all components of the squared difference between y and x. The total is

\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} (y_i - x_i)^2

See Mean Squared Error of a Predictor.
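A minimal NumPy sketch of the same computation (the helper name mse is my own):

    import numpy as np

    def mse(y, x):
        # Mean squared error between two equal-length vectors.
        diff = np.asarray(y) - np.asarray(x)
        return np.mean(diff ** 2)  # (1/m) * sum of squared differences

    # Example: mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]) -> 0.41666...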

18
Q

What are the i.i.d. assumptions?

A

The assumptions that the examples in a data set are independent of one another and that the training set and test set are identically distributed, meaning they come from the same probability distribution. This allows the training and test data to be treated as draws from the same underlying data-generating distribution.

19
Q

What is underfitting?

A

When a model does not achieve a low enough error value for the training data.

20
Q

What is overfitting?

A

When the gap in error value between the training data and test data is too large.

21
Q

What is the hypothesis space of a machine learning model?

A

The set of functions the machine learning model is allowed to select from as the solution. For a linear regression model, this space is all linear functions of its input.

22
Q

What is the capacity of a machine learning model?

A

A model’s ability to fit a variety of functions. More capacity means more functions are available to it.

23
Q

How is capacity, underfitting, and overfitting related?

A

Models with too much capacity can overfit by memorizing properties of the training data that do not generalize to the test data, while models with too little capacity can underfit by failing to produce a function that matches the training data.

24
Q

How are capacity and hypothesis space related?

A

Adding more variety to the hypothesis space (for example, polynomials instead of only linear functions) increases the model's capacity.

25
Q

What is the key result of statistical learning theory that affects capacity, number of training examples, and machine learning models?

A

The capacity of the model provides an upper bound on the discrepancy between training and test error. Higher capacity means a higher upper bound, and the bound shrinks as the number of training examples increases. It is important to note that the capacity of a deep learning model is difficult and impractical to compute exactly.

26
Q

What is the Bayes error?

A

Imagine a perfect model that knows the true probability distribution used to generate the data. Even this oracle would make some errors, because the distribution itself may be noisy, i.e., the output may not be fully determined by the input. The error incurred by predicting from the true distribution is the Bayes error.

27
Q

What is the no free lunch theorem?

A

A theorem stating that, averaged over all possible data-generating distributions, every classification algorithm has the same error rate on unseen data points. In other words, you cannot build a universal machine learning algorithm that solves every type of classification task for all possible data sets. Instead, one should aim to make a machine learning algorithm that produces the best results on the kinds of real-world input it is most likely to encounter.

28
Q

What is weight decay?

A

Adjust the cost function to account for the total size of the weights by adding the squared L2 norm of the weights times a hyperparameter lambda. A higher lambda punishes larger weights (a high slope in the case of linear regression), while a lambda near zero allows for larger weights. Another way to think of this is that fewer features are effectively utilized.
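Written out for linear regression, with weight vector w and regularization hyperparameter lambda, the modified criterion is:

J(\mathbf{w}) = \mathrm{MSE}_{\text{train}} + \lambda \mathbf{w}^\top \mathbf{w}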

29
Q

Describe regularization

A

Any modification made to a learning algorithm that is intended to reduce its generalization error but not its training error.

30
Q

What is the validation set?

A

A set of examples disjoint from the training and test sets. It is used to estimate the generalization error of the model and to adjust the hyperparameters accordingly.

31
Q

What is the k-fold cross-validation algorithm?

A

This algorithm is used when too little data is available to properly estimate the generalization error of a machine learning algorithm with a standard train/test split. Break the data into k non-overlapping subsets. Perform k trials, using the i-th subset as the test set on trial i and the remaining data for training. Average the test error across the k trials to approximate the generalization error.
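A minimal sketch of the procedure in plain NumPy (the fit/predict model interface and the function name k_fold_error are my own assumptions):

    import numpy as np

    def k_fold_error(model, X, y, k=5):
        # Estimate generalization error by k-fold cross-validation.
        indices = np.arange(len(X))
        folds = np.array_split(indices, k)  # k non-overlapping subsets
        errors = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            model.fit(X[train_idx], y[train_idx])  # train on the other k-1 folds
            pred = model.predict(X[test_idx])      # evaluate on the held-out fold
            errors.append(np.mean((pred - y[test_idx]) ** 2))
        return np.mean(errors)  # average test error over the k trials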

32
Q

What is a point estimator or statistic?

A

A function that attempts to provide the best estimate of some quantity of interest, for example the parameters of a parametric model or even a whole function. Formally, a point estimator is any function whose input is a set of i.i.d. training examples and whose output approximates that quantity, such as the parameters of a learning algorithm. It does not need to return the true parameter values to qualify as an estimator. The resulting estimate of the parameters is itself a random variable, because it is a function of data generated by a random process.

33
Q

What is function estimation?

A

Point estimation of the relationship between an input and an output, i.e., a point estimator in function space. We would like to estimate a function that takes in some input and produces an output.

34
Q

What is the bias of an estimator?

A

The difference between the expected value of the estimator and the true value of the parameter.
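In symbols, for an estimator \hat{\theta}_m of a parameter \theta computed from m examples:

\operatorname{bias}(\hat{\theta}_m) = \mathbb{E}[\hat{\theta}_m] - \theta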

35
Q

What does it mean for an estimator to be unbiased?

A

The bias is zero.

36
Q

What does it mean for an estimator to be asymptotically unbiased?

A

As the number of examples approaches infinity, the bias becomes zero.

37
Q

What is the standard error of an estimator?

A

The square root of the variance of the estimated parameter.
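For example, for the sample mean of m samples drawn from a distribution with true variance \sigma^2, the standard result is:

\mathrm{SE}(\hat{\mu}_m) = \sqrt{\operatorname{Var}\left[\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right]} = \frac{\sigma}{\sqrt{m}}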

38
Q

Why does the sample mean (the sum of all samples divided by their count) have a variance proportional to the distribution's true variance and inversely proportional to the number of samples?

A

The variance of a random variable times a multiplicative constant is the variance of the variable times the square of the constant, and the variance of a sum of independent random variables is the sum of their variances. See Variance and its Properties.
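Putting the two properties together for m i.i.d. samples, each with variance \sigma^2:

\operatorname{Var}\left[\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right] = \frac{1}{m^2}\sum_{i=1}^{m}\operatorname{Var}[x^{(i)}] = \frac{m\sigma^2}{m^2} = \frac{\sigma^2}{m}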

39
Q

Name two ways of quantitatively comparing two estimators

A

Cross-validation and mean squared error

40
Q

How does the mean squared error of an estimator relate to bias and variance?

A

It is the sum of the squared bias and the variance, which in turn equals the expectation of the squared difference between the estimated value and the true value (see mean squared error of an estimator).
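In symbols:

\mathrm{MSE} = \mathbb{E}\left[(\hat{\theta}_m - \theta)^2\right] = \operatorname{Bias}(\hat{\theta}_m)^2 + \operatorname{Var}(\hat{\theta}_m)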

41
Q

How are capacity, variance, and bias related for an estimator?

A

Increasing capacity tends to increase variance and decrease bias. More bias means underfitting the model while more variance means overfitting.

42
Q

What is consistency?

A

The estimated parameter approaches the true parameter, in probability, as the number of examples increases. See Consistency and convergence in probability.

43
Q

Consistency is the common name for what specific form of consistency?

A

Weak consistency.

44
Q

What is strong consistency?

A

Almost sure convergence of the estimated parameter to the true parameter. See the almost sure convergence equation.

45
Q

Describe the difference between almost sure convergence and convergence in probability.

A

Almost sure convergence is the stronger notion: it implies convergence in probability, but the converse does not hold.

46
Q

How are consistency and bias related?

A

Consistency means that bias diminishes with more examples. However, asymptotic unbiasedness does not imply consistency.

47
Q

What is KL divergence?

A

A measure of the dissimilarity between two probability distributions. See the KL Divergence equation.
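The standard definition, for two distributions P and Q over the same variable x:

D_{\mathrm{KL}}(P \parallel Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]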

48
Q

What is another phrase for minimizing KL divergence?

A

Minimizing the cross entropy.

49
Q

What is the maximum likelihood estimator?

A

Given a set of examples generated from an unknown distribution, create a model from a parametric family of probability distributions that estimates the unknown distribution. Think of each example as an individual random variable, where the random variables are independent and identically distributed; the example data supplies a specific value for each. Feed the set of values into the model and choose the parameters that assign the data the highest probability. See the maximum likelihood estimator for the equation with the log applied and with normalization by the number of examples.
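In its common log form (as in the referenced equation), for m examples x^{(i)} and model family p_model:

\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\text{model}}\left(x^{(i)}; \theta\right)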

50
Q

How is minimizing the KL divergence related to the maximum likelihood estimator?

A

They are mathematically equivalent. Maximizing the probability of a set of examples under a model is the same as minimizing the dissimilarity between the model distribution and the empirical data distribution. See the KL Divergence and Maximum Likelihood Estimation equations.

51
Q

Generalize maximum likelihood estimation for conditional probability. Also explain why this is important for machine learning.

A

If we are given a set of example inputs and their corresponding responses or outputs, the relationship can be captured by a conditional probability: the probability of the output given the example input. Using a parametric family of probability distributions again as the model, the log-likelihood decomposes into a sum of logs over each example and its corresponding output, because the examples are assumed to be i.i.d. Note that this is very similar to the unconditional version of maximum likelihood estimation, except we have not normalized by the number of samples to get an expectation. See maximum likelihood estimation for conditional probability. This is important because supervised machine learning often deals with conditional probability.
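The conditional form (as in the referenced equation), for inputs x^{(i)} and outputs y^{(i)}:

\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log P\left(y^{(i)} \mid x^{(i)}; \theta\right)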
