Machine Learning Basics Flashcards

1
Q

What is a machine learning algorithm?

A

An algorithm that is able to learn from data.

2
Q

Name the most common machine learning tasks.

A

The common tasks are classification, classification with missing inputs, regression, transcription, machine translation, structured output, anomaly detection, synthesis and sampling, imputation of missing values, denoising, and density estimation or probability mass function estimation.

3
Q

What is the classification machine learning task?

A

A task that categorizes inputs into k separate categories. The algorithm produces a function that takes in a vector of n real values and outputs which of the k categories the input belongs to. Alternatively, the output of the function may be a probability distribution over the categories.
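Compactly, the learned function maps an n-dimensional real-valued input to one of the k categories:

f : \mathbb{R}^n \to \{1, \dots, k\}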

4
Q

What is the classification with missing inputs machine learning task?

A

This is a classification task where not every input value is guaranteed to be present in every example. A naive approach is to learn a set of functions, one for each possible subset of missing inputs, which requires 2^n functions for n inputs. A more sophisticated approach is to learn the joint probability distribution over all inputs and marginalize out the missing ones, so only one joint distribution must be learned.
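One way to write the marginalization step (splitting the input into observed and missing parts, x_obs and x_mis, is my notation):

p(y, x_{\text{obs}}) = \sum_{x_{\text{mis}}} p(y, x_{\text{obs}}, x_{\text{mis}})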

5
Q

What is the regression machine learning task?

A

This machine learning task takes a vector of real numbers and produces a single real number. It predicts a numerical value given some input.
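Compactly, the learned function has the form:

f : \mathbb{R}^n \to \mathbb{R}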

6
Q

What is the transcription machine learning task?

A

This machine learning task takes in relatively unstructured signal data, such as audio or images, and produces textual information. For example, audio waveforms turned into words, or images of addresses turned into text.

7
Q

What is the machine translation machine learning task?

A

This machine learning task takes a sequence of symbols in one language and translates it into a sequence of symbols in another language. Think of natural language translation.

8
Q

What is the structured output machine learning task?

A

This machine learning task covers all tasks where the output is a vector, or another collection of data, in which the relationships between elements are important. It includes the transcription and translation tasks. Think of an image captioning system, where the words must form a valid sentence.

9
Q

What is the Anomaly detection machine learning task?

A

This machine learning task sifts through a set of objects or events and identifies which ones are anomalous. Think of credit card fraud detection.

10
Q

What is the synthesis and sampling machine learning task?

A

This is a machine learning task where the algorithm generates new examples that are similar to those in the provided data set.

11
Q

What is the imputation of missing values machine learning task?

A

In this machine learning task, the algorithm is given a new example with some entries missing and must fill in (impute) the missing values.

12
Q

What is the denoising machine learning task?

A

This machine learning task takes in a corrupted (noisy) signal and predicts its corresponding clean signal.

13
Q

What is the density estimation or probability mass function estimation machine learning task?

A

This machine learning task learns a probability density function (for continuous inputs) or a probability mass function (for discrete inputs) over the space of all inputs.

14
Q

What are unsupervised learning algorithms?

A

An algorithm that experiences the entire data set, without labels, and identifies useful properties and structure. Typically the goal is the entire probability distribution that generated the data set, though the distribution may be learned only implicitly, as in the denoising task. Other times, the algorithm clusters similar data.

15
Q

What are supervised machine learning algorithms?

A

An algorithm that experiences a data set in which every example is paired with a label or target. Classification is a common example of supervised machine learning.

16
Q

Describe supervised and unsupervised algorithms in terms of the corresponding probability function they are attempting to predict.

A

Supervised algorithms predict the output given some input, i.e., a conditional probability. Unsupervised algorithms predict the probability distribution governing the underlying data.
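In symbols, for input vector x and output y, the two settings estimate:

p(y \mid \mathbf{x}) \ \text{(supervised)} \qquad p(\mathbf{x}) \ \text{(unsupervised)}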

17
Q

What is the mean squared error of a predictor?

A

A function that evaluates the error between a predicted vector and an actual vector. It compares corresponding components of the two vectors, summing the squares of their differences and dividing by the number of components.

If we have two vectors y and x, the mean squared error is the squared Euclidean distance of y - x divided by the number of elements m. The squared Euclidean distance is just the squared L2 norm, i.e., the sum over all components of the squared difference between y and x. The total is

\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} (y_i - x_i)^2

See Mean Squared Error of a Predictor.
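A minimal NumPy sketch of the same computation (the helper name mse is my own):

    import numpy as np

    def mse(y, x):
        # Mean squared error between two equal-length vectors.
        diff = np.asarray(y) - np.asarray(x)
        return np.mean(diff ** 2)  # (1/m) * sum of squared differences

    # Example: mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]) -> 0.41666...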

18
Q

What are the i.i.d. assumptions?

A

The assumptions that the examples in a data set are independent of one another and that the training set and test set are identically distributed, meaning they come from the same probability distribution. This allows the training and test data to be treated as draws from the same underlying data-generating distribution.

19
Q

What is underfitting?

A

When a model does not achieve a low enough error value for the training data.

20
Q

What is overfitting?

A

When the gap in error value between the training data and test data is too large.

21
Q

What is the hypothesis space of a machine learning model?

A

The set of functions the machine learning model is allowed to select from as the solution. For a linear regression model, this space is all linear functions of its input.

22
Q

What is the capacity of a machine learning model?

A

A model’s ability to fit a variety of functions. More capacity means more functions are available to it.

23
Q

How is capacity, underfitting, and overfitting related?

A

Models with too much capacity can overfit by memorizing properties of the training data that do not generalize to the test data, while models with too little capacity can underfit by failing to produce a function that matches the training data.

24
Q

How are capacity and hypothesis space related?

A

Adding more variety to the hypothesis space (for example, polynomials instead of only linear functions) increases the model's capacity.

25
Q

What is the key result of statistical learning theory that affects capacity, number of training examples, and machine learning models?

A

The capacity of the model provides an upper bound on the discrepancy between training and test error. Higher capacity means a higher upper bound, and the bound shrinks as the number of training examples increases. It is important to note that the capacity of a deep learning model is difficult and impractical to compute exactly.

26
Q

What is the Bayes error?

A

Imagine a perfect model that knows the true probability distribution used to generate the data. Even this oracle would make some errors, because the distribution itself may be noisy, i.e., the output may not be fully determined by the input. The error incurred by predicting from the true distribution is the Bayes error.

27
Q

What is the no free lunch theorem?

A

A theorem stating that, averaged over all possible data-generating distributions, every classification algorithm has the same error rate on unseen data points. In other words, you cannot build a universal machine learning algorithm that solves every type of classification task for all possible data sets. Instead, one should aim to make a machine learning algorithm that produces the best results on the kinds of real-world input it is most likely to encounter.

28
Q

What is weight decay?

A

Adjust the cost function to account for the total size of the weights by adding the squared L2 norm of the weights times a hyperparameter lambda. A higher lambda punishes larger weights (a high slope in the case of linear regression), while a lambda near zero allows for larger weights. Another way to think of this is that fewer features are effectively utilized.
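Written out for linear regression, with weight vector w and regularization hyperparameter lambda, the modified criterion is:

J(\mathbf{w}) = \mathrm{MSE}_{\text{train}} + \lambda \mathbf{w}^\top \mathbf{w}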

29
Q

Describe regularization

A

Any modification made to a learning algorithm that is intended to reduce its generalization error but not its training error.

30
Q

What is the validation set?

A

A set of examples disjoint from the training and test sets. It is used to estimate the generalization error of the model and to adjust the hyperparameters accordingly.

31
Q

What is the k-fold cross-validation algorithm?

A

This algorithm is used when too little data is available to properly estimate the generalization error of a machine learning algorithm with a standard train/test split. Break the data into k non-overlapping subsets. Perform k trials, using the i-th subset as the test set on trial i and the remaining data for training. Average the test error across the k trials to approximate the generalization error.
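A minimal sketch of the procedure in plain NumPy (the fit/predict model interface and the function name k_fold_error are my own assumptions):

    import numpy as np

    def k_fold_error(model, X, y, k=5):
        # Estimate generalization error by k-fold cross-validation.
        indices = np.arange(len(X))
        folds = np.array_split(indices, k)  # k non-overlapping subsets
        errors = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            model.fit(X[train_idx], y[train_idx])  # train on the other k-1 folds
            pred = model.predict(X[test_idx])      # evaluate on the held-out fold
            errors.append(np.mean((pred - y[test_idx]) ** 2))
        return np.mean(errors)  # average test error over the k trials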

32
Q

What is a point estimator or statistic?

A

A function that attempts to provide the best estimate of some quantity of interest, for example the parameters of a parametric model or even a whole function. Formally, a point estimator is any function whose input is a set of i.i.d. training examples and whose output approximates that quantity, such as the parameters of a learning algorithm. It does not need to return the true parameter values to qualify as an estimator. The resulting estimate of the parameters is itself a random variable, because it is a function of data generated by a random process.

33
Q

What is function estimation?

A

Point estimation of the relationship between an input and an output, i.e., a point estimator in function space. We would like to estimate a function that takes in some input and produces an output.

34
Q

What is the bias of an estimator?

A

The difference between the expected value of the estimator and the true value of the parameter.
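In symbols, for an estimator \hat{\theta}_m of a parameter \theta computed from m examples:

\operatorname{bias}(\hat{\theta}_m) = \mathbb{E}[\hat{\theta}_m] - \theta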

35
Q

What does it mean for an estimator to be unbiased?

A

The bias is zero.

36
Q

What does it mean for an estimator to be asymptotically unbiased?

A

As the number of examples approaches infinity, the bias becomes zero.

37
Q

What is the standard error of an estimator?

A

The square root of the variance of the estimated parameter.
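For example, for the sample mean of m samples drawn from a distribution with true variance \sigma^2, the standard result is:

\mathrm{SE}(\hat{\mu}_m) = \sqrt{\operatorname{Var}\left[\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right]} = \frac{\sigma}{\sqrt{m}}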

38
Q

Why does the sample mean (the sum of all samples divided by their count) have a variance proportional to the distribution's true variance and inversely proportional to the number of samples?

A

The variance of a random variable times a multiplicative constant is the variance of the variable times the square of the constant, and the variance of a sum of independent random variables is the sum of their variances. See Variance and its Properties.
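Putting the two properties together for m i.i.d. samples, each with variance \sigma^2:

\operatorname{Var}\left[\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right] = \frac{1}{m^2}\sum_{i=1}^{m}\operatorname{Var}[x^{(i)}] = \frac{m\sigma^2}{m^2} = \frac{\sigma^2}{m}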

39
Q

Name two ways of quantitatively comparing two estimators

A

Cross-validation and mean squared error

40
Q

How does the mean squared error of an estimator relate to bias and variance?

A

It is the sum of the squared bias and the variance, which in turn equals the expectation of the squared difference between the estimated value and the true value (see mean squared error of an estimator).
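In symbols:

\mathrm{MSE} = \mathbb{E}\left[(\hat{\theta}_m - \theta)^2\right] = \operatorname{Bias}(\hat{\theta}_m)^2 + \operatorname{Var}(\hat{\theta}_m)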

41
Q

How are capacity, variance, and bias related for an estimator?

A

Increasing capacity tends to increase variance and decrease bias. More bias means underfitting the model while more variance means overfitting.

42
Q

What is consistency?

A

The estimated parameter approaches the true parameter, in probability, as the number of examples increases. See Consistency and convergence in probability.

43
Q

Consistency is the common name for what specific form of consistency?

A

Weak consistency.

44
Q

What is strong consistency?

A

Almost sure convergence of the estimated parameter to the true parameter. See the almost sure convergence equation.

45
Q

Describe the difference between almost sure convergence and convergence in probability.

A

Almost sure convergence is the stronger notion: it implies convergence in probability, but the converse does not hold.

46
Q

How are consistency and bias related?

A

Consistency means that bias diminishes with more examples. However, asymptotic unbiasedness does not imply consistency.

47
Q

What is KL divergence?

A

A measure of the dissimilarity between two probability distributions. See the KL Divergence equation.
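The standard definition, for two distributions P and Q over the same variable x:

D_{\mathrm{KL}}(P \parallel Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]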

48
Q

What is another phrase for minimizing KL divergence?

A

Minimizing the cross entropy.

49
Q

What is the maximum likelihood estimator?

A

Given a set of examples generated from an unknown distribution, create a model from a parametric family of probability distributions that estimates the unknown distribution. Think of each example as an individual random variable, where the random variables are independent and identically distributed; the example data supplies a specific value for each. Feed the set of values into the model and choose the parameters that assign the data the highest probability. See the maximum likelihood estimator for the equation with the log applied and with normalization by the number of examples.
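In its common log form (as in the referenced equation), for m examples x^{(i)} and model family p_model:

\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\text{model}}\left(x^{(i)}; \theta\right)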

50
Q

How is minimizing the KL divergence related to the maximum likelihood estimator?

A

They are mathematically equivalent. Maximizing the probability of a set of examples under a model is the same as minimizing the dissimilarity between the model distribution and the empirical data distribution. See the KL Divergence and Maximum Likelihood Estimation equations.

51
Q

Generalize maximum likelihood estimation for conditional probability. Also explain why this is important for machine learning.

A

If we are given a set of example inputs and their corresponding responses or outputs, the relationship can be captured by a conditional probability: the probability of the output given the example input. Using a parametric family of probability distributions again as the model, the log-likelihood decomposes into a sum of logs over each example and its corresponding output, because the examples are assumed to be i.i.d. Note that this is very similar to the unconditional version of maximum likelihood estimation, except we have not normalized by the number of samples to get an expectation. See maximum likelihood estimation for conditional probability. This is important because supervised machine learning often deals with conditional probability.
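The conditional form (as in the referenced equation), for inputs x^{(i)} and outputs y^{(i)}:

\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log P\left(y^{(i)} \mid x^{(i)}; \theta\right)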
