Topic 2: Machine Learning: Introduction to Algorithms Flashcards

1
Q

Explain why we estimate a function with data

A

For prediction or inference reasons

2
Q

What is the role of error terms (reducible and

irreducible) and why is the irreducible error larger than zero?

A

Reducible error occurs when the estimate for f can be improved (e.g. by using a better statistical model).

Irreducible error is the part of the error that cannot be reduced (because Y is also a function of the error)

The error is larger than zero because the error may contain unmeasured variables that are useful in predicting Y.

3
Q

Difference between prediction and inference

A

Prediction uses X to predict Y (f is treated as a black box).

Inference is estimating f without necessarily making predictions for Y (f cannot be treated as a black box).

4
Q

Difference between a parametric and non-parametric approach when applying a statistical learning method to the training data.

A

Parametric approach involves a two-step model-based approach.

  1. Make an assumption about the functional form, or shape.
  2. Select procedure that uses the training data to fit or train the model (e.g. OLS).

Non-parametric approaches do not make an explicit assumption about the functional form or shape of f. The downside is that they need a large dataset.

5
Q

Describe the trade-offs between prediction accuracy, flexibility, and model interpretability, including the role of overfitting.

A

As the flexibility of a method increases its interpretability decreases. Highly flexible methods have a greater potential for overfitting.

6
Q

Reason that we might prefer a more restrictive model

A

When we are mainly interested in inference, restrictive models are more interpretable.

7
Q

When is a supervised learning model preferable to unsupervised?

A

Supervised learning models are used when you want to fit a model that relates the response to predictors.

With unsupervised learning models there is no response variable to supervise the analysis (e.g. clustering).

8
Q

Difference between quantitative and qualitative problems?

A

Quantitative -> regression problems (numerical values)

Qualitative -> categorical problems (classes)

9
Q

Interpret the Mean Squared Error (MSE)

A

MSE will be smaller if the predicted responses are very close to the true responses.
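A minimal Python sketch of this idea, with made-up numbers: the closer the predictions are to the true responses, the smaller the MSE.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

close = mse([3.0, 5.0, 7.0], [2.9, 5.1, 7.0])  # predictions near the truth
far = mse([3.0, 5.0, 7.0], [1.0, 8.0, 4.0])    # predictions far from the truth
```

Here `close` is tiny while `far` is large, matching the card.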

10
Q

Explain the goal of measuring the quality of fit by minimizing training and test mean square errors (MSEs)

A

quality of fit -> how well predictions match observed data

The quality of fit is measured by MSE, you want to choose the method that has the lowest test MSE.

11
Q

Implications of different levels of flexibility

(degrees of freedom) for both training and test MSEs.

A

As model flexibility increases, training MSE will decrease, but the test MSE may not.

Overfitting the data (small training MSE but large test MSE)

12
Q

What does it mean when a method is overfitting the data?

A

When the model is working too hard to find patterns in the training data, and may be picking up some spurious patterns.

13
Q

Explain the purpose of cross-validation.

A

A method for estimating test MSE using the training data

14
Q

Explain the bias-variance trade-off with an MSE decomposition into three fundamental quantities.

A

Expected test MSE can be decomposed into:

  1. the variance of f̂ (the estimate of f)
  2. the squared bias of f̂
  3. the variance of the error term ε

Lower bias (better fit of training data) can lead to higher variance (in the testing data)
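A numerical sketch of the decomposition, under an assumed toy setup (a constant true function f = 2 with noise variance 1, estimated by a sample mean of n = 5 draws; all numbers are illustrative):

```python
import random

random.seed(0)
f_true, noise_sd, n = 2.0, 1.0, 5

# Repeatedly draw a training sample and record the resulting estimate of f.
estimates = []
for _ in range(20000):
    sample = [f_true + random.gauss(0, noise_sd) for _ in range(n)]
    estimates.append(sum(sample) / n)

mean_est = sum(estimates) / len(estimates)
var_est = sum((e - mean_est) ** 2 for e in estimates) / len(estimates)

# Expected test MSE ≈ variance of the estimate + squared bias + noise variance.
expected_mse = var_est + (mean_est - f_true) ** 2 + noise_sd ** 2
```

With this estimator the variance term comes out near 1/n = 0.2 and the bias near zero, so the expected test MSE is dominated by the irreducible noise variance of 1.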

15
Q

Ideal machine learning algorithm characteristics

A

The algorithm has low bias (can model the true relationship accurately) and low variance (by producing consistent predictions across different datasets)

16
Q

Describe the features of a Bayes classifier (two classes)

A

A Bayes classifier assigns each observation to the most likely class given its predictor values.

The Bayes decision boundary is the line representing the points where the probability is exactly 50%.

17
Q

What is the Bayes error rate?

A

It is the lowest possible test error rate in classification which is produced by the Bayes classifier

18
Q

Apply/calculate the Bayes error rate

A

1 minus the expected probability of the most likely class:

1 − E[max_j Pr(Y = j | X)]
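A toy Python calculation of this formula, with made-up class probabilities for two values of X:

```python
# x -> [Pr(Y = 0 | X = x), Pr(Y = 1 | X = x)]  (illustrative numbers)
cond_probs = {
    "x1": [0.9, 0.1],
    "x2": [0.4, 0.6],
}
px = {"x1": 0.5, "x2": 0.5}  # marginal distribution of X

# Bayes error rate = 1 - E[ max_j Pr(Y = j | X) ]
bayes_error = 1 - sum(px[x] * max(p) for x, p in cond_probs.items())
```

Even the Bayes classifier errs 10% of the time at x1 and 40% at x2, giving an overall error rate of 0.25.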

19
Q

How is K-nearest neighbor classifier related to the Bayes classifier?

A

KNN attempts to estimate the conditional distribution of Y given X, then classifies a given observation to the class with the highest estimated probability.
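A minimal KNN classifier sketch in plain Python (training points and labels are made up): classify a point by majority vote among its K nearest training points.

```python
from collections import Counter

def knn_predict(train, point, k):
    """train: list of ((x1, x2), label) pairs; majority vote among k nearest."""
    by_distance = sorted(
        train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], point))
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B")]
```

A query near the "A" cluster gets label "A"; one near the "B" cluster gets "B".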

20
Q

What are the effects of a low K and high K in K-nearest neighbor?

A

Lower K on training data gives higher flexibility (or low bias) but has a very high variance when applied to test data.

High K will give lower flexibility (high bias) but lower variance when applied to test data (decision boundary will become more linear with higher Ks)

21
Q

Describe the use of a training set as an alternative to a rules-based program to recognize digits.

A

A rules-based program would lead to a lot of exceptions and caveats, while with training data you can use examples to infer rules for recognizing digits.

22
Q

What is a perceptron and how does it work?

A

A perceptron is an artificial neuron used in a neural network. It takes several binary inputs (x1, x2, x3, ...) and produces a single binary output.

23
Q

Calculate the output of a perceptron neuron

A

The output is 1 if the weighted sum Σj wj·xj + bias is positive, and 0 otherwise.
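The rule on this card as a short Python sketch (weights, inputs, and bias are illustrative):

```python
def perceptron_output(inputs, weights, bias):
    """Returns 1 if the weighted sum plus the bias is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0
```

For example, with weights (0.6, 0.5, 0.4) and bias −1.0, turning on the first two inputs pushes the sum to 0.1 and fires the neuron.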

24
Q

Describe the intuition of a perceptron as a decision-making device

A

You use perceptrons to weigh evidence to make decisions.

25
Q

Describe a perceptron as a NAND gate and what it implies for perceptron networks concerning computing logical functions.

A

With suitable weights and bias (e.g. w1 = w2 = −2 and bias = 3), a perceptron outputs 0 only when both inputs are 1 and outputs 1 otherwise, i.e. it implements a NAND gate. Since NAND gates are universal for computation, networks of perceptrons can compute any logical function.
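A sketch of a NAND perceptron in Python, using one common choice of weights and bias (w1 = w2 = −2, bias = 3); the output is 0 only for the input (1, 1):

```python
def nand(x1, x2):
    """Perceptron with weights -2, -2 and bias 3, acting as a NAND gate."""
    return 1 if (-2 * x1) + (-2 * x2) + 3 > 0 else 0
```

Checking all four input pairs reproduces the NAND truth table.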

26
Q

Explain how perceptron neurons are more than new types of NAND gates.

A

With perceptrons you can create learning algorithms that automatically tune the weights and biases of a network of neurons (i.e. more than just a conventional circuit of NAND gates).

27
Q

Name a limitation of perceptron neurons that can be overcome by sigmoid neurons?

A

A small change in a perceptron's weights or bias can completely flip its output (from 0 to 1 or vice versa), which makes gradual learning hard; sigmoid neurons can be modified so that small changes in their weights and bias cause only a small change in output.

28
Q

How does a perceptron neuron differentiate from a sigmoid neuron?

A

A perceptron takes on the value 1 or 0, while a sigmoid neuron's output ranges continuously between 0 and 1. It takes a value close to 1 for very large positive inputs and a value close to 0 for very large negative inputs.
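The sigmoid function itself is a one-liner; a quick Python check of the behavior described above:

```python
import math

def sigmoid(z):
    """Smoothly maps any real input into the interval (0, 1)."""
    return 1 / (1 + math.exp(-z))
```

`sigmoid(0)` is exactly 0.5, while large positive and large negative inputs land near 1 and 0 respectively.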

29
Q

Explain the importance of the smoothness of the sigmoid function.

A

The smoothness of the function means that a small change in the weights and in the bias will produce a small change in the output (rather than a jump between 0 and 1).

30
Q

Identify components of a simple network with appropriate terminology.

A

On the far left is the input layer, with input neurons; on the far right is the output layer, with output neurons. Any layer in between is called a hidden layer.

31
Q

Describe the central feature of a feed-forward network.

A

Information is always fed forward, never fed back

32
Q

Compare and contrast feedforward networks with recurrent networks.

A
  • Recurrent networks allow for feedback loops
  • They have been less influential, partly because their learning algorithms are (to date) less powerful
  • They are closer in spirit to how our brain works
34
Q

Calculate the required input neurons for classifying an individual digit in an image of a specific size in pixels.

A

Width in pixels times height in pixels (e.g. a 28 × 28 pixel image needs 28 × 28 = 784 input neurons).

35
Q

Explain the choice to use ten output neurons instead of four for classifying an individual digit.

A

With four output neurons (encoding the digit in binary), the network would have trouble determining what the most significant bit of the digit was, because the parts of a digit that the hidden layers detect do not map neatly onto individual bits.

36
Q

Explain why minimizing a quadratic cost function is preferable to working with other types of cost functions.

A

It is a smooth function that makes it easy to measure improved performance by changing weights and biases.

37
Q

How can you apply gradient descent in a neural network?

A

You can use gradient descent to find the weights and biases which minimize the cost in the cost function
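A toy gradient descent sketch: minimize a one-parameter quadratic cost C(w) = (w − 3)², whose gradient is dC/dw = 2(w − 3), by repeatedly stepping against the gradient. The learning rate and iteration count are arbitrary choices.

```python
w = 0.0
learning_rate = 0.1
for _ in range(100):
    grad = 2 * (w - 3)       # gradient of the cost at the current w
    w -= learning_rate * grad  # step downhill
```

After enough steps, w converges to the minimizer w = 3; in a network, the same update is applied to every weight and bias using the gradient of the cost function.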

38
Q

Explain how quickly stochastic gradient descent can speed up learning given a training set size, n, and a mini-batch size, m.

A

You estimate the gradient by computing the gradient of the cost function for a small sample of randomly chosen training inputs.

training set size (n) = 6,000
mini-batch size (m) = 10
speedup in estimating the gradient = n/m = 600
39
Q

Describe the role of hyper-parameters and their impact on output for each epoch.

A

Hyper-parameters are the variables that set the network structure and are set before training (e.g. learning rate). If you choose your hyper-parameters poorly, you can get bad results.

40
Q

Describe deep learning in terms of neural networks and their performance relative to networks that are not based on deep learning methods.

A

Deep learning networks are based on stochastic gradient descent and backpropagation, among other techniques. They have more layers than shallow neural networks and perform far better on many problems.

41
Q

Describe Reinforcement Learning (RL)

A

A subfield of ML that teaches an agent how to choose an action from its action space, within a particular environment, in order to maximize rewards over time.

42
Q

Construct a task as a Reinforcement Learning problem.

A

An RL problem has four essential elements:

  1. An agent (the program you train)
  2. Environment (The world, real or virtual)
  3. Action (A move made by the agent)
  4. Rewards (Evaluation of an action, positive or negative)
43
Q

Compare and contrast RL with supervised and unsupervised learning.

A
  1. Static vs. Dynamic -> RL = Dynamic
  2. No Explicit Right Answer -> RL = trial and error
  3. RL Requires Exploration -> RL looks for new ways
  4. RL is a Multiple-Decision Process -> RL = decision chain
44
Q

Define the Markov property and Markov chain.

A

Markov property: the memoryless property of a stochastic (i.e. randomly determined) process.

Markov Chain: Stochastic model whereby the probability of each event depends only on the state attained in the previous event.

45
Q

Describe how the Markov property works.

A

The state of X at time t+1 only depends on the preceding state of X at time t, and is independent of past states.

46
Q

Describe how the Markov chain puts the Markov property into action.

A

The Markov chain works with S, a set of states, and P, the probability of transitioning from one state to the next.
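A small Python sketch of a Markov chain with two states and made-up transition probabilities; note that sampling the next state uses only the current state's row of P:

```python
import random

# P[s][s'] = probability of moving from state s to state s' (illustrative)
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(state, rng=random.random):
    """Sample the next state from the current state's transition row."""
    r, cumulative = rng(), 0.0
    for candidate, prob in P[state].items():
        cumulative += prob
        if r < cumulative:
            return candidate
    return candidate  # guard against floating-point rounding
```

Passing a fixed `rng` value makes the sampling deterministic, which is handy for checking the transition logic.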

47
Q

How does a Markov Decision Process differ from the Markov Chain?

A

It brings actions into play. The next state is not only related to the current state but also the actions taken in the current state.

48
Q

Define the MDP

A

An MDP is a 5-tuple (S, A, P, R, γ) where:

  1. S is a set of states
  2. A is a set of actions
  3. P is the probability that action a in state s at time t leads to state s' at time t+1
  4. R is the immediate reward received after a transition from state s to s' due to action a
  5. γ is the discount factor
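The 5-tuple can be written down as a plain Python structure; every state, action, probability, and reward value below is made up for illustration:

```python
from collections import namedtuple

MDP = namedtuple("MDP", ["S", "A", "P", "R", "gamma"])

mdp = MDP(
    S={"s0", "s1"},
    A={"stay", "go"},
    P={("s0", "go", "s1"): 1.0, ("s0", "stay", "s0"): 1.0},  # Pr(s' | s, a)
    R={("s0", "go", "s1"): 1.0, ("s0", "stay", "s0"): 0.0},  # reward for (s, a, s')
    gamma=0.9,
)
```

Indexing P and R by the (state, action, next state) triple mirrors the definitions on the card.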
49
Q

Explain why and how to use discounted rewards.

A

Future rewards can be valued differently than current rewards.

You can use the discount rate to discount rewards from time t to T.
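A short Python sketch of discounting (the reward sequence and discount factors are made up): each reward t steps ahead is multiplied by gamma to the power t.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

near_sighted = discounted_return([1, 1, 1], gamma=0.1)  # future barely counts
far_sighted = discounted_return([1, 1, 1], gamma=0.9)   # future counts heavily
```

With gamma = 0.1 the return is close to the immediate reward alone, while gamma = 0.9 gives future rewards substantial weight.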

50
Q

Describe the rewards in relation to the discount rate at different discount rates.

A

If the discount rate is close to 0, future rewards won’t count for much in comparison to immediate rewards; if it is close to 1, future rewards count almost as much as immediate ones.