Machine Learning Flashcards

1
Q

What is supervised learning?

A

A method of training a model on labelled data, where the model's predictions are compared against the known responses

2
Q

What is unsupervised learning?

A

A method of identifying structures within data without the aid of response variables

3
Q

What is a regression problem?

A

A supervised learning task where the response variable is continuous

4
Q

What is a classification problem?

A

A supervised learning task where the response variable is categorical

5
Q

What is the equation we solve for when graphically fitting a logistic regression model?

A

logit(0.5) = 0 = B0 + B1x1 + B2x2 [ + … + Bpxp]

  • rearranging the equation to make x2 the subject of the formula gives us a line on which the probability of “success” is equal to 0.5. Observations above and below the line will be predicted as different classes, so the line partitions the feature space
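
A rough sketch of the two-feature case in R, using simulated data (the data, seed and the helper name boundary_x2 are illustrative assumptions, not from the notes). Since logit(0.5) = 0, the boundary is x2 = -(B0 + B1x1)/B2:

```r
# Simulated two-feature data (hypothetical; not from the notes)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.5 * x1 + 2 * x2))

fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
b   <- unname(coef(fit))  # b[1] = B0, b[2] = B1, b[3] = B2

# Setting the linear predictor to logit(0.5) = 0 and solving for x2
boundary_x2 <- function(x1) -(b[1] + b[2] * x1) / b[3]

# Any point on the line should have a fitted probability of 0.5
p_on_line <- predict(fit,
                     newdata = data.frame(x1 = 0.3, x2 = boundary_x2(0.3)),
                     type = "response")
p_on_line
```

Points with x2 above or below boundary_x2(x1) fall on opposite sides of 0.5 and so receive different predicted classes.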
6
Q

At a high level, how do tree-based methods work?

A

They predict classes by partitioning the feature space along each of its dimensions. The partitioning results in regions, each of which is associated with a particular response.

7
Q

What is the R code for fitting a classification tree to a dataset and printing some information about the tree?

A

> library(tree)
treeModel = tree(Y ~ X1 + X2, data = yourData)  # Y must be a factor for a classification tree
summary(treeModel)

8
Q

What is a terminal node in the context of classification trees?

A

The points at which the binary tree stops splitting. Graphically, these are the end points of the tree and represent where prediction is finalized. Also known as leaf nodes

9
Q

What process do most software packages use in rigorously determining the partition points in the feature space?

A

Recursive binary splitting

10
Q

What is the R code to fit a logistic regression model to a dataset?

A

> logisticModel = glm(Y ~ . , family = binomial(link = "logit"), data = yourData)
summary(logisticModel)

11
Q

What is a regression tree?

A

A tree model where the response is a continuous number (e.g. house valuations) rather than a class

12
Q

What are the two main advantages of using tree models?

A
  • tree models can handle highly non-linear data
  • tree models are relatively easy to interpret despite their non-linearity

13
Q

What does RSS stand for?

A

Residual sum of squares

14
Q

What is the learning algorithm or procedure for fitting a tree?

A

Initial step: start with the entire unseparated feature space. For each dimension, find the split position which leads to the greatest reduction in RSS (provided that the reduction is above the minimum RSS reduction threshold). Retain the split with the largest reduction among these.

Iterate: Split the data according to the split in the initial step. In each of the two regions, perform the same procedure as in the initial step.

Stop: If no split can be found that exceeds the minimum RSS reduction threshold then stop and return the resulting tree.
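
A minimal sketch of the initial step for a single feature, assuming candidate splits are taken at midpoints between sorted values (the helper names rss and best_split and the toy data are illustrative, not from the notes). A full tree would repeat this over every dimension, apply the threshold, and recurse into each region:

```r
# RSS of a vector around its own mean
rss <- function(y) sum((y - mean(y))^2)

# Best split point for one feature x: try midpoints between sorted unique values
best_split <- function(x, y) {
  xs <- sort(unique(x))
  if (length(xs) < 2) return(NULL)
  cuts  <- (head(xs, -1) + tail(xs, -1)) / 2
  total <- sapply(cuts, function(s) rss(y[x < s]) + rss(y[x >= s]))
  i <- which.min(total)
  list(cut = cuts[i], reduction = rss(y) - total[i])
}

# Toy data (hypothetical): the response jumps once x exceeds 3.5
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.0, 1.2, 0.9, 5.1, 4.8, 5.0)
res <- best_split(x, y)
res$cut        # chosen split position
res$reduction  # reduction in RSS achieved by that split
```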

15
Q

What are impurity measures and what are 3 common impurity measures?

A

Measures of how mixed the classes are within each region of a classification tree, used in place of RSS when choosing splits. We cannot use traditional RSS in the case of binary data.

3 common impurity measures:

  • misclassification error
  • Gini-index
  • Shannon Entropy
16
Q

For a binary classification tree, what is the model's predicted probability for region R(j), also known as [pi hat]?

A

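[1/N].[sum of I(y=1)], where N is the number of observations in R(j) and the sum runs over those observations (i.e. the proportion of observations in the region with y = 1)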

17
Q

What is the equation for misclassification error in the context of binary trees?

A

1 - max(pi hat, 1- pi hat)

  • we want to minimize this quantity
18
Q

What is the equation for Gini-index in the context of binary trees?

A

2[pi hat].[1-(pi hat)]

19
Q

What is the equation for Shannon Entropy in the context of binary trees?

A
  • refer to notes
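
For comparison, the three binary impurity measures can be sketched in R. The entropy function below uses the standard binary form, -[p.log(p) + (1-p).log(1-p)], with log base 2 as an assumption; the notes may use a different base. The function names are illustrative:

```r
# p is pi hat: the proportion of observations in the region with class 1
misclass <- function(p) 1 - pmax(p, 1 - p)
gini     <- function(p) 2 * p * (1 - p)
entropy  <- function(p) {
  # Convention: 0 * log(0) is treated as 0
  term <- function(q) ifelse(q == 0, 0, -q * log2(q))
  term(p) + term(1 - p)
}

p <- c(0, 0.1, 0.5, 0.9, 1)
round(cbind(p, misclass = misclass(p), gini = gini(p), entropy = entropy(p)), 3)
```

All three are zero for a pure region (p = 0 or 1) and largest at p = 0.5.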
20
Q

What is one of the downsides of using tree-based methods?

A

They are prone to overfitting

21
Q

What is overfitting?

A

Overfitting is a situation where the model replicates feature patterns seen in a particular dataset instead of “learning” the true feature patterns of the underlying data

22
Q

What is in-sample-error?

A

A measure of the performance of a model based on seen data

23
Q

What is out-of-sample-error?

A

A measure of the performance of a model based on unseen data

24
Q

What is the bias-variance tradeoff?

A

Sufficiently complex models can in theory replicate the data used to train said models almost exactly. This is not good for purposes of learning any underlying pattern. Such models exhibit high variance and will not perform well on unseen data. Though we can simplify models to remedy this, we also run the risk of simplifying the model too much

25
Q

What is a biased model?

A

A model which does not replicate salient features in the underlying pattern, also producing poor predictions

26
Q

What does parsimonious mean?

A

Unwilling to spend. In the context of machine learning, this is used to describe more conservative and simpler models

27
Q

What are the two ways of ensuring that a tree model does not overfit to the dataset?

A
  • limit the number of terminal nodes by increasing the minimum RSS reduction threshold (this is not optimal because it might exclude useful partitions)
  • penalize the model for producing predictions that fit the data too closely (therefore Penalized Obj = Obj + Penalty); we are still trying to minimize this new objective function
28
Q

Give an example of a penalized objective function that could be used for a tree-based classification model?

A

Penalized obj. = Shannon entropy + alpha|J|
* where |J| is the number of terminal nodes in the tree and alpha is a scaling quantity that we can choose. Alpha is therefore a hyperparameter

29
Q

What is pruning in the context of fitted tree models?

A

The process of rolling the tree back to smaller sub-trees to avoid overfitting

30
Q

What is cost-complexity pruning in the context of fitted tree models?

A

The process of reducing the complexity of a tree by using an adjusted cost function which penalizes overfitting

31
Q

What is a balanced tree in the context of fitted tree models?

A

An optimal sub-tree which is sufficiently generalizable for the prediction problem at hand

32
Q

How do we select optimal tree size when looking at a graph of cost-complexity parameter (cp) and error?

A

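We select the simplest tree whose error overlaps (to within one standard error) with that of the best performing tree. The errors compared here should be cross-validated errors, since error on the training data alone always favours the largest tree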
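
A small sketch of this selection rule, assuming the errors and standard errors come from cross-validation. The numbers in the table are made up purely for illustration:

```r
# Hypothetical cross-validation results, ordered from simplest to most complex tree
cv <- data.frame(
  leaves = c(1, 2, 4, 7, 11),
  error  = c(1.00, 0.62, 0.48, 0.45, 0.47),
  se     = c(0.05, 0.05, 0.04, 0.04, 0.04)
)

best      <- which.min(cv$error)
threshold <- cv$error[best] + cv$se[best]  # best error plus one standard error

# Simplest tree whose error falls at or below the threshold
chosen <- cv$leaves[min(which(cv$error <= threshold))]
chosen
```

Here the 7-leaf tree has the lowest error, but the 4-leaf tree is within one standard error of it and is preferred for being simpler.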

33
Q

What is the validation set approach to splitting our data when modeling?

A

We divide the data into:

  • training set which will be used to train the model
  • validation set which will be used to tweak the model and validate its performance
  • test set which we only use once when reporting the true performance of the model
34
Q

What is the K-fold cross-validation approach to splitting our data when modeling?

A
  • We split the data into K partitions
  • K-1 folds/partitions are used to train the model and the remaining fold is used for validation
  • the folds are then cycled until all folds have been used for training and validation
  • test set is left untouched until reporting time
  • we utilize more of our data for training
  • we get K validation errors (a small sample), from which we can calculate a mean and variance
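
The cycling of folds can be sketched in base R. This assumes simulated data and a simple linear model as a stand-in for whatever model is being tuned; the held-out test set is not shown:

```r
set.seed(42)
n   <- 100
x   <- runif(n, 0, 10)
y   <- 2 + 3 * x + rnorm(n)   # hypothetical data
dat <- data.frame(x = x, y = y)

K    <- 5
fold <- sample(rep(1:K, length.out = n))  # random fold label for each row

cv_mse <- sapply(1:K, function(k) {
  train <- dat[fold != k, ]   # K-1 folds for training
  valid <- dat[fold == k, ]   # remaining fold for validation
  fit   <- lm(y ~ x, data = train)
  mean((predict(fit, newdata = valid) - valid$y)^2)
})

mean(cv_mse)  # average validation error across the K folds
var(cv_mse)   # spread of the K validation errors
```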
35
Q

What is a neural network?

A

A type of non-linear model which uses interconnected nodes and mathematical functions to produce responses. They are primarily used for supervised learning but can be easily adapted to unsupervised learning tasks

36
Q

In a graphical presentation of a neural network, what does a node represent?

A

An activation (or activation function)

37
Q

In a graphical presentation of a neural network, what does an edge represent?

A

A weight (or parameter)

38
Q

Why do we refer to a hidden layer as “hidden”?

A

Since the nodes in this layer do not contain any direct observations, their state cannot be known

39
Q

How do we calculate the depth of a neural network?

A

The number of hidden layers + 1

40
Q

What does a(j)^l denote?

A

The jth node in the lth layer of the network

41
Q

What does d(l) denote?

A

The number of nodes in the lth layer of the network

42
Q

What does w(kj)^l denote?

A

The weight parameter linking the kth node in layer l-1 and the jth node in layer l

43
Q

What does b(j)^l denote?

A

The jth bias in layer l
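
To tie the notation together, here is a sketch of computing one layer's activations. It assumes the usual feed-forward form a(j)^l = f(sum over k of w(kj)^l . a(k)^(l-1) + b(j)^l) with a logistic activation f; the numbers are arbitrary:

```r
logistic <- function(z) 1 / (1 + exp(-z))

a_prev <- c(0.5, -1.0, 2.0)   # activations in layer l-1, so d(l-1) = 3
W <- matrix(c(0.1, 0.2,
              0.3, 0.4,
              0.5, 0.6), nrow = 3, byrow = TRUE)  # W[k, j] = w(kj)^l, so d(l) = 2
b <- c(0.0, -0.1)             # b[j] = b(j)^l

# Weighted sum of the previous layer's activations plus bias, then activation
a_l <- logistic(as.vector(t(W) %*% a_prev) + b)
a_l
```

Storing the weights as a d(l-1) by d(l) matrix means row k, column j holds w(kj)^l, matching the index order on the card above.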

44
Q

What is the R code for creating a random sample of size n from a uniform distribution over bounds a and b?

A

> runif(n , a , b)

45
Q

What is the R code for creating a data object using matrices?

A

> data.frame(Y = someVector, X = someMatrixOrVector)

46
Q

What is the R code for fitting a neural net with two hidden layers of size n and n-1? (and also plotting the fitted network)

A

> library(neuralnet)
neuralNetModel = neuralnet(Y~X, hidden = c(n, n-1), data = yourData)
plot(neuralNetModel)

47
Q

What does DGP stand for?

A

Data generating process (i.e. the true underlying function which is producing the data)

48
Q

What is the name of the optimization technique used to fit neural networks and how does it work?

A

The gradient descent algorithm. It assesses the surface defined by the cost function and updates the parameters by stepping against the partial derivatives (slope of the surface), leading to a local minimum

x(i+1) = xi - lambda.g(xi)

  • where g() represents the partial derivative of the cost function at xi
  • where lambda represents the learning rate
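
A minimal one-parameter sketch, assuming a simple quadratic cost f(x) = (x - 3)^2 chosen only for illustration. Each update subtracts the learning rate times the gradient, and the loop stops on either a step limit or a small step size:

```r
# Gradient of the illustrative cost f(x) = (x - 3)^2
g <- function(x) 2 * (x - 3)

x         <- 0      # starting value
lambda    <- 0.1    # learning rate
tol       <- 1e-8   # stop when the step is smaller than this
max_steps <- 1000   # or after this many steps

for (i in 1:max_steps) {
  step <- lambda * g(x)
  x    <- x - step            # move against the gradient
  if (abs(step) < tol) break
}
x  # close to the minimum at 3
```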
49
Q

When will the gradient descent algorithm stop updating parameters?

A

Either
- after some predefined number of steps
or
- when some predefined step size is achieved (i.e. gradient levels out to some tolerance level)

50
Q

What does MSE stand for?

A

Mean Square Error

51
Q

How do you calculate MSE for a neural network?

A

[1/N].sum of [prediction - actual]²

52
Q

What are the 4 main activation functions used in neural networks?

A
  • the logistic function [output between 0 and 1]
  • the rectified linear unit function (ReLU) [output between 0 and inf]
  • the hyperbolic tangent function (tanh) [output between -1 and 1]
  • the identity function [output is -inf to inf] *typically for regression problems

*refer to notes for their equations
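
For reference, the standard forms of these four functions can be written in R as follows (the function names are illustrative; the notes may present the equations differently):

```r
logistic     <- function(z) 1 / (1 + exp(-z))  # output in (0, 1)
relu         <- function(z) pmax(0, z)         # output in [0, Inf)
tanh_act     <- function(z) tanh(z)            # output in (-1, 1)
identity_act <- function(z) z                  # output in (-Inf, Inf)

z <- c(-2, 0, 2)
rbind(logistic = logistic(z),
      relu     = relu(z),
      tanh     = tanh_act(z),
      identity = identity_act(z))
```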

53
Q

What is an objective function?

A

A function which we are trying to maximize or minimize

54
Q

What kind of objective functions do we use for neural networks?

A
  • MSE for regression problems
  • Cross-Entropy Error for classification problems
  • refer to notes for cross entropy equation
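
Both objective functions can be sketched in a few lines of R. The cross-entropy below is the common binary form averaged over observations, which is an assumption; the notes may define it as a sum or for more than two classes:

```r
mse <- function(pred, actual) mean((pred - actual)^2)

# Binary cross-entropy; pred holds predicted probabilities of class 1
cross_entropy <- function(pred, actual) {
  -mean(actual * log(pred) + (1 - actual) * log(1 - pred))
}

mse(c(2.5, 0.0, 2.0), c(3.0, -0.5, 2.0))       # regression example
cross_entropy(c(0.9, 0.2, 0.7), c(1, 0, 1))    # classification example
```

Both are minimized during training; cross-entropy penalizes confident wrong probabilities much more heavily than mildly wrong ones.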
55
Q

What is regularization in the context of neural networks?

A

A method of penalizing overfitting

56
Q

What is L2 regularization in the context of neural networks?

A

The adding of a penalty term, proportional to the sum of the squared weights, to the objective function. This shrinks the weights towards zero and so penalizes the model for overfitting

57
Q

What is the procedure for fitting an optimal neural network for a given architecture?

A

1) Start with an over-specified, unconstrained model

2) Find a value for the L2 regularization parameter which optimizes validation performance

58
Q

What does the term “epochs” refer to in the context of machine learning?

A

The number of passes of the entire training set the machine learning algorithm has completed. If the batch size is the entire training set, then the number of epochs will be the number of iterations.
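
The relationship between epochs, batch size and iterations is simple arithmetic. The sizes below are hypothetical:

```r
n_train    <- 1000  # hypothetical training set size
batch_size <- 100
epochs     <- 5

iters_per_epoch <- ceiling(n_train / batch_size)  # parameter updates per pass
total_iters     <- epochs * iters_per_epoch

# With the full training set as one batch, iterations equal epochs
full_batch_iters <- epochs * ceiling(n_train / n_train)
```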