Exam Questions Flashcards

1
Q

Define what is meant by supervised learning

A
Supervised learning is 
... the machine learning task of 
... learning a function 
... that maps an input to an output 
... based on example input-output pairs.
2
Q

Discuss the training data used in supervised learning

A

Supervised learning infers a function from a set of LABELED training data.

Each example is a pair consisting of

  • an input object (typically a vector) and
  • a desired output value (also called the supervisory signal).
3
Q

Discuss the desired result of a supervised learning algorithm

A

A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.

An optimal scenario will allow for the algorithm to correctly determine the class labels for UNSEEN instances.

This requires the learning algorithm to generalize from the training data to unseen situations in a “reasonable” way.

4
Q

Define unsupervised learning

A

Unsupervised learning is a branch of machine learning that learns from data that has not been labeled, classified or categorized.

Instead of responding to feedback, unsupervised learning
… IDENTIFIES COMMONALITIES in the data and
… reacts based on the presence or absence of such commonalities in each new piece of data.

5
Q

2 Main classes of supervised learning problems

A
  • classification
  • regression

6
Q

Supervised learning:

Define classification

A

Classification is the problem of
identifying to which of a set of categories (sub-populations)
a new observation belongs,
on the basis of a training set of data
containing observations (or instances) whose category membership is known.

7
Q

Supervised learning:

Define regression analysis

A

Regression analysis is a set of statistical processes for estimating the relationships among variables.

Regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed.

Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables.

8
Q

2 of the objectives of supervised learning

A
  • Inference
  • Prediction

9
Q

Define statistical inference

A

Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution.

Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates.

It is assumed that the observed data set is sampled from a larger population.

10
Q

Define predictive inference

A

Predictive inference is an approach to statistical inference that emphasizes the prediction of future observations based on past observations.

11
Q

Describe least squares linear regression

A

Ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model.

OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being predicted) in the given dataset and those predicted by the linear function.
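
As a minimal illustration (not part of the original cards), the OLS estimate can be computed directly from the least squares criterion; the toy data and variable names below are assumptions.

```python
import numpy as np

# Toy data: n observations, p explanatory variables (assumed example data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=100)

# Add an intercept column and solve the least squares problem.
# lstsq is used rather than an explicit matrix inverse for numerical stability.
X1 = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

residuals = y - X1 @ beta_hat
print(beta_hat, (residuals ** 2).sum())  # estimated coefficients and the RSS being minimized
```
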

12
Q

5 Assumptions of Ordinary Least Squares (OLS) Linear Regression

A
  • Linearity (Correct specification)
  • Strict exogeneity
  • No linear dependence in the regressors
  • Spherical errors
  • Normality
13
Q

Assumptions of Ordinary Least Squares (OLS) Linear Regression:

Linearity (Correct specification)

A

The mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables.

14
Q

Assumptions of Ordinary Least Squares (OLS) Linear Regression:

Normality of the errors

A

The errors have normal distribution conditional on the regressors:

ε | X ~ N(0, σ² Iₙ )

15
Q

Assumptions of Ordinary Least Squares (OLS) Linear Regression:

Lack of perfect multicollinearity in the predictors

A

For standard OLS estimation methods, the matrix X must have full column rank, p.

16
Q

Assumptions of Ordinary Least Squares (OLS) Linear Regression:

Strict exogeneity

A

The errors in the regression should have a conditional expectation of zero: E[ ε | X ] = 0

The immediate consequences of the exogeneity assumption are that:

  • the errors have an expectation of zero: E[ ε ] = 0
  • the regressors are uncorrelated with the errors: E[ Xᵀε ] = 0
17
Q

Assumptions of Ordinary Least Squares (OLS) Linear Regression:

Spherical errors

A

Var[ ε | X ] = σ² Iₙ , where Iₙ is the identity matrix of dimension n and σ² is a parameter which determines the variance of each observation.
This assumption can be split into two parts:

Homoscedasticity:
E[ εᵢ² | X ] = σ², which means that the error term has the same variance σ² in each observation.

No autocorrelation:
The errors are uncorrelated between observations: E[ εᵢ εⱼ | X ] = 0 for i ≠ j .

18
Q

K-nearest neighbours classification

A

A non-parametric method used for classification.

The input consists of the k closest training examples in the feature space.

The output is a class membership.

An object is classified by a majority vote of its neighbours, with the object being assigned to the class most common among its k nearest neighbours.
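
A minimal sketch of majority-vote k-NN classification using only NumPy; the toy data and function names are assumptions, not from the cards.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distances from x_new to every training example
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    votes = Counter(y_train[nearest])        # count the class labels among them
    return votes.most_common(1)[0][0]        # the most common class wins

# Toy usage
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.5, 0.5]), k=3))  # -> 0
```
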

19
Q

K-nearest neighbours regression

A

A non-parametric method used for regression.

The input consists of the k closest training examples in the feature space.

The output is the property value for the object.

The value is the average of the values of its k nearest neighbours.

20
Q

History of Neural Networks:

McCulloch and Pitts

A

1943
Warren McCulloch - a neurophysiologist
Walter Pitts - a mathematician

They wrote a paper on how neurons might function.
They modelled a simple neural network with electrical circuits.

21
Q

History of Neural Networks:

2 Major concepts that preceded Neural Networks

A
  • Threshold logic - converting continuous input to discrete output
  • Hebbian learning - a model of learning based on neural plasticity, proposed in “The Organization of Behaviour” by Donald Hebb in 1949
22
Q

McCulloch-Pitts neuron

A

Takes a weighted sum of some inputs and returns ‘0’ if the result is below a threshold and ‘1’ otherwise.
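
A sketch of this thresholded weighted sum; the weights and threshold below are made-up values chosen to realise an AND gate.

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0

# Example: an AND gate realised with weights (1, 1) and threshold 2
print(mcculloch_pitts([1, 1], [1, 1], 2))  # 1
print(mcculloch_pitts([1, 0], [1, 1], 2))  # 0
```
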

23
Q

History of Neural Networks:

Mark I Perceptron

A

In 1958, Frank Rosenblatt - a psychologist at Cornell - proposed the idea of a Perceptron.

It was a system with a simple input-output relationship, modelled on a McCulloch-Pitts neuron.

24
Q

History of Neural Networks:

Advantage of the Mark I Perceptron

A

Its weights could be ‘learnt’ through successively passed inputs, while minimizing the difference between desired and actual output.

25
Q

History of Neural Networks:

Major shortcoming of the Mark I Perceptron

A

It could only learn to separate linearly separable classes, and thus it wasn’t able to model the simple, but non-linear XOR (exclusive-or) circuit.

26
Q

History of Neural Networks:

Marvin Minsky’s book

A

Marvin Minsky’s book “Perceptrons” argued that Rosenblatt’s single perceptron approach to neural networks could not be translated effectively into multi-layered neural networks.

Evaluating the correct relative values of the weights of neurons spread across layers, based only on the final output, would take a very large (if not infinite) number of iterations.

27
Q

History of Neural Networks:

Backpropagation

A

Backpropagation’s potential for neural networks was first noticed by Paul Werbos, who described its importance in his PhD thesis.

This work was re-discovered by Rumelhart, Hinton and Williams, who republished it in a clear and detailed framework.

28
Q

What is meant by feature extraction?

A

Feature extraction is a process of dimensionality reduction by which an initial set of raw data is reduced to more manageable groups for processing.

A characteristic of these large data sets is a large number of variables that require a lot of computing resources to process.
Feature extraction is the name for methods that combine variables into features, effectively reducing the amount of data that must be processed, while still accurately and completely describing the original data set.

The process of feature extraction is useful when you need to reduce the number of resources needed for processing without losing important or relevant information.
Feature extraction can also reduce the amount of redundant data for a given analysis.
Also, the reduction of the data and the machine’s efforts in building variable combinations (features) facilitate the following learning and generalization steps in the statistical learning process.

29
Q

Principal Components Analysis

A

A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables (called principal components).

The transformation is defined in such a way that the first principal component accounts for the greatest possible amount of variability in the data, and each succeeding component in turn has the highest variance possible subject to the constraint that it is orthogonal to the preceding components.

The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set.

PCA is often used as a dimensionality reduction technique, by using only the first few principal components (which capture the greatest amount of variability in the data) as inputs.
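
A minimal NumPy sketch of PCA via the singular value decomposition of the centred data matrix; the example data and names are assumptions.

```python
import numpy as np

def pca(X, n_components=2):
    """Return the first n_components principal component scores of X."""
    Xc = X - X.mean(axis=0)                  # centre each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]           # orthogonal loading vectors
    scores = Xc @ components.T               # uncorrelated derived variables
    explained_var = (S ** 2) / (len(X) - 1)  # variance captured per component
    return scores, components, explained_var[:n_components]

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
scores, comps, var = pca(X, n_components=2)
print(scores.shape, var)                     # (50, 2), variances in decreasing order
```
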

30
Q

4 Steps involved in Supervised PCA

A

Compute standard regression coefficients for each feature. This gives an indication of the “correlation” of each feature with the response.

Form a reduced data matrix consisting of only those features whose univariate coefficient exceeds a threshold θ in absolute value (θ is estimated by cross-validation).

Compute the first (or first few) principal components of the reduced data matrix.

Use these principal component(s) in a regression model to predict the outcome.
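
A rough sketch of these four steps with NumPy and scikit-learn; the threshold θ is fixed here rather than chosen by cross-validation, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def supervised_pca(X, y, theta, n_components=1):
    # 1. Univariate regression coefficient of y on each (standardised) feature
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    coefs = Xs.T @ (y - y.mean()) / len(y)

    # 2. Keep only the features whose coefficient exceeds theta in absolute value
    keep = np.abs(coefs) > theta
    X_red = Xs[:, keep]

    # 3. First principal component(s) of the reduced data matrix
    U, S, Vt = np.linalg.svd(X_red - X_red.mean(axis=0), full_matrices=False)
    Z = (X_red - X_red.mean(axis=0)) @ Vt[:n_components].T

    # 4. Regress the outcome on the supervised principal component(s)
    model = LinearRegression().fit(Z, y)
    return model, keep, Z
```
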

31
Q

Maximal Margin Classifier

A

In the case of linearly separable data, we can use the optimal separating hyperplane to construct a maximum margin classifier.

The hyperplane, defined as {x: w·x + b = 0} will perfectly split the data from each of the two classes (as they are linearly separable), and we can use a simple classification rule as:

G(x) = sign[ w·x + b ]

32
Q

Problem with a maximum margin classifier:

A

The existence of an optimal separating hyperplane cannot be guaranteed.
Or, even if it does exist, the data might be noisy, meaning that the maximal margin classifier provides a poor solution.

33
Q

Define Variable selection

A

Variable selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.

34
Q

Feature selection techniques are used for 4 reasons:

A
  • Simplification of models to make them more interpretable.
  • To shorten training times.
  • To avoid the curse of dimensionality.
  • To enhance generalization by reducing overfitting.
35
Q

Best subset selection

A

When performing best subset selection, we fit a separate least squares regression for each possible combination of the p predictors. I.e. we fit all p models that contain exactly one predictor, all ₚC₂ = p(p-1)/2 models that contain exactly two predictors, and so forth.

The problem of selecting the best model from among the 2ᵖ possibilities considered by best subset selection is not trivial.

36
Q

3 Steps of the Best Subset Selection Algorithm

A
  • Let M₀ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
  • For k = 1, 2, …, p:
      - Fit all ₚCₖ models that contain exactly k predictors.
      - Pick the best among these ₚCₖ models and call it Mₖ. Here best is defined as having the smallest RSS, or equivalently the largest R².
  • Select a single best model from among M₀, …, Mₚ using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².
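
An illustrative brute-force version of these steps using itertools and NumPy (not from the notes): the best model of each size is chosen by RSS, and the final choice among M₀, …, Mₚ is left to a criterion such as CV or BIC.

```python
import itertools
import numpy as np

def best_subset(X, y):
    """Return, for each model size k, the best (lowest-RSS) subset of columns."""
    n, p = X.shape
    best = {0: ((), ((y - y.mean()) ** 2).sum())}     # M0: null model, predicts the mean
    for k in range(1, p + 1):
        for subset in itertools.combinations(range(p), k):
            Xk = np.column_stack([np.ones(n), X[:, subset]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = ((y - Xk @ beta) ** 2).sum()
            if k not in best or rss < best[k][1]:
                best[k] = (subset, rss)               # Mk: best model with k predictors
    return best                                       # choose among these with CV / AIC / BIC
```
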
37
Q

Limitations of Best Subset Selection

A

While best subset selection is simple and conceptually appealing, it suffers from computational limitations.

The number of possible models that must be considered grows rapidly as p increases. In general, there are 2^p models that involve subsets of p predictors.
Consequently, best subset selection becomes computationally infeasible for values of p greater than ~40.

38
Q

Forward Stepwise selection

A

Forward Stepwise selection begins with a model containing no predictors, then adds predictors to the model, one-at-a-time, until all of the predictors are in the model.

In particular, at each step, the variable that gives the greatest additional improvement to the fit is added to the model.

39
Q

3 Steps of the Forward Stepwise Selection algorithm

A
  • Let M₀ denote the null model, which contains no predictors.
  • For k = 0, …, p-1:
      - Consider all p - k models that augment the predictors in Mₖ with one additional predictor.
      - Choose the best among these p - k models, and call it Mₖ₊₁. Here best is defined as having the smallest RSS or the highest R².
  • Select a single best model from among M₀, …, Mₚ using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².
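
A sketch of this greedy forward search (RSS-based, with the final choice among M₀, …, Mₚ again left to a criterion such as cross-validation); all names are illustrative.

```python
import numpy as np

def forward_stepwise(X, y):
    """Greedily add the predictor that most reduces the RSS at each step."""
    n, p = X.shape
    chosen, path = [], []
    for _ in range(p):
        best_rss, best_j = None, None
        for j in range(p):
            if j in chosen:
                continue
            cols = chosen + [j]
            Xk = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = ((y - Xk @ beta) ** 2).sum()
            if best_rss is None or rss < best_rss:
                best_rss, best_j = rss, j
        chosen.append(best_j)
        path.append((tuple(chosen), best_rss))   # M1, M2, ..., Mp
    return path                                  # pick one model via CV / Cp / BIC
```
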
40
Q

Backward Stepwise selection

A

Backward Stepwise selection begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one-at-a-time.

41
Q

3 Steps of the Backward Stepwise selection algorithm

A
  • Let Mₚ denote the full model, which contains all p predictors.
  • For k = p, p-1, …, 1:
      - Consider all k models that contain all but one of the predictors in Mₖ, for a total of k - 1 predictors.
      - Choose the best among these k models, and call it Mₖ₋₁. Here best is defined as having the smallest RSS or the highest R².
  • Select a single best model from among M₀, …, Mₚ using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².
42
Q

Contrast forward and backward stepwise selection

A

Backward stepwise selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise selection can be used even when n < p.

43
Q

What is the purpose of regularisation?

A

To create a less complex (parsimonious) model when dealing with a large number of features.

44
Q

Define what is meant by a PGM

A

A Probabilistic Graphical Model is
… a probabilistic model
… for which a graph expresses the conditional dependence structure
… between a set of random variables.

45
Q

2 Main types of PGMs

A
  • Directed PGMs
  • Undirected PGMs

46
Q

3 Properties of Undirected PGMs

A
  • The edges in the graph have no directions
  • Only dependence is indicated.
  • No causality can be inferred.
47
Q

2 Properties of directed PGMs

A
  • The edges in the graph have directions.
  • The directions indicate causality and dependence.

48
Q

Main use of PGMs

A

They are used to investigate the joint distribution of a set of random variables.

49
Q

Define (in the context of a PGM):

Adjacent vertices

A

Two vertices, X and Y, are called adjacent if there is an edge joining them.

This is denoted by X ~ Y.

50
Q

Define (in the context of a PGM):

A path

A

A path X₁, X₂, …, Xₙ is a set of vertices that are joined.

I.e. Xᵢ₋₁ ~ Xᵢ for i = 2, …, n

51
Q

Define (in the context of a PGM):

A complete graph

A

A complete graph is a graph with every pair of vertices joined by an edge.

52
Q

Define (in the context of a PGM):

A sub-graph

A

A subgraph U ⊆ V is a subset of vertices together with their edges.

53
Q

Define (in the context of a PGM):

A separator

A

If A, B, and C are subgraphs, then C is said to separate A and B if every path between A and B intersects a node in C.

54
Q

Define (in the context of a PGM):

The pairwise Markov property of a graph

A

The absence of an edge between X and Y in a graph G, implies conditional independence given the rest of the vertices in G.

I.e. No edge joining X and Y <=> X⟂Y | rest

55
Q

Define (in the context of a PGM):

The Global Markov properties of a graph

A

In a Markov graph G with subgraphs A, B and C,

if C separates A and B
then
A ⊥ B | C.

56
Q

Define (in the context of a PGM):

A clique

A

A clique is a complete subgraph - a set of vertices that are all adjacent to one another.

57
Q

Define (in the context of a PGM):

A maximal clique

A

A clique is called maximal if it is a clique and no other vertices can be added to it and still yield a clique.

58
Q

The potential of a clique

A

A probability density function f over a Markov graph G can be represented as

f(x) = (1/Z) ∏_{c ∈ C} ψ_c(x_c)

where C is the set of maximal cliques, and the positive functions ψ_c(·) are called clique potentials.

59
Q

The partition function

A

The quantity

Z = ∑_x ∏_{c ∈ C} ψ_c(x_c)

is the normalizing constant for f(x), also known as the partition function.

60
Q

A pairwise Markov graph

A

For a pairwise Markov graph, there is:

  • A potential function for each edge
  • At most second order interactions are represented.
61
Q

Advantages of Markov Graphs

A
  • more parsimonious in terms of parameters,
  • easier to work with,
  • give the minimal complexity implied by the graph structure.

The models for both continuous and discrete data are functions of only the pairwise marginal distributions of the variables represented in the edge set.

62
Q

Discuss preconditioning as a method of improving the performance of a variable selection approach such as the lasso.

A

Variable selection techniques can be adversely affected by the noise present in a dataset.

Preconditioning involves reducing the noise in the dataset such that the lasso can perform better.

One possible method of preconditioning is to use supervised principal components.

Apply the usual supervised principal component algorithm and consider the m principal components obtained from this method. Let Z = [z₁, z₂, …, zₘ] be the matrix consisting of the m supervised principal components.

Now, to reduce the noise present in the dataset, regress y onto z₁ (or onto Z) and obtain the fitted values ŷ.

The lasso can then be applied with ŷ taken to be the response and X the predictors.

We are then interested in the variables that are selected from this process.

De-noising y is expected to enhance the performance of the lasso.

Why would this work?

  • Overfitting is prevented by using the denoised response variable
  • The tuning parameter λ is objectively chosen to obtain sparsity.

Preconditioning is not limited to using supervised PCA.
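
A rough end-to-end sketch of this preconditioning idea with NumPy and scikit-learn; the threshold is an arbitrary placeholder, the supervised-PC step is simplified to a single component, and the cross-validated lasso stands in for a tuned λ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV

def preconditioned_lasso(X, y, theta=0.1):
    # Supervised PC step: keep the features whose univariate coefficient exceeds theta
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    coefs = Xs.T @ (y - y.mean()) / len(y)
    X_red = Xs[:, np.abs(coefs) > theta]

    # First principal component of the reduced matrix
    U, S, Vt = np.linalg.svd(X_red - X_red.mean(axis=0), full_matrices=False)
    z1 = (X_red - X_red.mean(axis=0)) @ Vt[0]

    # De-noise the response: fitted values from regressing y on z1
    y_hat = LinearRegression().fit(z1.reshape(-1, 1), y).predict(z1.reshape(-1, 1))

    # Apply the lasso with the de-noised response and the full predictor set
    lasso = LassoCV(cv=5).fit(X, y_hat)
    return np.flatnonzero(lasso.coef_)        # indices of the selected variables
```
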

63
Q

How does a Neural Network compare to using basis function expansions?

A

The neurons in the activation outputs, o, are simply calculating basis expansions of the original features in X.

The subtle difference between the two methods is that for NNs the parameters of the basis functions are learnt from the data.

64
Q

What is meant by a linear basis function expansion?

A

The core idea is to augment/replace the vector of inputs X with additional variables, which are transformations of X, and then use linear models in this new space of derived input features.

Once the basis functions hₘ have been determined, the models are linear in these new variables, and the fitting proceeds as before.
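
For instance, a quadratic-plus-interaction expansion of two inputs followed by an ordinary linear fit in the derived features; this is a generic sketch with made-up data, not an example from the notes.

```python
import numpy as np

def expand_basis(X):
    """Map (x1, x2) to the derived features (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = 1.0 + X[:, 0]**2 - 2 * X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=200)

H = np.column_stack([np.ones(len(X)), expand_basis(X)])  # linear model in the h_m(X)
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
print(np.round(beta, 2))   # recovers roughly [1, 0, 0, 1, 0, -2]
```
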

65
Q

Cubic spline models

A

Cubic spline models fit piecewise cubic polynomials to the data, subject to 3 constraints:

  • The fit should be continuous at the knots.
  • The fit should have continuous first derivatives at the knots.
  • The fit should have continuous second derivatives at the knots.
66
Q

Natural Cubic Spline

A

A natural cubic spline adds additional constraints, namely that the function is linear beyond the boundary knots.

This frees up four degrees of freedom (two constraints each in both boundary regions), which can be spent more profitably by sprinkling more knots in the interior region.

67
Q

How might a natural cubic spline be fit in a binary classification context?

A

To implement natural cubic splines in a binary classification context, the natural cubic spline can be used in the usual regression setting, but with the response Y ∈ {-1, 1}.

The results from such a model can be quite difficult to interpret, therefore a better alternative is to use logistic regression, with a basis function expansion.

68
Q

Smoothing splines

A

Smoothing splines avoid the knot selection problem completely by using a maximal set of knots.

The complexity of the fit is controlled by regularization.

69
Q

Perceptron learning algorithm

A

The perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.

If a response yᵢ = 1 is misclassified, then xᵢ’β + β₀ < 0, and the opposite for a misclassified response with yᵢ = −1.

SEE NOTES
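
A minimal sketch of the classic stochastic perceptron update (learning rate, stopping rule and names are assumptions; labels are taken to be ±1):

```python
import numpy as np

def perceptron(X, y, lr=1.0, epochs=100):
    """Perceptron updates: adjust beta, beta_0 only on misclassified points."""
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ beta + beta0) <= 0:     # misclassified (or on the boundary)
                beta += lr * yi * xi              # move the boundary toward xi
                beta0 += lr * yi
    return beta, beta0
```
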

70
Q

3 Problems with the perceptron learning algorithm

A
  • When the data are separable, there are many solutions, and which one is found depends on the starting values.
  • The “finite” number of steps can be very large. The smaller the gap, the longer the time to find it.
  • When the data are not separable, the algorithm will not converge, and cycles develop. The cycles can be long and therefore hard to detect.
71
Q

Define what is meant by linearly separable training data in a binary classification problem

A

Two sets of points are linearly separable if there exists at least one hyperplane in the space (or a line in the two-dimensional case) with all the points from either class on the same side of the plane.

I.e., the points from the different classes are able to be perfectly separated by some linear boundary.

72
Q

Compare linear logistic regression and linear discriminant analysis.

A

LDA fits two normal density functions (one for each class), which create a linear boundary where they intersect, whereas logistic regression only models the log-odds (logit) function between the two classes.

73
Q

Compare linear logistic regression and linear discriminant analysis.

Parameter estimation

A

BLR:
Based on Maximum likelihood estimation.

LDA:
Based on Least squares estimation; equivalent to linear regression with binary predictand.

74
Q

Generalized Linear Models

A

A generalized linear model is a flexible generalization of ordinary linear regression that allows for response variables that have error distributions other than a normal distribution.

This is done by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

75
Q

3 Elements of a GLM

A
  • A probability distribution from the exponential family
  • A linear predictor η = Xβ
  • A link function g such that E(Y) = μ = g⁻¹(η)
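
As an illustration of these three elements, a Poisson GLM with a log link fitted via statsmodels; the data are synthetic and the coefficient values are arbitrary.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
eta = 0.3 + 0.8 * X[:, 0] - 0.5 * X[:, 1]          # linear predictor eta = X beta
y = rng.poisson(np.exp(eta))                       # Poisson response, log link: mu = exp(eta)

model = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson())
result = model.fit()
print(result.params)                               # estimates of (beta0, beta1, beta2)
```
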
76
Q

Any thoughts on variable selection in high-dimensional problems?

A

When modelling a highly dimensional problem with a relatively small dataset, overfitting becomes a major concern. Therefore, heavy / rigid structural assumptions as well as variable selection are needed to provide sensible results in such cases.

In this setting, very few predictors may truly have a significant influence.

Variable selection can thus help make the model:

  • simpler,
  • easier to interpret,
  • computationally more efficient, and
  • lower in variance.
77
Q

Forward-stagewise regression

A

Forward-stagewise regression (FS) is even more constrained than forward-stepwise regression.

It starts like forward-stepwise regression, with an intercept equal to ȳ, and centered predictors with coefficients initially all 0.

At each step the algorithm identifies the variable most correlated with the current residual.

It then computes the simple linear regression coefficient of the residual on this chosen variable, and then adds it to the current coefficient for that variable.

This is continued till none of the variables have correlation with the residuals—i.e. the least-squares fit when N > p.

Unlike forward-stepwise regression, NONE OF THE OTHER VARIABLES ARE ADJUSTED when a term is added to the model.

As a consequence, forward stagewise can take many more than p steps to reach the least squares fit, and historically has been dismissed as being inefficient.

It turns out that this “slow fitting” can pay dividends in high-dimensional problems.

78
Q

Compare step-wise approaches to the best-subset variable selection approach.

A

Rather than searching through all possible subsets of predictors, stepwise approaches take a more constrained search:

  • They explore a far more restricted set of models.
  • They are greedy algorithms, as a particular predictor can be included or dropped without considering the future steps of the procedure.

Therefore, stepwise procedures are unlikely to find the same “best” subset of predictors as the best-subset approach.

However, an advantage is that stepwise procedures are much less computationally expensive as they search over a highly restricted set of models.

79
Q

What is meant by L1 regularization?

A

L1 regularization modifies the loss function by adding some multiple of the sum of the absolute magnitudes of the model parameters as a penalty term to the loss function.

I.e. the regularization term in the cost function becomes:
J(𝜷) = ||𝜷||₁ = Σⱼ |βⱼ|

When applied to regression, the procedure obtained when using an L1 penalty term is referred to as Lasso regression.
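
A sketch of the resulting penalised objective; the loss here is a plain residual sum of squares and λ is an arbitrary tuning value.

```python
import numpy as np

def lasso_cost(beta, X, y, lam):
    """RSS plus an L1 penalty on the coefficients (intercept excluded for simplicity)."""
    rss = ((y - X @ beta) ** 2).sum()
    return rss + lam * np.abs(beta).sum()
```
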

80
Q

What is meant by L2 regularization?

A

L2 regularization involves adding a regularisation term to the loss function used when optimizing the parameters of some regression model.

L2 regularization modifies the loss function by adding some multiple of the sum of the squared magnitudes of the model parameters as a penalty term to the loss function.

I.e. the regularization term in the cost function becomes:
J(𝜷) = ||𝜷||₂² = Σⱼ βⱼ²

When applied to regression, the procedure obtained when using an L2 penalty term is referred to as Ridge regression.
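
Ridge regression has a closed-form solution; a minimal sketch for centred data is shown below, with λ an arbitrary tuning value and the intercept left unpenalised (and omitted).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimise ||y - X beta||^2 + lam * ||beta||^2 (closed-form solution)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```
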

81
Q

Discuss the choice of k in k-NN

A

In general, the optimal value for k will depend on the bias-variance tradeoff.

A small value for k will provide the most flexible fit, which will have low bias but high variance.
This variance is due to the fact that the prediction in a given region is entirely dependent on just one observation.

In contrast, larger values of k provide a smoother and less variable fit; the prediction in a region is an average of several points, and so the influence of a single observation is smaller.

82
Q

Compare the assumptions of k-NN to those of linear regression

A

In contrast to the strong assumptions made by the linear model, k-NN makes no assumptions about the functional form of the problem being solved.

Therefore, k-NN is referred to as a non-parametric machine learning algorithm, whilst linear regression is an example of a parametric approach as it assumes a linear functional form for f(X).

83
Q

4 Reasons for using k-NN above linear regression

A
  • Non-linearity.
    If the relationship between X and y are known to be non-linear, k-NN would be an obviously better choice.
  • Versatility
    Linear regression is not very useful for classification, whereas k-NN is easily applied in a classification setting.
  • Accuracy.
    k-NN may achieve higher accuracy than a linear model, which can only output a linear combination of the input features and cannot represent complex, higher-order relationships.
  • Easy to understand.
    Not that linear regression is particularly difficult to explain, but the concept of distance from some observation is natural and (arguably) easier to explain than multiple linear regression in a highly dimensional problem.
84
Q

5 Reasons for using linear regression above k-NN

A
  • Small dataset (small n).
    If there is a small number of observations per predictor, a parametric method, like linear regression, might work better.
  • Highly dimensional dataset (large p).
    Thanks to their rigid assumptions, linear models are less likely to overfit in high dimensional datasets.
  • Interpretability.
    A k-NN lacks the same interpretability of a linear model (i.e. coefficients indicating the influence of each parameter on the response). It fails to describe our underlying data as aptly as linear regression would.
  • Speed, cost and performance.
    A k-NN is computationally expensive, both in terms of memory and computation.
    Storage: To store a k-NN model would involve storing all observations from the training data. In comparison, linear regression only stores a coefficient for each feature in the input dataset, together with a bias, resulting in far lower storage requirements.
    Computation: To classify a new example would require comparing the example to every observation in the training set, which is much more expensive than simply multiplying the features by a set of coefficients - as would be the case for linear regression.
  • Model robustness.
    If a small value of k is used, k-NN will be sensitive to irrelevant features.
85
Q

Compare the properties of the two procedures obtained from L1 and L2 regularization schemes.

A

Ridge regression does a proportional shrinkage.

Lasso translates each coefficient by a constant factor λ, truncating at zero.

Ridge regression generally only shrinks the variable coefficients to values close to zero. The Lasso, on the other hand, can set coefficients equal to zero - thereby performing variable selection.

86
Q

Many of the supervised learning procedures fall somewhere between linear least squares and a nearest-neighbour approach.

Discuss this statement, referring to examples of other procedures.

A

At the one end of the spectrum, linear least squares makes very strong structural assumptions about the distribution of the data and the errors.

At the other end, k-Nearest Neighbours (especially with small values of k), makes no assumptions about the functional form of the problem being solved, and is entirely non-parametric.

We can see where different models fall on this spectrum by considering the assumptions made by such models.

In descending order of assumption strength:

  • linear regression
  • generalized linear models, which try to relax the OLS assumption that the errors are normally distributed. Logistic regression also falls in the category of GLMs,
  • local regression tries to relax the assumption of “global” linearity made by OLS and instead opts for only local linearity in the immediate neighbourhood of some point,
  • Basis function expansions and splines also try to further relax the linearity assumption made by OLS.
  • The Support Vector Machine also tries to allow for non-linearity through the use of kernels.

Up to this point, all methods still assume (at least to some extent) a specific “structural form” dictating the assumed distribution of the underlying data.

  • Deep Neural Networks further relax these assumptions about the form of the model. They are “Universal Function Approximators”, and are able to approximate any deterministic function to an arbitrary degree of accuracy. They do still, however, assume that the observations were generated by some deterministic function.
  • As mentioned, at the very end of this spectrum is k-Nearest Neighbours, which makes no assumptions about the form of the underlying distribution.
87
Q

Contrast the single key assumption made by k-NN with that made by least squares

A
  • Least squares assumes f(x) is well approximated by a GLOBALLY LINEAR function.
  • k-Nearest neighbours assumes f(x) is well approximated by a LOCALLY CONSTANT function
88
Q

List 5 ways in which the simple procedures, kNN and linear least squares have been enhanced.

A
  • Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors.
  • In high-dimensional spaces the distance kernels are modified to emphasize some variables more than others.
  • Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.
  • Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.
  • Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.
89
Q

Discuss a Bayesian interpretation of the Penalised Residual Sum of Squares.

A

The penalised residual sum of squares can be cast into a Bayesian framework where we think of:

  • The unpenalised residual sum of squares as the log-likelihood of the data we observe.
  • The penalty function |β|^q as the log-prior density for β.

The penalised RSS can then be viewed as a compromise between the observed data and the prior, i.e. a posterior estimate of β.

90
Q

The bayesian prior corresponding to the lasso

A

An independent double exponential (or Laplace) distribution for each input, with density
( 1 / 2t ) exp( − |β| / t )
and
t = 1/λ.

91
Q

The bayesian prior corresponding to ridge regression

A

The prior for each parameter in β is N(0, τ²).

The ridge estimate then follows as the mode of the posterior, and since the posterior is Gaussian, it is also the posterior mean.

92
Q

Describe the multivariate linear regression approach to the multiclass classification problem.

A

Here each of the response categories is coded via an indicator variable.

Thus if G has K classes, there will be K such indicators, Yk, k = 1, …, K,
Yk = 1 if G = k
Yk = 0 otherwise

These are collected together in a vector Y = (Y1, …, YK), and the N training instances of these form an N x K indicator response matrix Y.

Y is a matrix of 0s and 1s, with each row having a single 1.

We fit a linear regression model to each of the columns of Y simultaneously, and the fit is given by:

Y_hat = X ( X’X )⁻¹ X’Y

Note that we have a coefficient vector for each response column yk, and hence a (p+1)xK coefficient matrix:

B_hat = ( X’X )⁻¹ X’Y

A new observation with input x is classified as follows:

  • compute the fitted output f_hat(x)’ = (1, x’) B_hat, a K vector
  • identify the largest component and classify accordingly: G_hat(x) = argmax_{k in G} f_hat_k(x)
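
A direct NumPy transcription of this procedure (toy class labels g are assumed to be integers 0, …, K-1; all names are illustrative):

```python
import numpy as np

def indicator_regression_fit(X, g, K):
    """Fit B_hat = (X'X)^{-1} X'Y with Y the N x K indicator response matrix."""
    N = X.shape[0]
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                      # one 1 per row, in column g_i
    X1 = np.column_stack([np.ones(N), X])         # include the intercept
    B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return B_hat                                  # (p+1) x K coefficient matrix

def indicator_regression_predict(B_hat, x_new):
    f_hat = np.concatenate(([1.0], x_new)) @ B_hat   # fitted K-vector
    return int(np.argmax(f_hat))                     # classify to the largest component
```
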
93
Q

For subset selection, we choose the best model using: (4)

A
  • Akaike information criterion (AIC)
  • Bayesian information criterion (BIC)
  • Cross-validation
  • Mallows’s Cp
94
Q

Comment on the applicability of the usual standard error formulas following (stepwise) variable selection

A

The standard errors obtained from the final model resulting from the stepwise procedures cannot be used, since they do not account for the search process.

They must be calculated using a bootstrap procedure.

95
Q

Discuss the underlying rationale for the linear regression approach to multiclass classification.

A

View the regression as an estimate of conditional expectation.

For the random variable Yk,
E(Yk|X = x) = Pr(G = k|X = x),
so conditional expectation of each of the Yk seems a sensible goal.

96
Q

Discuss the main problem with the linear regression approach to classification.

A

f_hat(x) can be negative or greater than 1.

97
Q

3 Assumptions of binary logistic regression

A

Y1, Y2, …, Yn are independently distributed.

Yᵢ ~ Bin(nᵢ, 𝜋ᵢ), i.e. the binary logistic regression model assumes a binomial distribution for the response.

Assumes a linear relationship between the logit of the response and the explanatory variables logit(𝜋) = β₀ + βX.

98
Q

Define the logit function

A

The logit function is simply the logarithm of the odds:

logit(x) = log( x / (1 - x) )
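
In code, together with its inverse (the sigmoid), as a tiny sketch:

```python
import numpy as np

def logit(x):
    """Log-odds of a probability x in (0, 1)."""
    return np.log(x / (1 - x))

def sigmoid(z):
    """Inverse of the logit: maps the real line back to (0, 1)."""
    return 1 / (1 + np.exp(-z))

print(logit(0.75), sigmoid(logit(0.75)))   # ~1.0986, 0.75
```
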

99
Q

Give the mathematical definition of linear separability.

A

Let X0 and X1 be two sets of points in an n-dimensional Euclidean space.

Then X0 and X1 are linearly separable if there exists a vector of real numbers, w, and a scalar, k, such that:

every point x in X0 satisfies:
w · x > k

and every point x in X1 satisfies:
w · x < k

100
Q

Discuss the optimal separating hyperplane

A

The optimal separating hyperplane between two classes separates the two classes and maximizes the distance to the closest point from either class.

Not only does this provide a unique solution to the separating hyperplane problem, but by maximizing the margin between the two classes on the training data, this leads to better classification performance on test data.

101
Q

Discuss the optimisation problem of the optimal separating hyperplane.

A

max_{ β, β₀, ||β|| = 1} M
subject to:
yᵢ ( xᵢ’β + β₀ ) ≥ M

The set of conditions ensures that all points are at least a signed distance M from the decision boundary defined by β and β₀, and we seek the largest such M and the associated parameters.

We can get rid of the ||β|| = 1 constraint by replacing the conditions with:

yᵢ ( xᵢ’β + β₀ ) / ||β|| ≥ M

or equivalently:

yᵢ ( xᵢ’β + β₀ ) ≥ M ||β||

Since for any β and β₀ satisfying these inequalities, any positively scaled multiple satisfies them too, we can arbitrarily set ||β|| = 1/M.

Thus the optimisation problem is equivalent to:

min_{ β, β₀} ||β||² / 2
subject to:
yᵢ ( xᵢ’β + β₀ ) ≥ 1

These constraints define an empty slab or margin around the linear decision boundary of thickness 1 / ||β||

Hence, we choose β and β₀ to maximise its thickness.

102
Q

Optimal separating hyperplane:

Give the Lagrange (primal) function to be minimized,
as well as its solutions.

A

Minimize w.r.t. β, β₀:

Lp = ||β||² / 2 - Σ αᵢ [ yᵢ ( xᵢ’β + β₀ ) - 1 ]

β = Σ αᵢ yᵢ xᵢ
0 = Σ αᵢ yᵢ
103
Q

Discuss the Support Vector Classifier

A

The generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier, where a small proportion of the training sample is allowed to cross the margins.

Rather than looking for the largest possible margin so that every observation is on the correct side of the margin, some observations are allowed to be on the incorrect side of the margins.

The margin is soft as a small number of observations violate the margin.

Softness is controlled by the slack variables which control the position of the observations relative to the margins and separating hyperplane.

104
Q

Give the optimisation problem relating to the support vector classifier

A

Define the slack variables ξ = (ξ1,ξ2,… ,ξN).

The support vector classifier maximises a soft margin.

The optimisation problem becomes:

max_{ β, β₀, ||β|| = 1} M
subject to:
yᵢ ( xᵢ’β + β₀ ) ≥ M ( 1 - ξᵢ )

ξᵢ ≥ 0
Σξᵢ ≤ constant

105
Q

Support Vector Classifier
Give an interpretation of the slack variables:

ξi = 0

A

ξi = 0
=> yᵢ (xᵢ’β + β₀) ≥ 1

Therefore observation i lies on or outside the margin, on the correct side of the decision boundary.

106
Q

Support Vector Classifier
Give an interpretation of the slack variables:

0 < ξi < 1

A

0 < ξi < 1
=> 0 < yᵢ (xᵢ’β + β₀) < 1

Therefore the observation i lies within the margin, on the correct side of the decision boundary.

107
Q

Support Vector Classifier
Give an interpretation of the slack variables:

ξi = 1

A

ξi = 1
=> yᵢ (xᵢ’β + β₀) = 0

Therefore the observation i lies on the decision boundary.

108
Q

Support Vector Classifier
Give an interpretation of the slack variables:

ξi > 1

A

ξi > 1
=> yᵢ (xᵢ’β + β₀) < 0

Therefore the observation i lies on the wrong side of the decision boundary.

109
Q

Support Vector Classifier

Discuss the cost parameter

A

With a small value for C, we can tolerate many ξᵢ > 0. Many points are therefore allowed to be inside the margin, as well as on the other side of the decision boundary. This results in many support points / vectors, and the margin is much wider.

-> The fit will be stable but may have high bias.

With a large value for C, we cannot tolerate many ξᵢ > 0. Fewer points are therefore allowed to be inside the margin, and even fewer on the other side of the decision boundary (where ξᵢ is large). This results in only a few support vectors, and the margin will be small.

This results in a less stable fit but lower bias.

When C = ∞, the optimal separating hyperplane is obtained (only when the data is linearly separable).
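
This effect can be seen directly with scikit-learn's SVC, whose C parameter plays this role; the data below are synthetic, and the exact support-vector counts will vary.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C -> wide (soft) margin and many support vectors; large C -> fewer
    print(C, clf.n_support_)
```
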

110
Q

Discuss the SVM

A

Both the optimal separating hyperplane and the support vector classifier find linear decision boundaries in the original feature space. The idea for SVMs is to use the exact same formulation as for SVCs, but in an enlarged feature space.

We then find a linear decision boundary in the enlarged feature space which then results in a non-linear decision boundary in the original feature space.

111
Q

What is the motivation for supervised Principal Components Analysis

A

One shortcoming of using PCA in a supervised learning problem, is that the principal components (which are linear combinations of input features, xi) are selected, in turn, as to capture the greatest possible variance in the input data.

The procedure, however, takes no consideration of the influence of any particular input variable, xi, on the response, yi.

This is the main motivation behind supervised principal components analysis.

112
Q

4 Steps of Supervised PCA

A

Assuming that there are p features measured on N observations. Let X be an N x p matrix of feature measurements, and let y be the N-vector of outcome measurements.

Supervised PCA follows 4 steps:

  • Compute standard regression coefficients for each feature. This gives an indication of the “correlation” of each feature with the response.
  • Form a reduced data matrix consisting of only those features whose univariate coefficient exceeds a threshold θ in absolute value (θ is estimated by cross-validation).
  • Compute the first (or first few) principal components of the reduced data matrix.
  • Use these principal component(s) in a regression model to predict the outcome.
113
Q

Regularization:
J(𝜷) = λ Σⱼ |βⱼ|^q
Comment on the possibility of using other forms of J(B), instead of L1 and L2 regularisation.

A

The case q = 1 (lasso) is the smallest q such that the constraint region is convex.

Non-convex constraint regions make the optimization problem more difficult.

ELASTIC-NET

With q > 1, |βⱼ|^q is differentiable at 0, and so does not share the ability of the lasso to set coefficients exactly to zero.

Partly for this reason, as well as for computational tractability, Zou and Hastie introduced the elastic-net penalty:

λ Σⱼ ( αβⱼ² + (1 - α) |βⱼ| )

The elastic-net selects variables like the lasso, and shrinks together the coefficients of correlated predictors like the ridge.

It also has considerable advantages over the Lq penalties.
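
scikit-learn exposes this penalty directly; note that its l1_ratio weights the L1 term, so its role differs from the α in the formula above. The data and parameter values are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
beta = np.zeros(10)
beta[:3] = [2.0, -1.5, 1.0]                                # sparse true coefficients
y = X @ beta + rng.normal(scale=0.5, size=100)

# alpha is the overall penalty strength, l1_ratio the mix between the L1 and L2 parts
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(enet.coef_, 2))   # coefficients shrunk, with some set exactly to zero
```
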

114
Q

Discuss the starting values (initialization) for the weights of an artificial neural network.

A

Note that if the weights are near zero, then the operative part of the sigmoid is roughly linear, and hence the neural network collapses into an approximately linear model.

Usually starting values for weights are chosen to be random values near zero.

Hence the model starts out linear and becomes nonlinear as the weights increase.

Individual units localize to directions and introduce nonlinearities where needed.

Use of exact zero weights leads to zero derivatives and perfect symmetry, and the algorithm never moves.

Starting instead with large weights often leads to poor solutions.

115
Q

Discuss the isssue of overfitting w.r.t. neural networks

A

Often neural networks have too many weights and will overfit the data at the global minimum of R.

116
Q

Preventing Overfitting in Neural Networks:

Early Stopping

A

In early developments of neural networks, an early stopping rule was used to avoid overfitting. Here we train the model only for a while, and stop well before we approach the global minimum.

Since the weights start at a highly regularized (linear) solution, this has the effect of shrinking the final model toward a linear model.

A validation dataset is useful for determining when to stop, since we expect the validation error to start increasing.

117
Q

Preventing Overfitting in Neural Networks:

Weight Decay

A

A more explicit method for regularization is weight decay, which is analogous to ridge regression for linear models.

We add a penalty to the error function, giving R(θ) + λJ(θ), where λ ≥ 0 is a tuning parameter and

J(θ) = Σ βₖₘ² + Σ αₘₗ²

is the sum of the squares of all the weights in the network.

Typically cross-validation is used to estimate λ.
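
A sketch of this penalised objective for a one-hidden-layer network; it is purely illustrative, and λ would in practice be chosen by cross-validation.

```python
import numpy as np

def penalised_error(y, y_hat, alpha, beta, lam):
    """Sum-of-squares error plus a weight-decay penalty on all network weights."""
    R = ((y - y_hat) ** 2).sum()                       # R(theta)
    J = (beta ** 2).sum() + (alpha ** 2).sum()         # J(theta) = sum beta^2 + sum alpha^2
    return R + lam * J
```
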

118
Q

Issues in training neural networks:

Scaling of the inputs

A

Since the scaling of the inputs determines the effective scaling of the weights in the bottom layer, it can have a large effect on the quality of the final solution.

At the outset it is best to standardize all inputs to have mean zero and standard deviation one.

This ensures all inputs are treated equally in the regularization process, and allows one to choose a meaningful range for the random starting weights.

119
Q

Issues in training neural networks:

Number of hidden units

A

In general, it is better to have too many hidden units than too few.

With too few hidden units, the model might not have enough flexibility to capture the nonlinearities in the data.

With too many hidden units, the extra weights can be shrunk towards zero if appropriate regularization is used.

Typically, the number of hidden units is somewhere in the range of 5 - 100, with the number increasing with the number of inputs and number of training cases.
It is common to put down a reasonably large number of units and train them with regularization.