Exam Questions Flashcards

1
Q

Define what is meant by supervised learning

A
Supervised learning is 
... the machine learning task of 
... learning a function 
... that maps an input to an output 
... based on example input-output pairs.
2
Q

Discuss the training data used in supervised learning

A

Supervised learning infers a function from a set of LABELED training data.

Each example is a pair consisting of

  • an input object (typically a vector) and
  • a desired output value (also called the supervisory signal).
3
Q

Discuss the desired result of a supervised learning algorithm

A

A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.

An optimal scenario will allow for the algorithm to correctly determine the class labels for UNSEEN instances.

This requires the learning algorithm to generalize from the training data to unseen situations in a “reasonable” way.

4
Q

Define unsupervised learning

A

Unsupervised learning is a branch of machine learning that learns from data that has not been labeled, classified or categorized.

Instead of responding to feedback, unsupervised learning
… IDENTIFIES COMMONALITIES in the data and
… reacts based on the presence or absence of such commonalities in each new piece of data.

5
Q

2 Main classes of supervised learning problems

A
  • classification
  • regression

6
Q

Supervised learning:

Define classification

A

Classification is the problem of
identifying to which of a set of categories (sub-populations)
a new observation belongs,
on the basis of a training set of data
containing observations (or instances) whose category membership is known.

7
Q

Supervised learning:

Define regression analysis

A

Regression analysis is a set of statistical processes for estimating the relationships among variables.

Regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed.

Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables.

8
Q

2 of the objectives of supervised learning

A
  • Inference
  • Prediction

9
Q

Define statistical inference

A

Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution.

Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates.

It is assumed that the observed data set is sampled from a larger population.

10
Q

Define predictive inference

A

Predictive inference is an approach to statistical inference that emphasizes the prediction of future observations based on past observations.

11
Q

Describe least squares linear regression

A

Ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model.

OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being predicted) in the given dataset and those predicted by the linear function.
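
As a minimal illustration (not part of the original cards), the OLS estimate can be computed directly from the least squares criterion; the toy data and variable names below are assumptions.

```python
import numpy as np

# Toy data: n observations, p explanatory variables (assumed example data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=100)

# Add an intercept column and solve the least squares problem.
# lstsq is used rather than an explicit matrix inverse for numerical stability.
X1 = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

residuals = y - X1 @ beta_hat
print(beta_hat, (residuals ** 2).sum())  # estimated coefficients and the RSS being minimized
```
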

12
Q

5 Assumptions of Ordinary Least Squares (OLS) Linear Regression

A
  • Linearity (Correct specification)
  • Strict exogeneity
  • No linear dependence in the regressors
  • Spherical errors
  • Normality
13
Q

Assumptions of Ordinary Least Squares (OLS) Linear Regression:

Linearity (Correct specification)

A

The mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables.

14
Q

Assumptions of Ordinary Least Squares (OLS) Linear Regression:

Normality of the errors

A

The errors have normal distribution conditional on the regressors:

ε | X ~ N(0, σ² Iₙ )

15
Q

Assumptions of Ordinary Least Squares (OLS) Linear Regression:

Lack of perfect multicollinearity in the predictors

A

For standard OLS estimation methods, the matrix X must have full column rank, p.

16
Q

Assumptions of Ordinary Least Squares (OLS) Linear Regression:

Strict exogeneity

A

The errors in the regression should have a conditional expectation of zero: E[ ε | X ] = 0

The immediate consequences of the exogeneity assumption are that:

  • the errors have an expectation of zero: E[ ε ] = 0
  • the regressors are uncorrelated with the errors: E[ Xᵀε ] = 0
17
Q

Assumptions of Ordinary Least Squares (OLS) Linear Regression:

Spherical errors

A

Var[ ε | X ] = σ² Iₙ , where Iₙ is the identity matrix of dimension n and σ² is a parameter which determines the variance of each observation.
This assumption can be split into two parts:

Homoscedasticity:
E[ εᵢ² | X ] = σ², which means that the error term has the same variance σ² in each observation.

No autocorrelation:
The errors are uncorrelated between observations: E[ εᵢ εⱼ | X ] = 0 for i ≠ j .

18
Q

K-nearest neighbours classification

A

A non-parametric method used for classification.

The input consists of the k closest training examples in the feature space.

The output is a class membership.

An object is classified by a majority vote of its neighbours, with the object being assigned to the class most common among its k nearest neighbours.
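
A minimal sketch of majority-vote k-NN classification using only NumPy; the toy data and function names are assumptions, not from the cards.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distances from x_new to every training example
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    votes = Counter(y_train[nearest])        # count the class labels among them
    return votes.most_common(1)[0][0]        # the most common class wins

# Toy usage
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.5, 0.5]), k=3))  # -> 0
```
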

19
Q

K-nearest neighbours regression

A

A non-parametric method used for regression.

The input consists of the k closest training examples in the feature space.

The output is the property value for the object.

The value is the average of the values of its k nearest neighbours.

20
Q

History of Neural Networks:

McCulloch and Pitts

A

1943
Warren McCulloch - a neurophysiologist
Walter Pitts - a mathematician

They wrote a paper on how neurons might function.
They modelled a simple neural network with electrical circuits.

21
Q

History of Neural Networks:

2 Major concepts that preceded Neural Networks

A
  • Threshold logic - converting continuous input to discrete output
  • Hebbian learning - a model of learning based on neural plasticity, proposed in “The Organization of Behaviour” by Donald Hebb in 1949
22
Q

McCulloch-Pitts neuron

A

Takes a weighted sum of some inputs and returns ‘0’ if the result is below a threshold and ‘1’ otherwise.
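
A sketch of this thresholded weighted sum; the weights and threshold below are made-up values chosen to realise an AND gate.

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0

# Example: an AND gate realised with weights (1, 1) and threshold 2
print(mcculloch_pitts([1, 1], [1, 1], 2))  # 1
print(mcculloch_pitts([1, 0], [1, 1], 2))  # 0
```
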

23
Q

History of Neural Networks:

Mark I Perceptron

A

In 1958, Frank Rosenblatt - a psychologist at Cornell - proposed the idea of a Perceptron.

It was a system with a simple input-output relationship, modelled on a McCulloch-Pitts neuron.

24
Q

History of Neural Networks:

Advantage of the Mark I Perceptron

A

Its weights could be ‘learnt’ through successively passed inputs, while minimizing the difference between desired and actual output.

25
Q

History of Neural Networks:

Major shortcoming of the Mark I Perceptron

A

It could only learn to separate linearly separable classes, and thus it wasn’t able to model the simple, but non-linear XOR (exclusive-or) circuit.

26
Q

History of Neural Networks:

Marvin Minsky’s book

A

Marvin Minsky’s book “Perceptrons” argued that Rosenblatt’s single perceptron approach to neural networks could not be translated effectively into multi-layered neural networks.

Evaluating the correct relative values of the weights of neurons spread across layers, based only on the final output, would take a very large (if not infinite) number of iterations.

27
Q

History of Neural Networks:

Backpropagation

A

Backpropagation’s potential for neural networks was first noticed by Paul Werbos, who described its importance in his PhD thesis.

This work was re-discovered by Rumelhart, Hinton and Williams, who republished it in a clear and detailed framework.

28
Q

What is meant by feature extraction?

A

Feature extraction is a process of dimensionality reduction by which an initial set of raw data is reduced to more manageable groups for processing.

A characteristic of these large data sets is a large number of variables that require a lot of computing resources to process.
Feature extraction is the name for methods that combine variables into features, effectively reducing the amount of data that must be processed, while still accurately and completely describing the original data set.

The process of feature extraction is useful when you need to reduce the number of resources needed for processing without losing important or relevant information.
Feature extraction can also reduce the amount of redundant data for a given analysis.
Also, the reduction of the data and the machine’s efforts in building variable combinations (features) facilitate the following learning and generalization steps in the statistical learning process.

29
Q

Principal Components Analysis

A

A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables (called principal components).

The transformation is defined in such a way that the first principal component accounts for the greatest possible amount of variability in the data, and each succeeding component in turn has the highest variance possible subject to the constraint that it is orthogonal to the preceding components.

The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set.

PCA is often used as a dimensionality reduction technique, by using only the first few principal components (which capture the greatest amount of variability in the data) as inputs.
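
A minimal NumPy sketch of PCA via the singular value decomposition of the centred data matrix; the example data and names are assumptions.

```python
import numpy as np

def pca(X, n_components=2):
    """Return the first n_components principal component scores of X."""
    Xc = X - X.mean(axis=0)                  # centre each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]           # orthogonal loading vectors
    scores = Xc @ components.T               # uncorrelated derived variables
    explained_var = (S ** 2) / (len(X) - 1)  # variance captured per component
    return scores, components, explained_var[:n_components]

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
scores, comps, var = pca(X, n_components=2)
print(scores.shape, var)                     # (50, 2), variances in decreasing order
```
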

30
Q

4 Steps involved in Supervised PCA

A

Compute standard regression coefficients for each feature. This gives an indication of the “correlation” of each feature with the response.

Form a reduced data matrix consisting of only those features whose univariate coefficient exceeds a threshold θ in absolute value (θ is estimated by cross-validation).

Compute the first (or first few) principal components of the reduced data matrix.

Use these principal component(s) in a regression model to predict the outcome.
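
A rough sketch of these four steps with NumPy and scikit-learn; the threshold θ is fixed here rather than chosen by cross-validation, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def supervised_pca(X, y, theta, n_components=1):
    # 1. Univariate regression coefficient of y on each (standardised) feature
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    coefs = Xs.T @ (y - y.mean()) / len(y)

    # 2. Keep only the features whose coefficient exceeds theta in absolute value
    keep = np.abs(coefs) > theta
    X_red = Xs[:, keep]

    # 3. First principal component(s) of the reduced data matrix
    U, S, Vt = np.linalg.svd(X_red - X_red.mean(axis=0), full_matrices=False)
    Z = (X_red - X_red.mean(axis=0)) @ Vt[:n_components].T

    # 4. Regress the outcome on the supervised principal component(s)
    model = LinearRegression().fit(Z, y)
    return model, keep, Z
```
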

31
Q

Maximal Margin Classifier

A

In the case of linearly separable data, we can use the optimal separating hyperplane to construct a maximum margin classifier.

The hyperplane, defined as {x: w·x + b = 0} will perfectly split the data from each of the two classes (as they are linearly separable), and we can use a simple classification rule as:

G(x) = sign[ w·x + b ]

32
Q

Problem with a maximum margin classifier:

A

The existence of an optimal separating hyperplane cannot be guaranteed.
Or, even if it does exist, the data might be noisy, meaning that the maximal margin classifier provides a poor solution.

33
Q

Define Variable selection

A

Variable selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.

34
Q

Feature selection techniques are used for 4 reasons:

A
  • Simplification of models to make them more interpretable.
  • To shorten training times.
  • To avoid the curse of dimensionality.
  • To enhance generalization by reducing overfitting.
35
Q

Best subset selection

A

When performing best subset selection, we fit a separate least squares regression for each possible combination of the p predictors. I.e. we fit all p models that contain exactly one predictor, all ₚC₂ = p(p-1)/2 models that contain exactly two predictors, and so forth.

The problem of selecting the best model from among the 2ᵖ possibilities considered by best subset selection is not trivial.

36
Q

3 Steps of the Best Subset Selection Algorithm

A
  • Let M₀ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
  • For k = 1, 2, …, p:
      - Fit all ₚCₖ models that contain exactly k predictors.
      - Pick the best among these ₚCₖ models and call it Mₖ. Here best is defined as having the smallest RSS, or equivalently the largest R².
  • Select a single best model from among M₀, …, Mₚ using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².
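
An illustrative brute-force version of these steps using itertools and NumPy (not from the notes): the best model of each size is chosen by RSS, and the final choice among M₀, …, Mₚ is left to a criterion such as CV or BIC.

```python
import itertools
import numpy as np

def best_subset(X, y):
    """Return, for each model size k, the best (lowest-RSS) subset of columns."""
    n, p = X.shape
    best = {0: ((), ((y - y.mean()) ** 2).sum())}     # M0: null model, predicts the mean
    for k in range(1, p + 1):
        for subset in itertools.combinations(range(p), k):
            Xk = np.column_stack([np.ones(n), X[:, subset]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = ((y - Xk @ beta) ** 2).sum()
            if k not in best or rss < best[k][1]:
                best[k] = (subset, rss)               # Mk: best model with k predictors
    return best                                       # choose among these with CV / AIC / BIC
```
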
37
Q

Limitations of Best Subset Selection

A

While best subset selection is simple and conceptually appealing, it suffers from computational limitations.

The number of possible models that must be considered grows rapidly as p increases. In general, there are 2^p models that involve subsets of p predictors.
Consequently, best subset selection becomes computationally infeasible for values of p greater than ~40.

38
Q

Forward Stepwise selection

A

Forward Stepwise selection begins with a model containing no predictors, then adds predictors to the model, one-at-a-time, until all of the predictors are in the model.

In particular, at each step, the variable that gives the greatest additional improvement to the fit is added to the model.

39
Q

3 Steps of the Forward Stepwise Selection algorithm

A
  • Let M₀ denote the null model, which contains no predictors.
  • For k = 0, …, p-1:
      - Consider all p - k models that augment the predictors in Mₖ with one additional predictor.
      - Choose the best among these p - k models, and call it Mₖ₊₁. Here best is defined as having the smallest RSS or the highest R².
  • Select a single best model from among M₀, …, Mₚ using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².
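
A sketch of this greedy forward search (RSS-based, with the final choice among M₀, …, Mₚ again left to a criterion such as cross-validation); all names are illustrative.

```python
import numpy as np

def forward_stepwise(X, y):
    """Greedily add the predictor that most reduces the RSS at each step."""
    n, p = X.shape
    chosen, path = [], []
    for _ in range(p):
        best_rss, best_j = None, None
        for j in range(p):
            if j in chosen:
                continue
            cols = chosen + [j]
            Xk = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = ((y - Xk @ beta) ** 2).sum()
            if best_rss is None or rss < best_rss:
                best_rss, best_j = rss, j
        chosen.append(best_j)
        path.append((tuple(chosen), best_rss))   # M1, M2, ..., Mp
    return path                                  # pick one model via CV / Cp / BIC
```
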
40
Q

Backward Stepwise selection

A

Backward Stepwise selection begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one-at-a-time.

41
Q

3 Steps of the Backward Stepwise selection algorithm

A
  • Let Mₚ denote the full model, which contains all p predictors.
  • For k = p, p-1, …, 1:
      - Consider all k models that contain all but one of the predictors in Mₖ, for a total of k - 1 predictors.
      - Choose the best among these k models, and call it Mₖ₋₁. Here best is defined as having the smallest RSS or the highest R².
  • Select a single best model from among M₀, …, Mₚ using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².
42
Q

Contrast forward and backward stepwise selection

A

Backward stepwise selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise selection can be used even when n < p.

43
Q

What is the purpose of regularisation?

A

To create a less complex (parsimonious) model when dealing with a large number of features.

44
Q

Define what is meant by a PGM

A

A Probabilistic Graphical Model is
… a probabilistic model
… for which a graph expresses the conditional dependence structure
… between a set of random variables.

45
Q

2 Main types of PGMs

A
  • Directed PGMs
  • Undirected PGMs

46
Q

3 Properties of Undirected PGMs

A
  • The edges in the graph have no directions
  • Only dependence is indicated.
  • No causality can be inferred.
47
Q

2 Properties of directed PGMs

A
  • The edges in the graph have directions.
  • The directions indicate causality and dependence.

48
Q

Main use of PGMs

A

They are used to investigate the joint distribution of a set of random variables.

49
Q

Define (in the context of a PGM):

Adjacent vertices

A

Two vertices, X and Y, are called adjacent if there is an edge joining them.

This is denoted by X ~ Y.

50
Q

Define (in the context of a PGM):

A path

A

A path X₁, X₂, …, Xₙ is a set of vertices that are joined.

I.e. Xᵢ₋₁ ~ Xᵢ for i = 2, …, n

51
Q

Define (in the context of a PGM):

A complete graph

A

A complete graph is a graph with every pair of vertices joined by an edge.

52
Q

Define (in the context of a PGM):

A sub-graph

A

A subgraph U ⊆ V is a subset of vertices together with their edges.

53
Q

Define (in the context of a PGM):

A separator

A

If A, B, and C are subgraphs, then C is said to separate A and B if every path between A and B intersects a node in C.

54
Q

Define (in the context of a PGM):

The pairwise Markov property of a graph

A

The absence of an edge between X and Y in a graph G, implies conditional independence given the rest of the vertices in G.

I.e. No edge joining X and Y <=> X⟂Y | rest

55
Q

Define (in the context of a PGM):

The Global Markov properties of a graph

A

In a Markov graph G with subgraphs A, B and C,

if C separates A and B
then
A ⊥ B | C.

56
Q

Define (in the context of a PGM):

A clique

A

A clique is a complete subgraph - a set of vertices that are all adjacent to one another.

57
Q

Define (in the context of a PGM):

A maximal clique

A

A clique is called maximal if it is a clique and no other vertices can be added to it and still yield a clique.

58
Q

The potential of a clique

A

A probability density function f over a Markov graph G can be represented as

f(x) = (1/Z) ∏_{c ∈ C} ψ_c(x_c)

where C is the set of maximal cliques, and the positive functions ψ_c(·) are called clique potentials.

59
Q

The partition function

A

The quantity

Z = ∑_x ∏_{c ∈ C} ψ_c(x_c)

is the normalizing constant for f(x), also known as the partition function.

60
Q

A pairwise Markov graph

A

For a pairwise Markov graph, there is:

  • A potential function for each edge
  • At most second order interactions are represented.
61
Q

Advantages of Markov Graphs

A
  • more parsimonious in terms of parameters,
  • easier to work with,
  • give the minimal complexity implied by the graph structure.

The models for both continuous and discrete data are functions of only the pairwise marginal distributions of the variables represented in the edge set.

62
Q

Discuss preconditioning as a method of improving the performance of a variable selection approach such as the lasso.

A

Variable selection techniques can be adversely affected by the noise present in a dataset.

Preconditioning involves reducing the noise in the dataset such that the lasso can perform better.

One possible method of preconditioning is to use supervised principal components.

Apply the usual supervised principal component algorithm and consider the m principal components obtained from this method. Let Z = [z₁, z₂, …, zₘ] be the matrix consisting of the m supervised principal components.

Now, to reduce the noise present in the dataset, regress y onto z₁ (or onto Z) and obtain the fitted values ŷ.

The lasso can then be applied with ŷ taken to be the response and X the predictors.

We are then interested in the variables that are selected from this process.

De-noising y is expected to enhance the performance of the lasso.

Why would this work?

  • Overfitting is prevented by using the denoised response variable
  • The tuning parameter λ is objectively chosen to obtain sparsity.

Preconditioning is not limited to using supervised PCA.
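
A rough end-to-end sketch of this preconditioning idea with NumPy and scikit-learn; the threshold is an arbitrary placeholder, the supervised-PC step is simplified to a single component, and the cross-validated lasso stands in for a tuned λ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV

def preconditioned_lasso(X, y, theta=0.1):
    # Supervised PC step: keep the features whose univariate coefficient exceeds theta
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    coefs = Xs.T @ (y - y.mean()) / len(y)
    X_red = Xs[:, np.abs(coefs) > theta]

    # First principal component of the reduced matrix
    U, S, Vt = np.linalg.svd(X_red - X_red.mean(axis=0), full_matrices=False)
    z1 = (X_red - X_red.mean(axis=0)) @ Vt[0]

    # De-noise the response: fitted values from regressing y on z1
    y_hat = LinearRegression().fit(z1.reshape(-1, 1), y).predict(z1.reshape(-1, 1))

    # Apply the lasso with the de-noised response and the full predictor set
    lasso = LassoCV(cv=5).fit(X, y_hat)
    return np.flatnonzero(lasso.coef_)        # indices of the selected variables
```
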

63
Q

How does a Neural Network compare to using basis function expansions?

A

The neurons in the activation outputs, o, are simply calculating basis expansions of the original features in X.

The subtle difference between the two methods is that for NNs the parameters of the basis functions are learnt from the data.

64
Q

What is meant by a linear basis function expansion?

A

The core idea is to augment/replace the vector of inputs X with additional variables, which are transformations of X, and then use linear models in this new space of derived input features.

Once the basis functions hₘ have been determined, the models are linear in these new variables, and the fitting proceeds as before.
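
For instance, a quadratic-plus-interaction expansion of two inputs followed by an ordinary linear fit in the derived features; this is a generic sketch with made-up data, not an example from the notes.

```python
import numpy as np

def expand_basis(X):
    """Map (x1, x2) to the derived features (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = 1.0 + X[:, 0]**2 - 2 * X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=200)

H = np.column_stack([np.ones(len(X)), expand_basis(X)])  # linear model in the h_m(X)
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
print(np.round(beta, 2))   # recovers roughly [1, 0, 0, 1, 0, -2]
```
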

65
Q

Cubic spline models

A

Cubic spline models fit piecewise cubic polynomials to the data, subject to 3 constraints:

  • The fit should be continuous at the knots.
  • The fit should have continuous first derivatives at the knots.
  • The fit should have continuous second derivatives at the knots.
66
Q

Natural Cubic Spline

A

A natural cubic spline adds additional constraints, namely that the function is linear beyond the boundary knots.

This frees up four degrees of freedom (two constraints each in both boundary regions), which can be spent more profitably by sprinkling more knots in the interior region.

67
Q

How might a natural cubic spline be fit in a binary classification context?

A

To implement natural cubic splines in a binary classification context, the natural cubic spline can be used in the usual regression setting, but with the response Y ∈ {-1, 1}.

The results from such a model can be quite difficult to interpret, therefore a better alternative is to use logistic regression, with a basis function expansion.

68
Q

Smoothing splines

A

Smoothing splines avoid the knot selection problem completely by using a maximal set of knots.

The complexity of the fit is controlled by regularization.

69
Q

Perceptron learning algorithm

A

The perceptron learning algorithm tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.

If a response yᵢ = 1 is misclassified, then xᵢ’β + β₀ < 0, and the opposite for a misclassified response with yᵢ = −1.

SEE NOTES
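
A minimal sketch of the classic stochastic perceptron update (learning rate, stopping rule and names are assumptions; labels are taken to be ±1):

```python
import numpy as np

def perceptron(X, y, lr=1.0, epochs=100):
    """Perceptron updates: adjust beta, beta_0 only on misclassified points."""
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ beta + beta0) <= 0:     # misclassified (or on the boundary)
                beta += lr * yi * xi              # move the boundary toward xi
                beta0 += lr * yi
    return beta, beta0
```
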

70
Q

3 Problems with the perceptron learning algorithm

A
  • When the data are separable, there are many solutions, and which one is found depends on the starting values.
  • The “finite” number of steps can be very large. The smaller the gap, the longer the time to find it.
  • When the data are not separable, the algorithm will not converge, and cycles develop. The cycles can be long and therefore hard to detect.
71
Q

Define what is meant by linearly separable training data in a binary classification problem

A

Two sets of points are linearly separable if there exists at least one hyperplane in the space (or a line in the two-dimensional case) with all the points from either class on the same side of the plane.

I.e., the points from the different classes are able to be perfectly separated by some linear boundary.

72
Q

Compare linear logistic regression and linear discriminant analysis.

A

LDA fits two normal density functions (one for each class), which create a linear boundary where they intersect, whereas logistic regression only models the log-odds (logit) function between the two classes.

73
Q

Compare linear logistic regression and linear discriminant analysis.

Parameter estimation

A

BLR:
Based on Maximum likelihood estimation.

LDA:
Based on Least squares estimation; equivalent to linear regression with binary predictand.

74
Q

Generalized Linear Models

A

A generalized linear model is a flexible generalization of ordinary linear regression that allows for response variables that have error distributions other than a normal distribution.

This is done by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

75
Q

3 Elements of a GLM

A
  • A probability distribution from the exponential family
  • A linear predictor η = Xβ
  • A link function g such that E(Y) = μ = g⁻¹(η)
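
As an illustration of these three elements, a Poisson GLM with a log link fitted via statsmodels; the data are synthetic and the coefficient values are arbitrary.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
eta = 0.3 + 0.8 * X[:, 0] - 0.5 * X[:, 1]          # linear predictor eta = X beta
y = rng.poisson(np.exp(eta))                       # Poisson response, log link: mu = exp(eta)

model = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson())
result = model.fit()
print(result.params)                               # estimates of (beta0, beta1, beta2)
```
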
76
Q

Any thoughts on variable selection in high-dimensional problems?

A

When modelling a highly dimensional problem with a relatively small dataset, overfitting becomes a major concern. Therefore, heavy / rigid structural assumptions as well as variable selection are needed to provide sensible results in such cases.

In this setting, very few predictors may truly have a significant influence.

Variable selection can thus help make the model:

  • simpler,
  • easier to interpret,
  • computationally more efficient, and
  • lower in variance.
77
Q

Forward-stagewise regression

A

Forward-stagewise regression (FS) is even more constrained than forward-stepwise regression.

It starts like forward-stepwise regression, with an intercept equal to ȳ, and centered predictors with coefficients initially all 0.

At each step the algorithm identifies the variable most correlated with the current residual.

It then computes the simple linear regression coefficient of the residual on this chosen variable, and then adds it to the current coefficient for that variable.

This is continued till none of the variables have correlation with the residuals—i.e. the least-squares fit when N > p.

Unlike forward-stepwise regression, NONE OF THE OTHER VARIABLES ARE ADJUSTED when a term is added to the model.

As a consequence, forward stagewise can take many more than p steps to reach the least squares fit, and historically has been dismissed as being inefficient.

It turns out that this “slow fitting” can pay dividends in high-dimensional problems.

78
Q

Compare step-wise approaches to the best-subset variable selection approach.

A

Rather than searching through all possible subsets of predictors, stepwise approaches take a more constrained search:

  • They explore a far more restricted set of models.
  • They are greedy algorithms, as a particular predictor can be included or dropped without considering the future steps of the procedure.

Therefore, stepwise procedures are unlikely to find the same “best” subset of predictors as the best-subset approach.

However, an advantage is that stepwise procedures are much less computationally expensive as they search over a highly restricted set of models.

79
Q

What is meant by L1 regularization?

A

L1 regularization modifies the loss function by adding some multiple of the sum of the absolute magnitudes of the model parameters as a penalty term to the loss function.

I.e. the regularization term in the cost function becomes:
J(𝜷) = ||𝜷||₁ = Σⱼ |βⱼ|

When applied to regression, the procedure obtained when using an L1 penalty term is referred to as Lasso regression.
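
A sketch of the resulting penalised objective; the loss here is a plain residual sum of squares and λ is an arbitrary tuning value.

```python
import numpy as np

def lasso_cost(beta, X, y, lam):
    """RSS plus an L1 penalty on the coefficients (intercept excluded for simplicity)."""
    rss = ((y - X @ beta) ** 2).sum()
    return rss + lam * np.abs(beta).sum()
```
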

80
Q

What is meant by L2 regularization?

A

L2 regularization involves adding a regularisation term to the loss function used when optimizing the parameters of some regression model.

L2 regularization modifies the loss function by adding some multiple of the sum of the squared magnitudes of the model parameters as a penalty term to the loss function.

I.e. the regularization term in the cost function becomes:
J(𝜷) = ||𝜷||₂² = Σⱼ βⱼ²

When applied to regression, the procedure obtained when using an L2 penalty term is referred to as Ridge regression.
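
Ridge regression has a closed-form solution; a minimal sketch for centred data is shown below, with λ an arbitrary tuning value and the intercept left unpenalised (and omitted).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimise ||y - X beta||^2 + lam * ||beta||^2 (closed-form solution)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```
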

81
Q

Discuss the choice of k in k-NN

A

In general, the optimal value for k will depend on the bias-variance tradeoff.

A small value for k will provide the most flexible fit, which will have low bias but high variance.
This variance is due to the fact that the prediction in a given region is entirely dependent on just one observation.

In contrast, larger values of k provide a smoother and less variable fit; the prediction in a region is an average of several points, and so the influence of a single observation is smaller.

82
Q

Compare the assumptions of k-NN to those of linear regression

A

In contrast to the strong assumptions made by the linear model, k-NN makes no assumptions about the functional form of the problem being solved.

Therefore, k-NN is referred to as a non-parametric machine learning algorithm, whilst linear regression is an example of a parametric approach as it assumes a linear functional form for f(X).

83
Q

4 Reasons for using k-NN above linear regression

A
  • Non-linearity.
    If the relationship between X and y are known to be non-linear, k-NN would be an obviously better choice.
  • Versatility
    Linear regression is not very useful for classification, whereas k-NN is easily applied in a classification setting.
  • Accuracy.
    k-NN may achieve higher accuracy than a linear model, which can only output a linear combination of the input features and cannot represent complex, higher-order relationships.
  • Easy to understand.
    Not that linear regression is particularly difficult to explain, but the concept of distance from some observation is natural and (arguably) easier to explain than multiple linear regression in a highly dimensional problem.
84
Q

5 Reasons for using linear regression above k-NN

A
  • Small dataset (small n).
    If there is a small number of observations per predictor, a parametric method, like linear regression, might work better.
  • Highly dimensional dataset (large p).
    Thanks to their rigid assumptions, linear models are less likely to overfit in high dimensional datasets.
  • Interpretability.
    A k-NN lacks the same interpretability of a linear model (i.e. coefficients indicating the influence of each parameter on the response). It fails to describe our underlying data as aptly as linear regression would.
  • Speed, cost and performance.
    A k-NN is computationally expensive, both in terms of memory and computation.
    Storage: To store a k-NN model would involve storing all observations from the training data. In comparison, linear regression only stores a coefficient for each feature in the input dataset, together with a bias, resulting in far lower storage requirements.
    Computation: To classify a new example would require comparing the example to every observation in the training set, which is much more expensive than simply multiplying the features by a set of coefficients - as would be the case for linear regression.
  • Model robustness.
    If a small value of k is used, k-NN will be sensitive to irrelevant features.
85
Q

Compare the properties of the two procedures obtained from L1 and L2 regularization schemes.

A

Ridge regression does a proportional shrinkage.

Lasso translates each coefficient by a constant factor λ, truncating at zero.

Ridge regression generally only shrinks the variable coefficients to values close to zero. The Lasso, on the other hand, can set coefficients equal to zero - thereby performing variable selection.

86
Q

Many of the supervised learning procedures fall somewhere between linear least squares and a nearest-neighbour approach.

Discuss this statement, referring to examples of other procedures.

A

At the one end of the spectrum, linear least squares makes very strong structural assumptions about the distribution of the data and the errors.

At the other end, k-Nearest Neighbours (especially with small values of k), makes no assumptions about the functional form of the problem being solved, and is entirely non-parametric.

We can see where different models fall on this spectrum by considering the assumptions made by such models.

In descending order of assumption strength:

  • linear regression
  • generalized linear models, which try to relax the OLS assumption that the errors are normally distributed. Logistic regression also falls in the category of GLMs,
  • local regression tries to relax the assumption of “global” linearity made by OLS and instead opts for only local linearity in the immediate neighbourhood of some point,
  • Basis function expansions and splines also try to further relax the linearity assumption made by OLS.
  • The Support Vector Machine also tries to allow for non-linearity through the use of kernels.

Up to this point, all methods still assume (at least to some extent) a specific “structural form” dictating the assumed distribution of the underlying data.

  • Deep Neural Networks further relax these assumptions about the form of the model. They are “Universal Function Approximators”, and are able to approximate any deterministic function to an arbitrary degree of accuracy. They do still, however, assume that the observations were generated by some deterministic function.
  • As mentioned, at the very end of this spectrum is k-Nearest Neighbours, which makes no assumptions about the form of the underlying distribution.
87
Q

Contrast the single key assumption made by k-NN with that made by least squares

A
  • Least squares assumes f(x) is well approximated by a GLOBALLY LINEAR function.
  • k-Nearest neighbours assumes f(x) is well approximated by a LOCALLY CONSTANT function
88
Q

List 5 ways in which the simple procedures, kNN and linear least squares have been enhanced.

A
  • Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors.
  • In high-dimensional spaces the distance kernels are modified to emphasize some variables more than others.
  • Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.
  • Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.
  • Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.
89
Q

Discuss a Bayesian interpretation of the Penalised Residual Sum of Squares.

A

The penalised residual sum of squares can be cast into a Bayesian framework where we think of:

  • The unpenalised residual sum of squares as the log-likelihood of the data we observe.
  • The penalty function |β|^q as the log-prior density for β.

The penalised RSS can then be viewed as a compromise between the observed data and the prior, i.e. a posterior estimate of β.

90
Q

The bayesian prior corresponding to the lasso

A

An independent double exponential (or Laplace) distribution for each input, with density
( 1 / 2t ) exp( − |β| / t )
and
t = 1/λ.

91
Q

The bayesian prior corresponding to ridge regression

A

The prior for each parameter in β is N(0, τ²).

The ridge estimate then follows as the mode of the posterior, and since the posterior is Gaussian, it is also the posterior mean.

92
Q

Describe the multivariate linear regression approach to the multiclass classification problem.

A

Here each of the response categories is coded via an indicator variable.

Thus if G has K classes, there will be K such indicators, Yk, k = 1, …, K,
Yk = 1 if G = k
Yk = 0 otherwise

These are collected together in a vector Y = (Y1, …, YK), and the N training instances of these form an N x K indicator response matrix Y.

Y is a matrix of 0s and 1s, with each row having a single 1.

We fit a linear regression model to each of the columns of Y simultaneously, and the fit is given by:

Y_hat = X ( X’X )⁻¹ X’Y

Note that we have a coefficient vector for each response column yk, and hence a (p+1)xK coefficient matrix:

B_hat = ( X’X )⁻¹ X’Y

A new observation with input x is classified as follows:

  • compute the fitted output f_hat(x)’ = (1, x’) B_hat, a K vector
  • identify the largest component and classify accordingly: G_hat(x) = argmax_{k in G} f_hat_k(x)
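
A direct NumPy transcription of this procedure (toy class labels g are assumed to be integers 0, …, K-1; all names are illustrative):

```python
import numpy as np

def indicator_regression_fit(X, g, K):
    """Fit B_hat = (X'X)^{-1} X'Y with Y the N x K indicator response matrix."""
    N = X.shape[0]
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                      # one 1 per row, in column g_i
    X1 = np.column_stack([np.ones(N), X])         # include the intercept
    B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return B_hat                                  # (p+1) x K coefficient matrix

def indicator_regression_predict(B_hat, x_new):
    f_hat = np.concatenate(([1.0], x_new)) @ B_hat   # fitted K-vector
    return int(np.argmax(f_hat))                     # classify to the largest component
```
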
93
Q

For subset selection, we choose the best model using: (4)

A
  • Akaike information criterion (AIC)
  • Bayesian information criterion (BIC)
  • Cross-validation
  • Mallows’s Cp
94
Q

Comment on the applicability of the usual standard error formulas following (stepwise) variable selection

A

The standard errors obtained from the final model resulting from the stepwise procedures cannot be used, since they do not account for the search process.

They must be calculated using a bootstrap procedure.

95
Q

Discuss the underlying rationale for the linear regression approach to multiclass classification.

A

View the regression as an estimate of conditional expectation.

For the random variable Yk,
E(Yk|X = x) = Pr(G = k|X = x),
so conditional expectation of each of the Yk seems a sensible goal.

96
Q

Discuss the main problem with the linear regression approach to classification.

A

f_hat(x) can be negative or greater than 1.

97
Q

3 Assumptions of binary logistic regression

A

Y1, Y2, …, Yn are independently distributed.

Yᵢ ~ Bin(nᵢ, 𝜋ᵢ), i.e. the binary logistic regression model assumes a binomial distribution for the response.

Assumes a linear relationship between the logit of the response and the explanatory variables logit(𝜋) = β₀ + βX.

98
Q

Define the logit function

A

The logit function is simply the logarithm of the odds:

logit(x) = log( x / (1 - x) )
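
In code, together with its inverse (the sigmoid), as a tiny sketch:

```python
import numpy as np

def logit(x):
    """Log-odds of a probability x in (0, 1)."""
    return np.log(x / (1 - x))

def sigmoid(z):
    """Inverse of the logit: maps the real line back to (0, 1)."""
    return 1 / (1 + np.exp(-z))

print(logit(0.75), sigmoid(logit(0.75)))   # ~1.0986, 0.75
```
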

99
Q

Give the mathematical definition of linear separability.

A

Let X0 and X1 be two sets of points in an n-dimensional Euclidean space.

Then X0 and X1 are linearly separable if there exists a vector of real numbers, w, and a scalar, k, such that:

every point x in X0 satisfies:
w · x > k

and every point x in X1 satisfies:
w · x < k

100
Q

Discuss the optimal separating hyperplane

A

The optimal separating hyperplane between two classes separates the two classes and maximizes the distance to the closest point from either class.

Not only does this provide a unique solution to the separating hyperplane problem, but by maximizing the margin between the two classes on the training data, this leads to better classification performance on test data.

101
Q

Discuss the optimisation problem of the optimal separating hyperplane.

A

max_{ β, β₀, ||β|| = 1} M
subject to:
yᵢ ( xᵢ’β + β₀ ) ≥ M

The set of conditions ensures that all points are at least a signed distance M from the decision boundary defined by β and β₀, and we seek the largest such M and the associated parameters.

We can get rid of the ||β|| = 1 constraint by replacing the conditions with:

yᵢ ( xᵢ’β + β₀ ) / ||β|| ≥ M

or equivalently:

yᵢ ( xᵢ’β + β₀ ) ≥ M ||β||

Since for any β and β₀ satisfying these inequalities, any positively scaled multiple satisfies them too, we can arbitrarily set ||β|| = 1/M.

Thus the optimisation problem is equivalent to:

min_{ β, β₀} ||β||² / 2
subject to:
yᵢ ( xᵢ’β + β₀ ) ≥ 1

These constraints define an empty slab or margin around the linear decision boundary of thickness 1 / ||β||

Hence, we choose β and β₀ to maximise its thickness.

102
Q

Optimal separating hyperplane:

Give the Lagrange (primal) function to be minimized,
as well as its solutions.

A

Minimize w.r.t. β, β₀:

Lp = ||β||² / 2 - Σ αᵢ [ yᵢ ( xᵢ’β + β₀ ) - 1 ]

β = Σ αᵢ yᵢ xᵢ
0 = Σ αᵢ yᵢ
103
Q

Discuss the Support Vector Classifier

A

The generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier, where a small proportion of the training sample is allowed to cross the margins.

Rather than looking for the largest possible margin so that every observation is on the correct side of the margin, some observations are allowed to be on the incorrect side of the margins.

The margin is soft as a small number of observations violate the margin.

Softness is controlled by the slack variables which control the position of the observations relative to the margins and separating hyperplane.

104
Q

Give the optimisation problem relating to the support vector classifier

A

Define the slack variables ξ = (ξ1,ξ2,… ,ξN).

The support vector classifier maximises a soft margin.

The optimisation problem becomes:

max_{ β, β₀, ||β|| = 1} M
subject to:
yᵢ ( xᵢ’β + β₀ ) ≥ M ( 1 - ξᵢ )

ξᵢ ≥ 0
Σξᵢ ≤ constant

105
Q

Support Vector Classifier
Give an interpretation of the slack variables:

ξi = 0

A

ξi = 0
=> yᵢ (xᵢ’β + β₀) ≥ 1

Therefore observation i lies on or outside the margin, on the correct side of the decision boundary.

106
Q

Support Vector Classifier
Give an interpretation of the slack variables:

0 < ξi < 1

A

0 < ξi < 1
=> 0 < yᵢ (xᵢ’β + β₀) < 1

Therefore the observation i lies within the margin, on the correct side of the decision boundary.

107
Q

Support Vector Classifier
Give an interpretation of the slack variables:

ξi = 1

A

ξi = 1
=> yᵢ (xᵢ’β + β₀) = 0

Therefore the observation i lies on the decision boundary.

108
Q

Support Vector Classifier
Give an interpretation of the slack variables:

ξi > 1

A

ξi > 1
=> yᵢ (xᵢ’β + β₀) < 0

Therefore the observation i lies on the wrong side of the decision boundary.

109
Q

Support Vector Classifier

Discuss the cost parameter

A

With a small value for C, we can tolerate many ξᵢ > 0. Many points are therefore allowed to be inside the margin, as well as on the other side of the decision boundary. This results in many support points / vectors, and the margin is much wider.

-> The fit will be stable but may have high bias.

With a large value for C, we cannot tolerate many ξᵢ > 0. Fewer points are therefore allowed to be inside the margin, and even fewer on the other side of the decision boundary (where ξᵢ is large). This results in only a few support vectors, and the margin will be small.

This results in a less stable fit but lower bias.

When C = ∞, the optimal separating hyperplane is obtained (only when the data is linearly separable).
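
This effect can be seen directly with scikit-learn's SVC, whose C parameter plays this role; the data below are synthetic, and the exact support-vector counts will vary.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C -> wide (soft) margin and many support vectors; large C -> fewer
    print(C, clf.n_support_)
```
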

110
Q

Discuss the SVM

A

Both the optimal separating hyperplane and the support vector classifier find linear decision boundaries in the original feature space. The idea for SVMs is to use the exact same formulation as for SVCs, but in an enlarged feature space.

We then find a linear decision boundary in the enlarged feature space which then results in a non-linear decision boundary in the original feature space.

111
Q

What is the motivation for supervised Principal Components Analysis

A

One shortcoming of using PCA in a supervised learning problem, is that the principal components (which are linear combinations of input features, xi) are selected, in turn, as to capture the greatest possible variance in the input data.

The procedure, however, takes no consideration of the influence of any particular input variable, xi, on the response, yi.

This is the main motivation behind supervised principal components analysis.

112
Q

4 Steps of Supervised PCA

A

Assuming that there are p features measured on N observations. Let X be an N x p matrix of feature measurements, and let y be the N-vector of outcome measurements.

Supervised PCA follows 4 steps:

  • Compute standard regression coefficients for each feature. This gives an indication of the “correlation” of each feature with the response.
  • Form a reduced data matrix consisting of only those features whose univariate coefficient exceeds a threshold θ in absolute value (θ is estimated by cross-validation).
  • Compute the first (or first few) principal components of the reduced data matrix.
  • Use these principal component(s) in a regression model to predict the outcome.
113
Q

Regularization:
J(𝜷) = λ Σⱼ |βⱼ|^q
Comment on the possibility of using other forms of J(B), instead of L1 and L2 regularisation.

A

The case q = 1 (lasso) is the smallest q such that the constraint region is convex.

Non-convex constraint regions make the optimization problem more difficult.

ELASTIC-NET

With q > 1, |βⱼ|^q is differentiable at 0, and so does not share the ability of the lasso to set coefficients exactly to zero.

Partly for this reason, as well as for computational tractability, Zou and Hastie introduced the elastic-net penalty:

λ Σⱼ ( αβⱼ² + (1 - α) |βⱼ| )

The elastic-net selects variables like the lasso, and shrinks together the coefficients of correlated predictors like the ridge.

It also has considerable advantages over the Lq penalties.
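
scikit-learn exposes this penalty directly; note that its l1_ratio weights the L1 term, so its role differs from the α in the formula above. The data and parameter values are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
beta = np.zeros(10)
beta[:3] = [2.0, -1.5, 1.0]                                # sparse true coefficients
y = X @ beta + rng.normal(scale=0.5, size=100)

# alpha is the overall penalty strength, l1_ratio the mix between the L1 and L2 parts
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(enet.coef_, 2))   # coefficients shrunk, with some set exactly to zero
```
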

114
Q

Discuss the starting values (initialization) for the weights of an artificial neural network.

A

Note that if the weights are near zero, then the operative part of the sigmoid is roughly linear, and hence the neural network collapses into an approximately linear model.

Usually starting values for weights are chosen to be random values near zero.

Hence the model starts out linear and becomes nonlinear as the weights increase.

Individual units localize to directions and introduce nonlinearities where needed.

Use of exact zero weights leads to zero derivatives and perfect symmetry, and the algorithm never moves.

Starting instead with large weights often leads to poor solutions.

115
Q

Discuss the isssue of overfitting w.r.t. neural networks

A

Often neural networks have too many weights and will overfit the data at the global minimum of R.

116
Q

Preventing Overfitting in Neural Networks:

Early Stopping

A

In early developments of neural networks, an early stopping rule was used to avoid overfitting. Here we train the model only for a while, and stop well before we approach the global minimum.

Since the weights start at a highly regularized (linear) solution, this has the effect of shrinking the final model toward a linear model.

A validation dataset is useful for determining when to stop, since we expect the validation error to start increasing.

117
Q

Preventing Overfitting in Neural Networks:

Weight Decay

A

A more explicit method for regularization is weight decay, which is analogous to ridge regression for linear models.

We add a penalty to the error function, giving R(θ) + λJ(θ), where λ ≥ 0 is a tuning parameter and

J(θ) = Σ βₖₘ² + Σ αₘₗ²

is the sum of the squares of all the weights in the network.

Typically cross-validation is used to estimate λ.
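
A sketch of this penalised objective for a one-hidden-layer network; it is purely illustrative, and λ would in practice be chosen by cross-validation.

```python
import numpy as np

def penalised_error(y, y_hat, alpha, beta, lam):
    """Sum-of-squares error plus a weight-decay penalty on all network weights."""
    R = ((y - y_hat) ** 2).sum()                       # R(theta)
    J = (beta ** 2).sum() + (alpha ** 2).sum()         # J(theta) = sum beta^2 + sum alpha^2
    return R + lam * J
```
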

118
Q

Issues in training neural networks:

Scaling of the inputs

A

Since the scaling of the inputs determines the effective scaling of the weights in the bottom layer, it can have a large effect on the quality of the final solution.

At the outset it is best to standardize all inputs to have mean zero and standard deviation one.

This ensures all inputs are treated equally in the regularization process, and allows one to choose a meaningful range for the random starting weights.

119
Q

Issues in training neural networks:

Number of hidden units

A

In general, it is better to have too many hidden units than too few.

With too few hidden units, the model might not have enough flexibility to capture the nonlinearities in the data.

With too many hidden units, the extra weights can be shrunk towards zero if appropriate regularization is used.

Typically, the number of hidden units is somewhere in the range of 5 - 100, with the number increasing with the number of inputs and number of training cases.
It is common to put down a reasonably large number of units and train them with regularization.