Exam Questions Flashcards
Define what is meant by supervised learning
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.
Discuss the training data used in supervised learning
Supervised learning infers a function from a set of LABELED training data.
Each example is a pair consisting of
- an input object (typically a vector) and
- a desired output value (also called the supervisory signal).
Discuss the desired result of a supervised learning algorithm
A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.
An optimal scenario will allow for the algorithm to correctly determine the class labels for UNSEEN instances.
This requires the learning algorithm to generalize from the training data to unseen situations in a “reasonable” way.
Define unsupervised learning
Unsupervised learning is a branch of machine learning that learns from data that has not been labeled, classified or categorized.
Instead of responding to feedback, unsupervised learning
… IDENTIFIES COMMONALITIES in the data and
… reacts based on the presence or absence of such commonalities in each new piece of data.
2 Main classes of supervised learning problems
- classification
- regression
Supervised learning:
Define classification
Classification is the problem of
identifying to which of a set of categories (sub-populations)
a new observation belongs,
on the basis of a training set of data
containing observations (or instances) whose category membership is known.
Supervised learning:
Define regression analysis
Regression analysis is a set of statistical processes for estimating the relationships among variables.
Regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed.
Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables.
2 of the objectives of supervised learning
- Inference
- Prediction
Define statistical inference
Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution.
Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates.
It is assumed that the observed data set is sampled from a larger population.
Define predictive inference
Predictive inference is an approach to statistical inference that emphasizes the prediction of future observations based on past observations.
Describe least squares linear regression
Ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model.
OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being predicted) in the given dataset and those predicted by the linear function.
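As a minimal sketch (assuming NumPy; the data and variable names are illustrative, not from the flashcard), the OLS estimate can be obtained by solving the least squares problem directly:

```python
import numpy as np

# Illustrative data: n = 100 observations, p = 3 explanatory variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=100)

# Add an intercept column and solve the least squares problem min ||y - X1 b||^2.
X1 = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

residuals = y - X1 @ beta_hat
print(beta_hat, (residuals ** 2).sum())  # estimated coefficients and the RSS being minimized
```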
5 Assumptions of Ordinary Least Squares (OLS) Linear Regression
- Linearity (Correct specification)
- Strict exogeneity
- No linear dependence in the regressors
- Spherical errors
- Normality
Assumptions of Ordinary Least Squares (OLS) Linear Regression:
Linearity (Correct specification)
The mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables.
Assumptions of Ordinary Least Squares (OLS) Linear Regression:
Normality of the errors
The errors have normal distribution conditional on the regressors:
ε | X ~ N(0, σ² Iₙ )
Assumptions of Ordinary Least Squares (OLS) Linear Regression:
Lack of perfect multicollinearity in the predictors
For standard OLS estimation methods, the matrix X must have full column rank, p.
Assumptions of Ordinary Least Squares (OLS) Linear Regression:
Strict exogeneity
The errors in the regression should have a conditional expectation of zero: E[ ε | X ] = 0
The immediate consequences of the exogeneity assumption are that:
- the errors, ε, have an expectation of zero: E[ ε ] = 0
- the regressors are uncorrelated with the errors: E[ Xᵀε ] = 0
Assumptions of Ordinary Least Squares (OLS) Linear Regression:
Spherical errors
Var[ ε | X ] = σ² Iₙ , where Iₙ is the identity matrix in dimension n and σ² is a parameter which determines the variance of each observation.
This assumption can be split into two parts:
Homoscedasticity:
E[ εᵢ² | X ] = σ², which means that the error term has the same variance σ² in each observation.
No autocorrelation:
The errors are uncorrelated between observations: E[ εᵢ εⱼ | X ] = 0 for i ≠ j .
K-nearest neighbours classification
A non-parametric method used for classification.
The input consists of the k closest training examples in the feature space.
The output is a class membership.
An object is classified by a majority vote of its neighbours, with the object being assigned to the class most common among its k nearest neighbours.
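A minimal sketch of the majority-vote rule (assuming NumPy, Euclidean distance, and non-negative integer class labels; all names are illustrative):

```python
import numpy as np

def knn_classify(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training examples."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances to all training points
    nearest = np.argsort(dists)[:k]                   # indices of the k closest examples
    votes = y_train[nearest].astype(int)              # labels assumed non-negative integers
    return np.bincount(votes).argmax()                # most common class among the neighbours
```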
K-nearest neighbours regression
A non-parametric method used for regression.
The input consists of the k closest training examples in the feature space.
The output is the property value for the object.
The value is the average of the values of its k nearest neighbours.
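The regression variant differs only in the output step; a sketch in the same style (again assuming NumPy and Euclidean distance):

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    """Predict the average response value of the k nearest training examples."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()
```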
History of Neural Networks:
McCulloch and Pitts
1943
Warren McCulloch - a neurophysiologist
Walter Pitts - a mathematician
They wrote a paper on how neurons might function.
They modelled a simple neural network with electrical circuits.
History of Neural Networks:
2 Major concepts that preceded Neural Networks
- Threshold logic - converting continuous input to discrete output
- Hebbian learning - a model of learning based on neural plasticity, proposed in “The Organization of Behaviour” by Donald Hebb in 1949
McCulloch-Pitts neuron
Takes a weighted sum of some inputs and returns ‘0’ if the result is below a threshold and ‘1’ otherwise.
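A minimal sketch of this threshold rule (the weights, inputs and threshold value are illustrative assumptions):

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Return 1 if the weighted sum of inputs reaches the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0

# Example: an AND-like gate with unit weights and threshold 2.
print(mcculloch_pitts([1, 1], [1, 1], 2))  # 1
print(mcculloch_pitts([1, 0], [1, 1], 2))  # 0
```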
History of Neural Networks:
Mark I Perceptron
In 1958, Frank Rosenblatt - a psychologist at Cornell - proposed the idea of a Perceptron.
It was a system with a simple input output relationship modelled on a McCulloch-Pitts neuron.
History of Neural Networks:
Advantage of the Mark I Perceptron
Its weights could be ‘learnt’ through successively passed inputs, while minimizing the difference between desired and actual output.
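A minimal sketch of that weight-update idea (the learning rate, number of epochs and ±1 label coding are assumptions for illustration, not details of the original Mark I hardware):

```python
import numpy as np

def perceptron_train(X, y, epochs=20, lr=0.1):
    """Learn weights w and bias b so that sign(w.x + b) matches labels y in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:      # misclassified: nudge the weights towards the desired output
                w += lr * yi * xi
                b += lr * yi
    return w, b
```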
History of Neural Networks:
Major shortcoming of the Mark I Perceptron
It could only learn to separate linearly separable classes, and thus it wasn’t able to model the simple, but non-linear XOR (exclusive-or) circuit.
History of Neural Networks:
Marvin Minsky’s book
Marvin Minsky’s book “Perceptrons” argued that Rosenblatt’s single perceptron approach to neural networks could not be translated effectively into multi-layered neural networks.
Evaluating the correct relative values of the weights of the neurons spread across layers, based only on the final output, would take an impractically large (if not infinite) number of iterations.
History of Neural Networks:
Backpropagation
Backpropagation’s potential for training neural networks was first noticed by Paul Werbos in his PhD thesis.
This work was re-discovered by Rumelhart, Hinton and Williams, who presented it in a clear and detailed framework.
What is meant by feature extraction?
Feature extraction is a process of dimensionality reduction by which an initial set of raw data is reduced to more manageable groups for processing.
A characteristic of these large data sets is a large number of variables that require a lot of computing resources to process.
Feature extraction is the name for methods that combine variables into features, effectively reducing the amount of data that must be processed, while still accurately and completely describing the original data set.
The process of feature extraction is useful when you need to reduce the number of resources needed for processing without losing important or relevant information.
Feature extraction can also reduce the amount of redundant data for a given analysis.
Also, the reduction of the data and the machine’s efforts in building variable combinations (features) facilitate the following learning and generalization steps in the statistical learning process.
Principal Components Analysis
A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables (called principal components).
The transformation is defined in such a way that the first principal component accounts for the greatest possible amount of variability in the data, and each succeeding component in turn has the highest variance possible subject to the constraint that it is orthogonal to the preceding components.
The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set.
PCA is often used as a dimensionality reduction technique, by using only the first few principal components (which capture the greatest amount of variability in the data) as inputs.
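A minimal sketch of PCA via the singular value decomposition of the centred data matrix (assuming NumPy; variable names are illustrative):

```python
import numpy as np

def pca(X, n_components=2):
    """Return the scores of X on its first n_components principal components."""
    Xc = X - X.mean(axis=0)                       # centre each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                # orthogonal directions of maximal variance
    return Xc @ components.T                      # principal component scores
```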
4 Steps involved in Supervised PCA
- Compute standard regression coefficients for each feature. This gives an indication of the “correlation” of each feature with the response.
- Form a reduced data matrix consisting of only those features whose univariate coefficient exceeds a threshold θ in absolute value (θ is estimated by cross-validation).
- Compute the first (or first few) principal components of the reduced data matrix.
- Use these principal component(s) in a regression model to predict the outcome.
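A minimal sketch of these four steps (assuming NumPy, a continuous response, and a fixed threshold θ rather than one estimated by cross-validation):

```python
import numpy as np

def supervised_pca(X, y, theta=0.5):
    """Screen features by univariate regression coefficient, then regress y on PC1 of the survivors."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    # Step 1: univariate regression coefficient of y on each (centred) feature.
    coefs = (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)
    # Step 2: keep features whose coefficient exceeds theta in absolute value.
    keep = np.abs(coefs) > theta
    X_red = Xc[:, keep]
    # Step 3: first principal component of the reduced data matrix.
    U, S, Vt = np.linalg.svd(X_red, full_matrices=False)
    pc1 = X_red @ Vt[0]
    # Step 4: simple regression of the outcome on the first principal component.
    slope = (pc1 @ yc) / (pc1 @ pc1)
    return keep, pc1, slope
```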
Maximal Margin Classifier
In the case of linearly separable data, we can use the optimal separating hyperplane to construct a maximum margin classifier.
The hyperplane, defined as {x: w·x + b = 0}, will perfectly separate the data from the two classes (as they are linearly separable), and we can use the simple classification rule:
G(x) = sign[ w·x + b ]
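A minimal sketch of this rule (assuming scikit-learn's linear SVC with a very large C to approximate the hard, maximal margin on linearly separable data; the data are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters so the classes are linearly separable.
np.random.seed(0)
X = np.vstack([np.random.randn(20, 2) - 3, np.random.randn(20, 2) + 3])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard-margin classifier
w, b = clf.coef_[0], clf.intercept_[0]        # hyperplane parameters

G = np.sign(X @ w + b)                        # classification rule G(x) = sign[w·x + b]
print((G == y).mean())                        # training accuracy (1.0 when separable)
```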
Problem with a maximum margin classifier:
The existence of an optimal separating hyperplane cannot be guaranteed.
Or, even if it does exist, the data might be noisy, meaning that the maximal margin classifier provides a poor solution.
Define Variable selection
Variable selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.
Feature selection techniques are used for 4 reasons:
- Simplification of models to make them more interpretable.
- To shorten training times.
- To avoid the curse of dimensionality.
- To enhance generalization by reducing overfitting.
Best subset selection
When performing best subset selection, we fit a separate least squares regression for each possible combination of the p predictors. I.e. we fit all p models that contain exactly one predictor, all ₚC₂ = p(p-1)/2 models that contain exactly two predictors, and so forth.
The problem of selecting the best model from among the 2ᵖ possibilities considered by best subset selection is not trivial.
3 Steps of the Best Subset Selection Algorithm
- Let M₀ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
For k = 1, 2, …, p:
- Fit all ₚCₖ models that contain exactly k predictors.
- Pick the best among these ₚCₖ models and call it Mₖ. Here best is defined as having the smallest RSS, or equivalently the largest R².
- Select a single best model from among M₀, …, Mₚ using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².
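A minimal sketch of the enumeration over subsets (assuming NumPy; models of each size are scored by RSS as in the algorithm above, and the final choice across sizes is left to a criterion such as BIC or cross-validation):

```python
import numpy as np
from itertools import combinations

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    Xs = np.column_stack([np.ones(len(y)), X[:, list(cols)]]) if cols else np.ones((len(y), 1))
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return ((y - Xs @ beta) ** 2).sum()

def best_subset(X, y):
    """Return the best model (lowest RSS) of each size k = 0, ..., p."""
    p = X.shape[1]
    best = {0: ((), rss(X, y, ()))}                  # M0: the null model
    for k in range(1, p + 1):
        cands = [(c, rss(X, y, c)) for c in combinations(range(p), k)]
        best[k] = min(cands, key=lambda t: t[1])     # Mk: smallest RSS among the pCk models
    return best
```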
Limitations of Best Subset Selection
While best subset selection is simple and conceptually appealing, it suffers from computational limitations.
The number of possible models that must be considered grows rapidly as p increases. In general, there are 2ᵖ models that involve subsets of p predictors.
Consequently, best subset selection becomes computationally infeasible for values of p greater than ~40.
Forward Stepwise selection
Forward Stepwise selection begins with a model containing no predictors, then adds predictors to the model, one-at-a-time, until all of the predictors are in the model.
In particular, at each step, the variable that gives the greatest additional improvement to the fit is added to the model.
3 Steps of the Forward Stepwise Selection algorithm
- Let M₀ denote the null model, which contains no predictors.
For k = 0, …, p-1:
- Consider all p - k models that augment the predictors in Mₖ with one additional predictor.
- Choose the best among these p - k models, and call it Mₖ₊₁. Here best is defined as having the smallest RSS, or equivalently the largest R².
- Select a single best model from among M₀, …, Mₚ using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².
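A minimal greedy sketch of the forward path (assuming NumPy; as above, choosing among M₀, …, Mₚ is left to a separate criterion):

```python
import numpy as np

def fit_rss(X, y, cols):
    """RSS of an OLS fit (with intercept) on the listed columns."""
    Xs = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return ((y - Xs @ beta) ** 2).sum()

def forward_stepwise(X, y):
    """Greedily add the predictor giving the largest reduction in RSS at each step."""
    p, chosen, path = X.shape[1], [], []
    for _ in range(p):
        remaining = [j for j in range(p) if j not in chosen]
        best_j = min(remaining, key=lambda j: fit_rss(X, y, chosen + [j]))
        chosen = chosen + [best_j]
        path.append(list(chosen))        # the models M1, ..., Mp along the forward path
    return path
```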
Backward Stepwise selection
Backward Stepwise selection begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one-at-a-time.
3 Steps of the Backward Stepwise selection algorithm
- Let Mₚ denote the full model, which contains all p predictors.
For k = p, p-1, …, 1:
- Consider all k models that contain all but one of the predictors in Mₖ, for a total of k-1 predictors.
- Choose the best among these k models, and call it Mₖ₋₁. Here best is defined as having the smallest RSS, or equivalently the largest R².
- Select a single best model from among M₀, …, Mₚ using cross-validated prediction error, Cp (AIC), BIC, or adjusted R².
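The backward path in the same style (a minimal sketch assuming NumPy; the RSS helper is repeated so the snippet is self-contained):

```python
import numpy as np

def fit_rss(X, y, cols):
    """RSS of an OLS fit (with intercept) on the listed columns."""
    Xs = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return ((y - Xs @ beta) ** 2).sum()

def backward_stepwise(X, y):
    """Start from the full model and greedily drop the least useful predictor at each step."""
    chosen = list(range(X.shape[1]))                 # Mp: all p predictors
    path = [list(chosen)]
    while len(chosen) > 1:
        # Drop the predictor whose removal leaves the smallest RSS.
        worst = min(chosen, key=lambda j: fit_rss(X, y, [c for c in chosen if c != j]))
        chosen.remove(worst)
        path.append(list(chosen))                    # Mk-1
    return path
```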
Contrast forward and backward stepwise selection
Backward stepwise selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise selection can be used even when n < p.
What is the purpose of regularisation?
To create a less complex (parsimonious) model when dealing with a large number of features.
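As one concrete example (ridge / L2 regularisation is an assumption here, since the flashcard does not name a specific penalty), a penalty term added to the RSS shrinks the coefficients and yields a less complex fit:

```python
import numpy as np

def ridge(X, y, lam=1.0):
    """Closed-form ridge estimate: minimise ||y - Xb||^2 + lam * ||b||^2."""
    n, p = X.shape
    Xc, yc = X - X.mean(axis=0), y - y.mean()        # centre so the intercept is not penalised
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y.mean() - X.mean(axis=0) @ beta
    return intercept, beta
```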
Define what is meant by a PGM
A Probabilistic Graphical Model is
… a probabilistic model
… for which a graph expresses the conditional dependence structure
… between a set of random variables.
2 Main types of PGMs
- Directed PGMs
- Undirected PGMs
3 Properties of Undirected PGMs
- The edges in the graph have no directions
- Only dependence is indicated.
- No causality can be inferred.
2 Properties of directed PGMs
- The edges in the graph have directions.
- The directions indicate causality and dependence.