Final Exam Past Exams Flashcards
How is Occam’s Razor applied to Machine Learning?
If you have two machine learning models with comparable performance, the simpler one is the better choice.
How many parameters does this model have?
d + 1 (one weight per input feature plus a bias term, assuming a linear model with d input features)
What is the difference between feature selection and feature extraction?
Feature selection chooses a subset of the original features for building a model.
Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features.
Describe PCA
PCA projects the data onto a lower-dimensional space spanned by the principal components. These principal components are the directions that capture the highest share of the variance.
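A minimal sketch of this in code, assuming scikit-learn is available (the data X here is a toy placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)          # toy data: 100 samples, 5 features
pca = PCA(n_components=2)           # keep the 2 directions with the most variance
X_reduced = pca.fit_transform(X)    # project the data onto those components

# fraction of the total variance captured by each kept component
print(pca.explained_variance_ratio_)
```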
What dimensionality reduction technique works the best?
LDA (red) works best because it produces the best separability between the classes.
How does the k-means clustering algorithm work and what is the “solution” that it produces?
k-means works by randomly initialising k centroids. Each data point is assigned to its nearest centroid, and at each iteration the centroid locations are updated to be the mean of their assigned data points. This continues until there is convergence. The "solution" it produces is the final set of centroid locations together with each data point's cluster assignment.
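A toy from-scratch sketch of the algorithm (illustrative only; empty clusters aren't handled):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # random initialisation
    for _ in range(n_iters):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels   # the "solution": centroid locations + assignments
```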
How would you assess if k-means clustering has worked properly?
If the centroids don’t move.
How would you assess if k-means has converged?
If every data point keeps the same cluster assignment in successive iterations, there's convergence.
How do you decide how many base learners when using bagging?
The optimum number is the point at which adding more base learners no longer reduces the variance of the ensemble.
What is the misclassification error of this dataset?
The sum of all off-diagonal elements of the confusion matrix divided by the total number of data points.
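A small sketch of this calculation (the confusion matrix below is a made-up example, not the one from the exam):

```python
import numpy as np

cm = np.array([[5, 1, 0],
               [2, 4, 1],
               [0, 3, 8]])          # rows = true class, columns = predicted class

total = cm.sum()
errors = total - np.trace(cm)       # off-diagonal entries are the misclassifications
print(errors / total)               # misclassification error
```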
Explain these models in terms of overfitting/underfitting.
Top left - degree 1
Top right - degree 2
Bottom left - degree 10
Bottom right - degree 25
The top left model is underfitting because no matter how much training data is added, its performance isn't increasing.
The bottom models are overfitting because the test error is significantly higher than the training error.
What’s the purpose of the validation set?
The purpose of the validation set is to provide a set of examples, separate from the training set, used to tune the parameters of a classifier.
One commonly used learning algorithm for linear discriminant models and MLP is Gradient Descent. What’s the basic idea behind gradient descent?
Iteratively adjust the function parameters (coefficients) in the direction of the negative gradient of a cost function, so as to minimize it as far as possible.
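A minimal sketch on a toy one-parameter cost function (the values here are illustrative):

```python
# minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
def grad(w):
    return 2 * (w - 3)

w = 0.0                             # initial parameter guess
lr = 0.1                            # learning rate (step size)
for _ in range(100):
    w -= lr * grad(w)               # step in the direction of the negative gradient
print(w)                            # approaches the minimiser w = 3
```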
In MLP, why are sigmoid functions used instead of hard-step functions?
Hard-step functions aren't continuous (so they can't be differentiated for gradient-based learning); sigmoid functions are continuous and differentiable.
The sigmoid is especially useful for models where we have to predict a probability as an output: since a probability exists only in the range 0 to 1, the sigmoid is used instead of a hard-step function.
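A small sketch contrasting the two (assuming NumPy; the function names are illustrative):

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1.0, 0.0)   # hard threshold: 0 or 1, zero gradient elsewhere

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # smooth output in (0, 1)

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)                  # nonzero gradient, usable by backpropagation
```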
In MLP, what is the role of weight and bias?
A weight represents the strength of the connection between units. It decides how much influence the input node has on the output.
A bias ensures there's always an activation in a node, even if the weights are zero. This makes the MLP model more flexible.
Bayesian inference is a general alternative to maximum likelihood estimation that can be used to train a variety of models given data. Explain the main idea of Bayesian inference and compare with MLE. Your answer should mention the prior and posterior distributions over model parameters.
Bayesian inference places a prior distribution over the model parameters and combines it with the likelihood of the observed data, via Bayes' rule, to produce a posterior distribution over the parameters.
MLE ignores the prior and just estimates the single parameter value that maximizes the likelihood function.
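A minimal sketch using the classic coin-flip (Beta-Bernoulli) example; this is an assumed illustration, not anything from the exam:

```python
heads, tails = 7, 3                                 # toy observed data

# MLE: the single value that maximises the likelihood
theta_mle = heads / (heads + tails)                 # 0.7

# Bayesian: Beta(a, b) prior -> Beta(a + heads, b + tails) posterior
a, b = 2, 2                                         # prior pseudo-counts (hyperparameters)
post_a, post_b = a + heads, b + tails
theta_post_mean = post_a / (post_a + post_b)        # 9/14 ≈ 0.64, pulled toward the prior
print(theta_mle, theta_post_mean)
```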
If d= 10, how many parameters would a six degree polynomial have compared to the linear model?
61 (six coefficients per feature plus one bias: 6 × 10 + 1 = 61), compared to d + 1 = 11 for the linear model.
What is a hyper parameter in the context of Bayesian inference? Give an example.
The parameters of the prior distribution (and of the likelihood) are hyperparameters in the context of Bayesian inference; for example, the mean and variance of a Gaussian prior over a model parameter.
In machine learning, what is known as “generalization”?
Generalization is how well a trained model accurately classifies new, unseen data. An overfit model doesn't generalize well.
You are given a 5-dimensional dataset. After doing PCA, you discover that the 4th and 5th features have zero eigenvalues. What should you do?
They can be removed, as they don't contribute to the variance.
What's an expression for the percentage of the variance captured by the first principal component, where the eigenvalues of the covariance matrix of the data are lambda1 and lambda2?
lambda1 / (lambda1 + lambda2) × 100%
What is the total number of data points in this training set? How?
Sum all rows and columns.
24
How many data points do we have in each class?
Sum the rows:
A: 5
B: 4
C: 15
What is the sum of the diagonal values in a confusion matrix?
The number of accurately predicted data points.
What is overfitting?
Overfitting is where the model is too complex and fits the training data too well. This means it has captured noise/outliers in the data and can't generalize well.
MLPs are trained using backpropagation. Discuss in detail some practical challenges to do with this.
Backpropagation uses gradient descent to optimise the weights. The gradient descent algorithm is generally very slow because it requires small learning rates for stable learning.
Explain briefly what weight decay is. You can write it in terms of an equation or in words.
A number less than 1 that the weights are multiplied by at each update, to stop the weights growing too large in a neural network. This also prevents the neural network from overfitting.
The training error is higher with weight decay than without weight decay. Does this matter?
Not necessarily; test error is the highest priority, as it defines how well the model generalizes.
Discuss how weight decay works and what effect it can have on a model’s prediction (discriminant).
Weight decay is a regularization factor that penalizes a model for overfitting. This leads to simpler discriminant functions for categorising data.
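A sketch of one gradient descent update with weight decay (the learning-rate and decay values are illustrative):

```python
import numpy as np

def update(w, grad, lr=0.01, decay=1e-4):
    # the (1 - lr * decay) factor (< 1) shrinks every weight each step,
    # penalising large weights and so discouraging overfitting
    return w * (1 - lr * decay) - lr * grad

w = np.array([0.5, -1.2, 3.0])      # toy weights
g = np.array([0.1, -0.2, 0.4])      # toy gradient of the loss
w = update(w, g)
```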
Explain briefly how support vector machines utilise the concept of the margin in producing a classifier with good intended generalization performance?
The margin is the distance between the decision boundary and the closest data points (the support vectors). By choosing the boundary that maximises this distance, the probability that a new data point is classified correctly is increased.
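A minimal sketch, assuming scikit-learn is available (the data is a toy example):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # max-margin linear boundary
print(clf.support_vectors_)                   # the points that define the margin
print(clf.predict([[3, 3]]))                  # classify a new point
```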
What is the general principle behind Bayesian inference in machine learning?
The prior probability is combined with observed data to determine the posterior probability and update a hypothesis.
What does the shaded region represent in both of these pictures?
Figure 9(a) represents the prior distribution, which is our belief of what the model may be before any observations are made.
Figure 9(b) represents the posterior distribution: our belief of what the model may be based on the prior and the observed data. In figure 9(b) the shaded area shrinks at each observed data point, since we know the actual value of the model at those points, but as we move away from the observed values the range of possible model values grows, and so the distribution (shaded area) also grows.
What is the difference between classification and regression?
Regression is used for predicting continuous variables.
Classification is used for predicting categorical variables.
What’s the difference between likelihood and discriminant approaches to classification?
Likelihood approaches make assumptions regarding the distribution of the data. The goal is to use Bayes' rule and model the posterior distribution of the classes given the training data. A test data point is assigned the class with the highest posterior probability.
Discriminant approaches make no assumptions about the data and instead try to separate the classes with a boundary (the discriminant). This is accomplished using some distance measure and by placing a hyperplane between the classes, which lets a point be classified according to which side of the boundary it falls on.
What information does a confusion matrix give?
The numbers of FP, TP, FN and TN for a classifier on a dataset.
What is the forward selection algorithm?
Build a model by adding the feature with the highest significance first. If the error value E improves when the next most significant feature is added, that feature is kept, and the process continues until no addition improves E.
What are the steps of forward select algorithm in this example?
Check each column to see which row has two zeroes and a 1.
Look for the smallest value of these.
It is x2 in this case.
Check the combinations with x2, so (x1, x2) and (x2, x3).
If it's lower than the current E value, it becomes the new case.
In this example, E(x1, x2) < E(x2), so (x1, x2) is the new case.
Then check the next combination (x1, x2, x3): does it have a lower value than (x1, x2)? No, so x1 and x2 are the features that would be selected by the algorithm.
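A greedy forward-selection sketch; here `error` is a hypothetical function returning the validation error E of a feature subset:

```python
def forward_selection(features, error):
    selected, best_e = [], float("inf")
    while True:
        # try adding each remaining feature and keep the best candidate
        candidates = [(error(selected + [f]), f)
                      for f in features if f not in selected]
        if not candidates:
            break
        e, f = min(candidates)
        if e >= best_e:              # stop when no addition improves E
            break
        selected.append(f)
        best_e = e
    return selected
```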
What’s a limitation to Forward Selection?
It’s a greedy algorithm. It doesn’t consider the best subset overall.
If forward selection ends with (x2, x3), it may have ignored a better subset such as (x1, x3).
What do the symbols represent in a GMM?
Gi: the components/groups/clusters
P(Gi): the priors
P(x | Gi): the component densities
P(x): the mixture model of the given data
k: the number of clusters
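A minimal sketch relating these symbols to code, assuming scikit-learn (the data is toy):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# two toy clusters of 2D points
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])

gmm = GaussianMixture(n_components=2).fit(X)   # k = 2 components G_i
print(gmm.weights_)                  # the priors P(G_i)
print(gmm.means_)                    # parameters of the component densities P(x | G_i)
print(gmm.predict_proba(X[:3]))      # posterior responsibilities for some points
```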