Final Exam Past Exams Flashcards

1
Q

How is Occam’s Razor applied to Machine Learning?

A

If you have two machine learning models with comparable performance, prefer the simpler one.

2
Q

How many parameters does this model have?

A

d + 1 (one parameter per input dimension plus the bias/intercept term)

3
Q

What is the difference between feature selection and feature extraction?

A

Feature selection chooses a subset of the original features for building a model.

Feature extraction creates new features as functions (e.g. linear combinations) of the original features, whereas feature selection simply returns a subset of them.

4
Q

Describe PCA

A

PCA projects the data onto its principal components in order to represent it in a lower-dimensional space. The principal components are the directions (linear combinations of the original features) that capture the highest share of the variance.
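A minimal NumPy sketch of this idea (the toy data and all names here are illustrative, not from the exam):

```python
import numpy as np

# Hypothetical toy data: N samples, d features
X = np.random.randn(100, 5)

# 1. Centre the data
Xc = X - X.mean(axis=0)

# 2. Covariance matrix and its eigendecomposition
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order

# 3. Sort components by variance (largest eigenvalue first)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the first k principal components
k = 2
Z = Xc @ eigvecs[:, :k]                        # N x k lower-dimensional data

# Fraction of variance captured by the first k components
explained = eigvals[:k].sum() / eigvals.sum()
```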

5
Q

What dimensionality reduction technique works the best?

A

LDA (shown in red) works the best because it gives the best separability between the classes.

6
Q

How does the k-means clustering algorithm work and what is the “solution” that it produces?

A

k-means works by randomly initialising k centroids. Each data point is assigned to its nearest centroid, and each centroid is then updated to be the mean of the data points assigned to it. These assignment and update steps repeat until there is convergence. The “solution” it produces is the final set of k centroid locations together with the assignment of every data point to one of the k clusters.
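A minimal sketch of the algorithm in NumPy (illustrative only; it assumes no cluster becomes empty):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: returns centroid locations and cluster assignments."""
    rng = np.random.default_rng(seed)
    # Initialise k centroids at randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # convergence: centroids stop moving
            break
        centroids = new_centroids
    return centroids, labels
```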

7
Q

How would you assess if k-means clustering has worked properly?

A

If the centroids stop moving between iterations.

8
Q

How would you assess if k-means has converged?

A

If all the data points are assigned to the same clusters in successive iterations, the algorithm has converged.

9
Q

How do you decide how many base learners to use when bagging?

A

Choose the number that reduces the variance: keep adding base learners until the ensemble variance (or validation error) stops decreasing; that point is the optimum number.

10
Q

What is the misclassification error of this dataset?

A

The sum of all off-diagonal elements of the confusion matrix, divided by the total number of data points.
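A small worked example with a hypothetical 3-class confusion matrix (rows = true class, columns = predicted class; the numbers are illustrative only):

```python
import numpy as np

# Hypothetical confusion matrix
cm = np.array([[4, 1, 0],
               [0, 3, 1],
               [2, 0, 13]])

total = cm.sum()
correct = np.trace(cm)                                  # diagonal = correctly classified points
misclassification_error = (total - correct) / total     # off-diagonal sum / total
```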

11
Q

Explain these models in terms of overfitting/underfitting.

Top left - degree 1

Top right - degree 2

Bottom left - degree 10

Bottom right - degree 25

A

The top-left model is underfitting because no matter how much training data is added, its performance does not improve.

The bottom models are overfitting because the test error is significantly higher than the training error.

12
Q

What’s the purpose of the validation set?

A

The validation set is a set of examples, held out from the training set, that is used to tune the parameters (e.g. the hyperparameters, such as model complexity) of a classifier.

13
Q

One commonly used learning algorithm for linear discriminant models and MLP is Gradient Descent. What’s the basic idea behind gradient descent?

A

Iteratively adjust the function parameters (coefficients) in the direction of the negative gradient of a cost function, so as to minimise that cost function as far as possible.
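A minimal sketch of batch gradient descent for a linear-regression squared-error cost (names and settings are illustrative, not from the exam):

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_iter=1000):
    """Fit linear-regression weights by repeatedly stepping down the cost gradient."""
    N, d = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])      # prepend a column of 1s for the bias
    w = np.zeros(d + 1)                       # d + 1 parameters (cf. card 2)
    for _ in range(n_iter):
        grad = 2 / N * Xb.T @ (Xb @ w - y)    # gradient of the mean squared error
        w -= lr * grad                        # step in the negative gradient direction
    return w
```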

14
Q

In MLP, why are sigmoid functions used instead of hard-step functions?

A

Hard-step functions are not continuous (and so not differentiable), whereas sigmoid functions are; this is needed for gradient-based training such as backpropagation.

The sigmoid is also especially useful for models where we have to predict a probability as the output: since a probability only exists between 0 and 1, the sigmoid is used instead of a hard-step function.
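A small illustration of the difference (illustrative code): the hard step has no useful gradient, while the sigmoid is smooth with a simple derivative, which is what backpropagation needs.

```python
import numpy as np

def hard_step(a):
    return (a >= 0).astype(float)      # not differentiable at 0, gradient 0 elsewhere

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))    # smooth, output in (0, 1)

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1 - s)                 # non-zero gradient enables backpropagation
```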

15
Q

In MLP what is the role of weight and bias?

A

A weight represents the strength of the connection between units: it decides how much influence an input node has on the output.

A bias ensures there can still be an activation in a node even if the weighted inputs are zero; it shifts the activation and makes the MLP model more flexible.

16
Q

Bayesian inference is a general alternative to maximum likelihood estimation that can be used to train a variety of models given data. Explain the main idea of Bayesian inference and compare with MLE. Your answer should mention the prior and posterior distributions over model parameters.

A

Bayesian inference places a prior distribution over the model parameters and combines it with the likelihood of the data (via Bayes' rule) to obtain a posterior distribution over the parameters; estimates are then based on this posterior.

MLE just estimates the parameter values that maximise the likelihood function, with no prior or posterior involved.

17
Q

If d = 10, how many parameters would a degree-six polynomial have compared to the linear model?

A

61: with d = 10, a degree-six polynomial has 6 × 10 + 1 = 61 parameters (six coefficients per feature plus a bias), compared with d + 1 = 11 for the linear model.

18
Q

What is a hyper parameter in the context of Bayesian inference? Give an example.

A

The choices of prior and likelihood distributions, as well as the parameters of the prior distribution, are all hyperparameters in the context of Bayesian inference; for example, the mean and variance of a Gaussian prior over a model parameter.

19
Q

In machine learning, what is known as “generalization”?

A

Generalization is how well a trained model performs (e.g. classifies accurately) on new, unseen data. An overfit model doesn't generalize well.

20
Q

You are given a 5-dimensional dataset. After doing PCA, you discover that the 4th and 5th features have zero eigenvalues. What should you do?

A

The corresponding dimensions can be removed, as they contribute nothing to the variance: the data can be represented in 3 dimensions without losing any information.

21
Q

What's an expression for the percentage of the variance captured by the first principal component, where the eigenvalues of the covariance matrix of the data are lambda1 and lambda2?

A

lambda1 / (lambda1 + lambda2)

22
Q

What is the total number of data points in this training set? How?

A

Sum all the entries of the confusion matrix (across every row and column).

24

23
Q

How many data points do we have in each class?

A

Sum the rows:

A: 5

B: 4

C: 15

24
Q

What is the sum of the diagonal values in a confusion matrix?

A

The number of correctly classified data points.

25
Q

What is overfitting?

A

Overfitting is where the model is too complex and fits the training data too well. This means it has captured the noise/outliers in the training data and can't generalize well to new data.

26
Q

MLPs are trained using backpropagation. Discuss in detail some practical challenges to do with this.

A

Backpropagation uses gradient descent to optimise the weights. The gradient descent algorithm is generally very slow because it requires small learning rates for stable learning.

27
Q

Explain briefly what weight decay is. You can write it in terms of an equation or in words.

A

The weights are multiplied by a factor slightly less than 1 at each update, to stop the weights growing too large in a neural network (equivalently, an L2 penalty on the weights is added to the cost). This also helps prevent the neural network from overfitting.
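A common formulation, as a sketch (E is the error, λ the weight-decay strength, η the learning rate; not necessarily the exact notation used in the course):

```latex
E'(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\sum_i w_i^{2},
\qquad
w_i \leftarrow w_i - \eta\left(\frac{\partial E}{\partial w_i} + \lambda w_i\right)
           = (1-\eta\lambda)\,w_i - \eta\,\frac{\partial E}{\partial w_i}
```

The factor (1 − ηλ) < 1 is the "number multiplied to the weight" described above.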

28
Q

The training error is higher with weight decay than without weight decay. Does this matter?

A

No: test error is the highest priority, as it reflects how well the model generalizes; a somewhat higher training error is acceptable if the test error is lower.

29
Q

Discuss how weight decay works and what effect it can have on a model’s prediction (discriminant).

A

Weight decay is a regularization factor that penalizes large weights and so discourages overfitting. This leads to simpler (smoother) discriminant functions for categorising the data.

30
Q

Explain briefly how support vector machines utilise the concept of the margin in producing a classifier with good intended generalization performance?

A

The margin is the distance between the decision boundary and the closest data points (the support vectors). By choosing the boundary that maximises this margin, the probability that a new data point falls on the correct side and is classified correctly is increased, which is the intended source of good generalization.

31
Q

What is the general principle behind Bayesian inference in machine learning?

A

The prior probability is combined with the likelihood of the observed data, via Bayes' rule, to determine the posterior probability and thereby update the hypothesis.

32
Q

What does the shaded region represent in both of these pictures

A

Figure 9(a) represents the prior distribution, which is our belief of what the model may be before any observations are made.

Figure 9(b) represents the posterior distribution, i.e. our belief of what the model may be based on the prior and the observed data. In Figure 9(b) the shaded area shrinks at each observed data point, since we know the actual value of the model at those points, but as we move away from the observed values the range of possible model values grows and so the distribution (shaded area) widens again.

33
Q

What is the difference between classification and regression?

A

Regression is used for predicting continuous variables.

Classification is used for predicting categorical variables.

34
Q

What’s the difference between likelihood and discriminant approaches to classification?

A

Likelihood approaches make assumptions about the distribution of the data. The goal is to use Bayes' rule and model the posterior distribution of the classes given the training data; a test data point is assigned to the class with the highest posterior probability.

Discriminant approaches make no assumptions about the data distribution and instead try to separate the classes directly with a boundary (the discriminant). This is accomplished using some distance measure and by placing a hyperplane (or other surface) between the classes, so a point is classified according to which side of the boundary it falls on.

35
Q

What information does a confusion matrix give?

A

The numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) produced by a classifier on a dataset, per class.

36
Q

What is the forward selection algorithm?

A

Build a model by adding features one at a time, starting with the single feature that gives the lowest error E. At each step, the feature whose addition improves E the most is added next; this continues as long as adding another feature still improves E. A minimal sketch follows below.
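A hedged sketch of the greedy loop; `error` is an assumed helper that trains a model on a candidate feature subset and returns its validation error E:

```python
def forward_selection(features, error):
    """Greedy forward selection over a list of candidate features."""
    selected, best_err = [], float("inf")
    remaining = list(features)
    while remaining:
        # Try adding each remaining feature to the current subset
        trials = [(error(selected + [f]), f) for f in remaining]
        err, f = min(trials)
        if err >= best_err:          # no improvement in E: stop
            break
        selected.append(f)           # keep the feature that improves E the most
        remaining.remove(f)
        best_err = err
    return selected, best_err
```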

37
Q

What are the steps of the forward selection algorithm in this example?

A

Check each column to see which rows have two zeroes and a single 1 (the single-feature subsets), and look for the smallest E value among these. It is x2 in this case.

Then check the combinations that include x2, i.e. (x1, x2) and (x2, x3). If one of these has a lower E value than the current best, it becomes the new best subset. In this example E(x1, x2) < E(x2), so (x1, x2) is the new best.

Then check the next combination, (x1, x2, x3): does it have a lower E value than (x1, x2)? No, so x1 and x2 are the features that the algorithm would select.

38
Q

What’s a limitation to Forward Selection?

A

It's a greedy algorithm: it doesn't consider the best subset overall.

For example, if (x2, x3) is the final subset chosen by forward selection, (x1, x3) may never have been considered.

39
Q

What do the symbols represent in a GMM?

A

Gi: the components/groups/clusters

P(Gi): the priors

P(x | Gi): the component densities

P(x): the mixture model of the given data

k: the number of clusters

40
Q

What are the parameters of the model?

A

The parameters of a GMM here are: the k parameter (the number of clusters), the form of the covariance matrix, the initial conditions, and the regularization parameter.

41
Q

What are the E and M steps in the EM algorithm?

A

E: Compute the expected value of your “hidden variables”, based on the current values of the parameters

M: Recompute the most likely values of your parameters based on the values of the hidden variables and the observed data.
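A minimal sketch of one EM iteration for a 1-D Gaussian mixture (illustrative only; `pi`, `mu`, `sigma` are the current mixture weights, means and standard deviations):

```python
import numpy as np
from scipy.stats import norm

def em_step(x, pi, mu, sigma):
    """One EM iteration for a 1-D Gaussian mixture with k components."""
    # E-step: expected component memberships (responsibilities) given current parameters
    dens = np.array([p * norm.pdf(x, m, s) for p, m, s in zip(pi, mu, sigma)])  # k x N
    resp = dens / dens.sum(axis=0)
    # M-step: re-estimate the most likely parameters given the responsibilities
    Nk = resp.sum(axis=1)
    pi = Nk / len(x)
    mu = (resp @ x) / Nk
    sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)
    return pi, mu, sigma
```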

42
Q

Why might you get GMMs that have different covariance matrices?

A

As EM is started from random initial conditions, there is a chance it will converge to a local optimum. This is what can be seen in the results: the optimisation problem is non-convex, so EM does not guarantee a global optimum.

43
Q

What is a kernel density estimator? How does it use the data and what are the parameters?

A

A kernel density estimator is used to estimate an unknown probability distribution, similar to a histogram: you place a kernel function on each data point and sum them together. This gives a smooth distribution, rather than a histogram model that depends on the bin size. Its parameters are the choice of kernel function and its bandwidth (window width).
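A minimal 1-D sketch (Gaussian kernel; the bandwidth `h` is the smoothing parameter; names are illustrative):

```python
import numpy as np

def kde(x_query, data, h=0.5):
    """Kernel density estimate at x_query: place a Gaussian kernel of width h
    on every data point and average them."""
    u = (x_query - data[:, None]) / h                        # (N, Q) scaled distances
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)     # Gaussian kernel values
    return kernels.mean(axis=0) / h                          # smooth density estimate
```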

44
Q

Explain how Bayes rule is used to produce a classifier from probability density estimation models given supervised data. From this, explain what is shown in Figure 2 (right).

A

The probability density estimate for each class is used as the class-conditional likelihood p(x | Ci) in Bayes' rule; combined with the class priors P(Ci), this gives the posterior P(Ci | x).

The RHS of Figure 2 shows this posterior estimate, which acts as the discriminant: a test point is assigned to the class with the highest posterior.

45
Q

How many weights in the model?

A

For each layer, add 1 (for the bias) to the number of units and multiply by the number of units in the next layer to the right; sum these products over the layers.

So (784+1) × 50, then (50+1) × 50, (50+1) × 20 … until the last layer.
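A small calculation sketch; the later layer sizes are elided in the card (the "…"), so the sizes below are hypothetical:

```python
# Hypothetical layer sizes; each pair of layers contributes (inputs + 1 bias) * outputs weights.
layers = [784, 50, 50, 20, 10]
n_weights = sum((a + 1) * b for a, b in zip(layers, layers[1:]))
# (784+1)*50 + (50+1)*50 + (50+1)*20 + (20+1)*10 = 43_030
```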

46
Q

How do autoencoders reduce dimensionality?

A

By combining features in the encoder section of the autoencoder neural network: the bottleneck (code) layer has fewer units than the input, so it forms a compressed, lower-dimensional representation of the data.

47
Q

How is regularisation used in machine learning?

A

Regularisation is a technique, loosely based on Occam's razor, that restricts models from becoming overly complex and potentially overfitting.

48
Q

What do the results in Figure y show regarding

(i) the effect of weight decay;
(ii) the effect of varying the number of hidden units?

A

i) Weight decay stops the model from overfitting as complexity increases: as more hidden units/layers are added, weight decay keeps the weights of each node small.

ii) As the number of hidden units grows, the weights are decreased by the second (penalty) term in the equation, so each additional unit has less and less impact.

49
Q

Figure 4 shows results that examine more closely the effect of the weight decay value on the test error. Explain the relationship between the trend of the results in Figure 4, under/over-fitting and model complexity

A

With no weight decay the model overfits; as weight decay increases, the test error decreases. Weight decay finds a balance between bias and variance, which can be seen as the test error falls towards the optimal solution. As the parameter is increased further, the test error increases again because the model is limited to being too simple and mildly underfits the data.

50
Q

Explain what parts of the Adaboost need to be removed to make it a bagging algorithm.

A

Bagging is parallel; boosting is iterative, using elements misclassified by earlier learners to improve the model. Remove everything except the sampling and training steps.

In the given pseudocode, remove everything below “calculate” until “test”.

51
Q

How are kernel functions used in kernel machines?

A

A kernel function allows non-linear data to be (implicitly) mapped into a higher-dimensional space where it can be separated more easily.

52
Q

What are the hyperparameters in the context of Bayesian inference?

A

The pdfs chosen for the likelihood and the prior, as well as the parameters of the prior distribution.

53
Q

What’s the difference between MLE and Bayesian inference? Explain.

A

MLE is maximum likelihood estimation, a technique to find the parameters which maximise the likelihood (not the posterior probability) of the data. Unlike Bayesian inference, it does not take a prior over the parameters into account.

54
Q

How do they compute the cross-validation results?

A

The data is split into 10 subsets, and each fold is run with 10 − 1 = 9 of the subsets used for training; each time the held-out subset is rotated, until every subset has been left out once.

During each run the error values are recorded, and the mean and standard deviation of these outputs are calculated and shown on the figure.
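A hedged sketch of the procedure; `train_and_error` is an assumed helper that trains on one split and returns the error on the held-out fold:

```python
import numpy as np

def kfold_errors(X, y, train_and_error, k=10, seed=0):
    """k-fold cross-validation: returns the mean and std of the fold errors."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.hstack([folds[j] for j in range(k) if j != i])  # the other k-1 folds
        errors.append(train_and_error(X[train_idx], y[train_idx], X[test_idx], y[test_idx]))
    return np.mean(errors), np.std(errors)   # summary statistics shown on the figure
```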

55
Q

What does the k value do in k-fold cross-validation?

A

A larger k means less bias but higher variance and a higher running time.

A smaller k means more bias but less variance.

56
Q

In two or three sentences, outline the main, general concepts of how the Stochastic Neighbor Embedding and t-Distributed Stochastic Neighbour Embedding techniques for dimensionality reduction work.

A

t-SNE works by computing the pairwise likelihoods of generating data points in the high-dimensional space (P), and then trying to find an embedding that minimises the KL divergence between P and the corresponding pairwise likelihoods computed in the low-dimensional space (Q).

57
Q

How many experiments?

A

Experiments = 5 × 5 × 5 × 5 × 2 = 1250

(5 values for each of the first four parameters and 2 for the last.)

58
Q

Categorise these as supervised or unsupervised?

(i) Quadratic discriminant analysis
(ii) Multidimensional scaling
(iii) Fisher’s Linear discriminant analysis
(iv) Gaussian mixture models

A

Quadratic Discriminant - Supervised

Multidimensional scaling - Unsupervised

Fisher’s Linear discriminant analysis - supervised

Gaussian mixture models - unsupervised

59
Q

Draw a diagram

A
60
Q

Whether or not you might panic (P) before an exam is probably influenced by whether or not you decide to attend lectures (A) and/or revise (R) the material. In turn, your attendance and revision is probably related to whether or not the course was boring (B).

Write down the factorization of the joint probability distribution that the network implies

A

P(A,B,R,P) = P(B) × P(A|B) × P(R|B) × P(P|A,R)

Probability of Boring * Probability (attend lecture given it was boring) * Probability(Revise given it was boring) * Probability (Panic given you attend lectures and revised)

61
Q

It is impossible to converge to a non-global minimum when training a Support Vector Machine. T or F?

A

True

62
Q

Overfitting is likely to result in high training set error, but low test set error

A

False

63
Q

Bayes Rule can be used to produce a classifier from a non-parametric density estimator

A

True

64
Q

The results of a feature selection algorithm such as forward selection will depend on the underlying classifier or regression model used.

A

True

65
Q

Fisher’s Linear Discriminant Analysis is a dimensionality reduction technique which requires a supervised (labelled) dataset

A

True

66
Q

In the k-Nearest Neighbours classifier, a large value of k will make the discriminant highly sensitive to individual outliers in the data.

A

False

67
Q

Occam’s razor was a popular technique for training neural networks in the 1980s, but fell out of fashion when researchers realised that it was blunt

A

False

68
Q

(i) What optimisation algorithm is being used inside t-SNE?

A

t-SNE uses gradient descent.

69
Q

Explain briefly the intuition behind t-SNE: what does it try to do with the low-dimensional data representation?

A

t-SNE aims to represent as much of the structure in the data as possible in a lower-dimensional map (one that can be shown as a scatterplot). It focuses particularly on retaining the local and global structure in the data, which many other techniques struggle with.

70
Q

In Figure 3, what symbol is used to denote the objective function for the optimisation problem?

A

C. C for cost

71
Q

Using the notation from Figure 3, how would you indicate the dataset in the reduced (i.e. low dimensional) space after 100 iterations?

A

y^(100)

72
Q

Suppose you ran this algorithm on a dataset for some learning rate value and then increased the learning rate and repeated this process several times. How might you identify when the learning rate had become too large? Explain in some detail. You might like to draw a graph to assist with your explanation.

A

A learning rate that is too large means gradient descent doesn't converge well: the cost oscillates or even increases instead of decreasing smoothly. Plotting the cost against iterations for each learning-rate value, the run where the cost stops decreasing (or blows up) indicates that the learning rate has become too large.

73
Q

The mean shift clustering algorithm, in contrast to k-means, does not require the user to specify the number of clusters in advance. Discuss the main factors that will affect the number of clusters that the mean shift algorithm converges to?

A

The size of the window (the kernel bandwidth) and the means computed within the windows determine how mean shift converges: a smaller window typically produces more clusters, a larger window fewer.

74
Q

What is the intuition given (in terms of feature extraction) for the success of deep neural networks (i.e. models with a large number of hidden units)?

A

Deep neural networks include implicit feature extraction; the complexity of this feature extraction can increase with the number of neurons and layers in the network, creating increasingly abstract representations of the underlying structure in the data.

75
Q

P(A,B,C,D,E,F,G,H) = P(A) P(B|A) P(C|A) P(D|B) P(E|B) P(F|C) P(G|C) P(H|F)

Draw this Bayesian Network.

A

The variable on the right of the conditioning bar points to the variable on the left, so P(B|A) means A → B. A will be the root node of this Bayesian network.

76
Q

P(A,B,C,D,E,F,G,H) = P(A) P(B|A) P(C|A) P(D|B) P(E|B) P(F|C) P(G|C) P(H|F)

Calculate the number of values that are needed to be stored for the conditional probability tables of this Bayesian network.

A

2 for each P(x|y)

1 for P(z)

So (2 × 7) + 1 = 15

77
Q

P(A,B,C,D,E,F,G,H) = P(A) P(B|A) P(C|A) P(D|B) P(E|B) P(F|C) P(G|C) P(H|F)

Calculate the number of values that are needed to be stored to specify the full joint distribution (i.e. an 8-dimensional Bernoulli distribution).

A

Number of variables = 8 (A–H)

Full joint distribution = 2^8 = 256 values

78
Q

A random forest is an example of an ensemble learning method. (i) What is the base learner in a random forest?

A

Decision tree

79
Q

(ii) Explain briefly how ensemble learning works (training and testing).

A

Ensemble learning works by training a collection of simple ML models on different subsets (resamplings) of the original data. At test time, the predictions of these varying simple models are combined (e.g. by voting or averaging) into one “aggregated” prediction.

80
Q

What is the main potential benefit of ensemble learning, in terms of the bias/variance error decomposition? Explain how ensemble learning achieves this benefit.

A

Bagging: reduces the variance as more models are added.

Boosting: Boosting reduces the bias of the overall model by focusing on previously misclassified data. This is done by increasing the chance that a previously misclassified point is resampled and is appropriately classified.

81
Q

Explain what the curve and shaded region in Figure 6 represent. What does the model give as an output for a test input?

A

The curve is the predictive mean, and the shaded region shows how confident the model is around a particular input x. This is the posterior output, in which every finite subset of inputs is jointly represented by a multivariate Gaussian distribution.

A GP is a method of regression: given a test input, the model outputs a Gaussian distribution for that point (a predictive mean and variance), which gives insight into how certain the prediction at that point is.

82
Q

Explain why the curve and shaded region in Figure 6 have this shape by referring to the data and the covariance function of the Gaussian Process

A

Before Figure 6 the model is represented as a prior over functions; following the observation of some data, Figure 6 shows the resulting posterior.

As more data is seen, the confidence at those points increases, which results in a narrower shaded region there. The covariance function controls how the function space behaves, e.g. through the characteristic length-scale, i.e. whether the functions are smoother or noisier.

83
Q

The pseudocode algorithms in this question is taken from the Alpaydin textbook.

(a) In the pseudocode, what does N represent?

A

The number of data points

84
Q

(b) In the pseudocode, what does d represent?

A

The number of features

85
Q

(c) As stated in the caption, Figure 11.3 is the online version of Figure 10.8. Explain what this means in terms of gradient descent optimisation.

A

Online version: the weights are updated after each individual data point rather than after a full pass over the dataset. In gradient descent this may help escape from local minima, and it often converges faster.

86
Q

What is the learning rate parameter in these algorithms?

If the learning rate is too large, what effects might be observed during training?

A

The η (eta) factor in the weight-update rule.

If the learning rate is too large, the algorithm may overshoot and fail to converge to the local optimum (the error can oscillate or diverge).

87
Q

Figure 11.11 shows pseudocode for backpropagation in a multi-layer perceptron with a single hidden layer. What symbol represents the weights connecting the inputs to hidden units?

A

The w values (in this pseudocode, v denotes the weights from the hidden units to the outputs).

88
Q

If we modify the code in Figure 11.11 (as the caption suggests) to perform two-class classification, which lines need to be modified? Write out the modifications

A

Change lines 6–8 so the output is computed as y_i = sigmoid(v_iᵀ z), i.e. a sigmoid output for the two-class case.

89
Q

Does Figure 11.11 describe online or batch backpropagation? Explain your answer by referring to the pseudocode

A

Online (stochastic, since the data points are processed in random order): the weight updates for the output and hidden layers are done for each data point, instead of once per whole pass over the dataset.

90
Q

(i) What do the “ReLU” rows in this output represent?

A

Using the Rectified Linear Unit Activation function

91
Q

(ii) Which layers contain adaptive weights that are modified during training?

A

The convolution layers (adaptive filters) and the fully connected layers (adaptive weights); no other layers contain weights that are modified during training.

92
Q

(iii) What happens in the “Dropout” layers?

A

A random fraction of the nodes is temporarily discarded (set to zero) during each training update.

93
Q

Bayesian networks make use of a directed acyclic graph to describe a multivariate probability distribution

A

True

94
Q

Boosting can be used to build an ensemble of learners for either classification or regression problems

A

True

95
Q

Support vector machines build a model of the probability distribution of each class in a classification problem

A

False

96
Q

Bayes Rule can be used to produce a classifier from a non-parametric density estimator.

A

True

97
Q

Classifiers with a high predictive variance will always have a low predictive bias

A

True

98
Q

Using the general idea of Bayesian learning, explain the relationship between the hyperparameter values and the prior distribution in a Gaussian Process regression model.

A

The covariance function specifies the prior in a GP. The hyperparameters are the parameters of the covariance function, so the hyperparameters change the prior. This describes the types of functions we expect to see before observing any data.

99
Q

Which of the Gaussian Processes in Figures 9-11 has the highest training set error? Which Gaussian Process do you think will have the highest test set error? Explain why briefly

A

Highest training set error: the GP with the biggest difference between the GP mean (line) and the data.

Highest test set error: probably the last one (it is highly non-smooth and overfitting the data).