Final Exam Past Exams Flashcards

1
Q

How is Occam’s Razor applied to Machine Learning?

A

If you have two machine learning models with comparable performance, prefer the simpler one.

2
Q

How many parameters does this model have?

A

d + 1 (one parameter per input dimension plus the bias/intercept term)

3
Q

What is the difference between feature selection and feature extraction?

A

Feature selection chooses a subset of the original features for building a model.

Feature extraction creates new features as functions (e.g. linear combinations) of the original features, whereas feature selection simply returns a subset of them.

4
Q

Describe PCA

A

PCA projects the data onto its principal components in order to represent it in a lower-dimensional space. The principal components are the directions (linear combinations of the original features) that capture the highest share of the variance.
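A minimal NumPy sketch of this idea (the toy data and all names here are illustrative, not from the exam):

```python
import numpy as np

# Hypothetical toy data: N samples, d features
X = np.random.randn(100, 5)

# 1. Centre the data
Xc = X - X.mean(axis=0)

# 2. Covariance matrix and its eigendecomposition
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order

# 3. Sort components by variance (largest eigenvalue first)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the first k principal components
k = 2
Z = Xc @ eigvecs[:, :k]                        # N x k lower-dimensional data

# Fraction of variance captured by the first k components
explained = eigvals[:k].sum() / eigvals.sum()
```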

5
Q

What dimensionality reduction technique works the best?

A

LDA (shown in red) works the best because it gives the best separability between the classes.

6
Q

How does the k-means clustering algorithm work and what is the “solution” that it produces?

A

k-means works by randomly initialising k centroids. Each data point is assigned to its nearest centroid, and each centroid is then updated to be the mean of the data points assigned to it. These assignment and update steps repeat until there is convergence. The “solution” it produces is the final set of k centroid locations together with the assignment of every data point to one of the k clusters.
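A minimal sketch of the algorithm in NumPy (illustrative only; it assumes no cluster becomes empty):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: returns centroid locations and cluster assignments."""
    rng = np.random.default_rng(seed)
    # Initialise k centroids at randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # convergence: centroids stop moving
            break
        centroids = new_centroids
    return centroids, labels
```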

7
Q

How would you assess if k-means clustering has worked properly?

A

If the centroids stop moving between iterations.

8
Q

How would you assess if k-means has converged?

A

If all the data points are assigned to the same clusters in successive iterations, the algorithm has converged.

9
Q

How do you decide how many base learners to use when bagging?

A

Choose the number that reduces the variance: keep adding base learners until the ensemble variance (or validation error) stops decreasing; that point is the optimum number.

10
Q

What is the misclassification error of this dataset?

A

The sum of all off-diagonal elements of the confusion matrix, divided by the total number of data points.
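A small worked example with a hypothetical 3-class confusion matrix (rows = true class, columns = predicted class; the numbers are illustrative only):

```python
import numpy as np

# Hypothetical confusion matrix
cm = np.array([[4, 1, 0],
               [0, 3, 1],
               [2, 0, 13]])

total = cm.sum()
correct = np.trace(cm)                                  # diagonal = correctly classified points
misclassification_error = (total - correct) / total     # off-diagonal sum / total
```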

11
Q

Explain these models in terms of overfitting/underfitting.

Top left - degree 1

Top right - degree 2

Bottom left - degree 10

Bottom right - degree 25

A

The top-left model is underfitting because no matter how much training data is added, its performance does not improve.

The bottom models are overfitting because the test error is significantly higher than the training error.

12
Q

What’s the purpose of the validation set?

A

The validation set is a set of examples, held out from the training set, that is used to tune the parameters (e.g. the hyperparameters, such as model complexity) of a classifier.

13
Q

One commonly used learning algorithm for linear discriminant models and MLP is Gradient Descent. What’s the basic idea behind gradient descent?

A

Iteratively adjust the function parameters (coefficients) in the direction of the negative gradient of a cost function, so as to minimise that cost function as far as possible.
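A minimal sketch of batch gradient descent for a linear-regression squared-error cost (names and settings are illustrative, not from the exam):

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_iter=1000):
    """Fit linear-regression weights by repeatedly stepping down the cost gradient."""
    N, d = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])      # prepend a column of 1s for the bias
    w = np.zeros(d + 1)                       # d + 1 parameters (cf. card 2)
    for _ in range(n_iter):
        grad = 2 / N * Xb.T @ (Xb @ w - y)    # gradient of the mean squared error
        w -= lr * grad                        # step in the negative gradient direction
    return w
```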

14
Q

In MLP, why are sigmoid functions used instead of hard-step functions?

A

Hard-step functions are not continuous (and so not differentiable), whereas sigmoid functions are; this is needed for gradient-based training such as backpropagation.

The sigmoid is also especially useful for models where we have to predict a probability as the output: since a probability only exists between 0 and 1, the sigmoid is used instead of a hard-step function.
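A small illustration of the difference (illustrative code): the hard step has no useful gradient, while the sigmoid is smooth with a simple derivative, which is what backpropagation needs.

```python
import numpy as np

def hard_step(a):
    return (a >= 0).astype(float)      # not differentiable at 0, gradient 0 elsewhere

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))    # smooth, output in (0, 1)

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1 - s)                 # non-zero gradient enables backpropagation
```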

15
Q

In MLP what is the role of weight and bias?

A

A weight represents the strength of the connection between units: it decides how much influence an input node has on the output.

A bias ensures there can still be an activation in a node even if the weighted inputs are zero; it shifts the activation and makes the MLP model more flexible.

16
Q

Bayesian inference is a general alternative to maximum likelihood estimation that can be used to train a variety of models given data. Explain the main idea of Bayesian inference and compare with MLE. Your answer should mention the prior and posterior distributions over model parameters.

A

Bayesian inference places a prior distribution over the model parameters and combines it with the likelihood of the data (via Bayes' rule) to obtain a posterior distribution over the parameters; estimates are then based on this posterior.

MLE just estimates the parameter values that maximise the likelihood function, with no prior or posterior involved.

17
Q

If d = 10, how many parameters would a degree-six polynomial have compared to the linear model?

A

61: with d = 10, a degree-six polynomial has 6 × 10 + 1 = 61 parameters (six coefficients per feature plus a bias), compared with d + 1 = 11 for the linear model.

18
Q

What is a hyper parameter in the context of Bayesian inference? Give an example.

A

The choices of prior and likelihood distributions, as well as the parameters of the prior distribution, are all hyperparameters in the context of Bayesian inference; for example, the mean and variance of a Gaussian prior over a model parameter.

19
Q

In machine learning, what is known as “generalization”?

A

Generalization is how well a trained model performs (e.g. classifies accurately) on new, unseen data. An overfit model doesn't generalize well.

20
Q

You are given a 5-dimensional dataset. After doing PCA, you discover that the 4th and 5th features have zero eigenvalues. What should you do?

A

The corresponding dimensions can be removed, as they contribute nothing to the variance: the data can be represented in 3 dimensions without losing any information.

21
Q

What's an expression for the percentage of the variance captured by the first principal component, where the eigenvalues of the covariance matrix of the data are lambda1 and lambda2?

A

lambda1 / (lambda1 + lambda2)

22
Q

What is the total number of data points in this training set? How?

A

Sum all the entries of the confusion matrix (across every row and column).

24

23
Q

How many data points do we have in each class?

A

Sum the rows:

A: 5

B: 4

C: 15

24
Q

What is the sum of the diagonal values in a confusion matrix?

A

The number of correctly classified data points.

25
Q

What is overfitting?

A

Overfitting is where the model is too complex and fits the training data too well. This means it has captured the noise/outliers in the training data and can't generalize well to new data.

26
Q

MLPs are trained using backpropagation. Discuss in detail some practical challenges to do with this.

A

Backpropagation uses gradient descent to optimise the weights. The gradient descent algorithm is generally very slow because it requires small learning rates for stable learning.

27
Q

Explain briefly what weight decay is. You can write it in terms of an equation or in words.

A

The weights are multiplied by a factor slightly less than 1 at each update, to stop the weights growing too large in a neural network (equivalently, an L2 penalty on the weights is added to the cost). This also helps prevent the neural network from overfitting.
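A common formulation, as a sketch (E is the error, λ the weight-decay strength, η the learning rate; not necessarily the exact notation used in the course):

```latex
E'(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\sum_i w_i^{2},
\qquad
w_i \leftarrow w_i - \eta\left(\frac{\partial E}{\partial w_i} + \lambda w_i\right)
           = (1-\eta\lambda)\,w_i - \eta\,\frac{\partial E}{\partial w_i}
```

The factor (1 − ηλ) < 1 is the "number multiplied to the weight" described above.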

28
Q

The training error is higher with weight decay than without weight decay. Does this matter?

A

No: test error is the highest priority, as it reflects how well the model generalizes; a somewhat higher training error is acceptable if the test error is lower.

29
Q

Discuss how weight decay works and what effect it can have on a model’s prediction (discriminant).

A

Weight decay is a regularization factor that penalizes large weights and so discourages overfitting. This leads to simpler (smoother) discriminant functions for categorising the data.

30
Q

Explain briefly how support vector machines utilise the concept of the margin in producing a classifier with good intended generalization performance?

A

The margin is the distance between the decision boundary and the closest data points (the support vectors). By choosing the boundary that maximises this margin, the probability that a new data point falls on the correct side and is classified correctly is increased, which is the intended source of good generalization.

31
Q

What is the general principle behind Bayesian inference in machine learning?

A

The prior probability is combined with the likelihood of the observed data, via Bayes' rule, to determine the posterior probability and thereby update the hypothesis.

32
Q

What does the shaded region represent in both of these pictures

A

Figure 9(a) represents the prior distribution, which is our belief of what the model may be before any observations are made.

Figure 9(b) represents the posterior distribution, i.e. our belief of what the model may be based on the prior and the observed data. In Figure 9(b) the shaded area shrinks at each observed data point, since we know the actual value of the model at those points, but as we move away from the observed values the range of possible model values grows and so the distribution (shaded area) widens again.

33
Q

What is the difference between classification and regression?

A

Regression is used for predicting continuous variables.

Classification is used for predicting categorical variables.

34
Q

What’s the difference between likelihood and discriminant approaches to classification?

A

Likelihood approaches make assumptions about the distribution of the data. The goal is to use Bayes' rule and model the posterior distribution of the classes given the training data; a test data point is assigned to the class with the highest posterior probability.

Discriminant approaches make no assumptions about the data distribution and instead try to separate the classes directly with a boundary (the discriminant). This is accomplished using some distance measure and by placing a hyperplane (or other surface) between the classes, so a point is classified according to which side of the boundary it falls on.

35
Q

What information does a confusion matrix give?

A

The numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) produced by a classifier on a dataset, per class.

36
Q

What is the forward selection algorithm?

A

Build a model by adding features one at a time, starting with the single feature that gives the lowest error E. At each step, the feature whose addition improves E the most is added next; this continues as long as adding another feature still improves E. A minimal sketch follows below.
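A hedged sketch of the greedy loop; `error` is an assumed helper that trains a model on a candidate feature subset and returns its validation error E:

```python
def forward_selection(features, error):
    """Greedy forward selection over a list of candidate features."""
    selected, best_err = [], float("inf")
    remaining = list(features)
    while remaining:
        # Try adding each remaining feature to the current subset
        trials = [(error(selected + [f]), f) for f in remaining]
        err, f = min(trials)
        if err >= best_err:          # no improvement in E: stop
            break
        selected.append(f)           # keep the feature that improves E the most
        remaining.remove(f)
        best_err = err
    return selected, best_err
```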

37
Q

What are the steps of the forward selection algorithm in this example?

A

Check each column to see which rows have two zeroes and a single 1 (the single-feature subsets), and look for the smallest E value among these. It is x2 in this case.

Then check the combinations that include x2, i.e. (x1, x2) and (x2, x3). If one of these has a lower E value than the current best, it becomes the new best subset. In this example E(x1, x2) < E(x2), so (x1, x2) is the new best.

Then check the next combination, (x1, x2, x3): does it have a lower E value than (x1, x2)? No, so x1 and x2 are the features that the algorithm would select.

38
Q

What’s a limitation to Forward Selection?

A

It's a greedy algorithm: it doesn't consider the best subset overall.

For example, if (x2, x3) is the final subset chosen by forward selection, (x1, x3) may never have been considered.

39
Q

What do the symbols represent in a GMM?

A

Gi: the components/groups/clusters

P(Gi): the priors

P(x | Gi): the component densities

P(x): the mixture model of the given data

k: the number of clusters

40
Q

What are the parameters of the model?

A

The parameters of a GMM here are: the k parameter (the number of clusters), the form of the covariance matrix, the initial conditions, and the regularization parameter.

41
Q

What are the E and M steps in the EM algorithm?

A

E: Compute the expected value of your “hidden variables”, based on the current values of the parameters

M: Recompute the most likely values of your parameters based on the values of the hidden variables and the observed data.
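A minimal sketch of one EM iteration for a 1-D Gaussian mixture (illustrative only; `pi`, `mu`, `sigma` are the current mixture weights, means and standard deviations):

```python
import numpy as np
from scipy.stats import norm

def em_step(x, pi, mu, sigma):
    """One EM iteration for a 1-D Gaussian mixture with k components."""
    # E-step: expected component memberships (responsibilities) given current parameters
    dens = np.array([p * norm.pdf(x, m, s) for p, m, s in zip(pi, mu, sigma)])  # k x N
    resp = dens / dens.sum(axis=0)
    # M-step: re-estimate the most likely parameters given the responsibilities
    Nk = resp.sum(axis=1)
    pi = Nk / len(x)
    mu = (resp @ x) / Nk
    sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)
    return pi, mu, sigma
```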

42
Q

Why might you get GMMs that have different covariance matrices?

A

As EM is started from random initial conditions, there is a chance it will converge to a local optimum. This is what can be seen in the results: the optimisation problem is non-convex, so EM does not guarantee a global optimum.

43
Q

What is a kernel density estimator? How does it use the data and what are the parameters?

A

A kernel density estimator is used to estimate an unknown probability distribution, similar to a histogram: you place a kernel function on each data point and sum them together. This gives a smooth distribution, rather than a histogram model that depends on the bin size. Its parameters are the choice of kernel function and its bandwidth (window width).
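A minimal 1-D sketch (Gaussian kernel; the bandwidth `h` is the smoothing parameter; names are illustrative):

```python
import numpy as np

def kde(x_query, data, h=0.5):
    """Kernel density estimate at x_query: place a Gaussian kernel of width h
    on every data point and average them."""
    u = (x_query - data[:, None]) / h                        # (N, Q) scaled distances
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)     # Gaussian kernel values
    return kernels.mean(axis=0) / h                          # smooth density estimate
```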

44
Q

Explain how Bayes rule is used to produce a classifier from probability density estimation models given supervised data. From this, explain what is shown in Figure 2 (right).

A

The probability density estimate for each class is used as the class-conditional likelihood p(x | Ci) in Bayes' rule; combined with the class priors P(Ci), this gives the posterior P(Ci | x).

The RHS of Figure 2 shows this posterior estimate, which acts as the discriminant: a test point is assigned to the class with the highest posterior.

45
Q

How many weights in the model?

A

For each layer, add 1 (for the bias) to the number of units and multiply by the number of units in the next layer to the right; sum these products over the layers.

So (784+1) × 50, then (50+1) × 50, (50+1) × 20 … until the last layer.
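A small calculation sketch; the later layer sizes are elided in the card (the "…"), so the sizes below are hypothetical:

```python
# Hypothetical layer sizes; each pair of layers contributes (inputs + 1 bias) * outputs weights.
layers = [784, 50, 50, 20, 10]
n_weights = sum((a + 1) * b for a, b in zip(layers, layers[1:]))
# (784+1)*50 + (50+1)*50 + (50+1)*20 + (20+1)*10 = 43_030
```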

46
Q

How do autoencoders reduce dimensionality?

A

By combining features in the encoder section of the autoencoder neural network: the bottleneck (code) layer has fewer units than the input, so it forms a compressed, lower-dimensional representation of the data.

47
Q

How is regularisation used in machine learning?

A

Regularisation is a technique, loosely based on Occam's razor, that restricts models from becoming overly complex and potentially overfitting.

48
Q

What do the results in Figure y show regarding

(i) the effect of weight decay;
(ii) the effect of varying the number of hidden units?

A

i) Weight decay stops the model from overfitting as complexity increases: as more hidden units/layers are added, weight decay keeps the weights of each node small.

ii) As the number of hidden units grows, the weights are decreased by the second (penalty) term in the equation, so each additional unit has less and less impact.

49
Q

Figure 4 shows results that examine more closely the effect of the weight decay value on the test error. Explain the relationship between the trend of the results in Figure 4, under/over-fitting and model complexity

A

With no weight decay the model overfits; as weight decay increases, the test error decreases. Weight decay finds a balance between bias and variance, which can be seen as the test error falls towards the optimal solution. As the parameter is increased further, the test error increases again because the model is limited to being too simple and mildly underfits the data.

50
Q

Explain what parts of the Adaboost need to be removed to make it a bagging algorithm.

A

Bagging is parallel; boosting is iterative, using elements misclassified by earlier learners to improve the model. Remove everything except the sampling and training steps.

In the given pseudocode, remove everything below “calculate” until “test”.

51
Q

How are kernel functions used in kernel machines?

A

A kernel function allows non-linear data to be (implicitly) mapped into a higher-dimensional space where it can be separated more easily.

52
Q

What are the hyperparameters in the context of Bayesian inference?

A

The pdfs chosen for the likelihood and the prior, as well as the parameters of the prior distribution.

53
Q

What’s the difference between MLE and Bayesian inference? Explain.

A

MLE is maximum likelihood estimation, a technique to find the parameters which maximise the likelihood (not the posterior probability) of the data. Unlike Bayesian inference, it does not take a prior over the parameters into account.

54
Q

How do they compute the cross-validation results?

A

The data is split into 10 subsets, and each fold is run with 10 − 1 = 9 of the subsets used for training; each time the held-out subset is rotated, until every subset has been left out once.

During each run the error values are recorded, and the mean and standard deviation of these outputs are calculated and shown on the figure.
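A hedged sketch of the procedure; `train_and_error` is an assumed helper that trains on one split and returns the error on the held-out fold:

```python
import numpy as np

def kfold_errors(X, y, train_and_error, k=10, seed=0):
    """k-fold cross-validation: returns the mean and std of the fold errors."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.hstack([folds[j] for j in range(k) if j != i])  # the other k-1 folds
        errors.append(train_and_error(X[train_idx], y[train_idx], X[test_idx], y[test_idx]))
    return np.mean(errors), np.std(errors)   # summary statistics shown on the figure
```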

55
Q

What does the k value do in k-fold cross-validation?

A

A larger k means less bias but higher variance and a higher running time.

A smaller k means more bias but less variance.

56
Q

In two or three sentences, outline the main, general concepts of how the Stochastic Neighbor Embedding and t-Distributed Stochastic Neighbour Embedding techniques for dimensionality reduction work.

A

t-SNE works by computing the pairwise likelihoods of generating data points in the high-dimensional space (P), and then trying to find an embedding that minimises the KL divergence between P and the corresponding pairwise likelihoods computed in the low-dimensional space (Q).

57
Q

How many experiments?

A

Experiments = 5 × 5 × 5 × 5 × 2 = 1250

(5 values for each of the first four parameters and 2 for the last.)

58
Q

Categorise these as supervised or unsupervised?

(i) Quadratic discriminant analysis
(ii) Multidimensional scaling
(iii) Fisher’s Linear discriminant analysis
(iv) Gaussian mixture models

A

Quadratic Discriminant - Supervised

Multidimensional scaling - Unsupervised

Fisher’s Linear discriminant analysis - supervised

Gaussian mixture models - unsupervised

59
Q

Draw a diagram

A
60
Q

Whether or not you might panic (P) before an exam is probably influenced by whether or not you decide to attend lectures (A) and/or revise (R) the material. In turn, your attendance and revision is probably related to whether or not the course was boring (B).

Write down the factorization of the joint probability distribution that the network implies

A

P(A,B,R,P) = P(B) × P(A|B) × P(R|B) × P(P|A,R)

Probability of Boring * Probability (attend lecture given it was boring) * Probability(Revise given it was boring) * Probability (Panic given you attend lectures and revised)

61
Q

It is impossible to converge to a non-global minimum when training a Support Vector Machine. T or F?

A

True

62
Q

Overfitting is likely to result in high training set error, but low test set error

A

False

63
Q

Bayes Rule can be used to produce a classifier from a non-parametric density estimator

A

True

64
Q

The results of a feature selection algorithm such as forward selection will depend on the underlying classifier or regression model used.

A

True

65
Q

Fisher’s Linear Discriminant Analysis is a dimensionality reduction technique which requires a supervised (labelled) dataset

A

True

66
Q

In the k-Nearest Neighbours classifier, a large value of k will make the discriminant highly sensitive to individual outliers in the data.

A

False

67
Q

Occam’s razor was a popular technique for training neural networks in the 1980s, but fell out of fashion when researchers realised that it was blunt

A

False

68
Q

(i) What optimisation algorithm is being used inside t-SNE?

A

t-SNE uses gradient descent.

69
Q

Explain briefly the intuition behind t-SNE: what does it try to do with the low-dimensional data representation?

A

t-SNE aims to represent as much of the structure in the data as possible in a lower-dimensional map (one that can be shown as a scatterplot). It focuses particularly on retaining the local and global structure in the data, which many other techniques struggle with.

70
Q

In Figure 3, what symbol is used to denote the objective function for the optimisation problem?

A

C. C for cost

71
Q

Using the notation from Figure 3, how would you indicate the dataset in the reduced (i.e. low dimensional) space after 100 iterations?

A

y^(100)

72
Q

Suppose you ran this algorithm on a dataset for some learning rate value and then increased the learning rate and repeated this process several times. How might you identify when the learning rate had become too large? Explain in some detail. You might like to draw a graph to assist with your explanation.

A

A learning rate that is too large means gradient descent doesn't converge well: the cost oscillates or even increases instead of decreasing smoothly. Plotting the cost against iterations for each learning-rate value, the run where the cost stops decreasing (or blows up) indicates that the learning rate has become too large.

73
Q

The mean shift clustering algorithm, in contrast to k-means, does not require the user to specify the number of clusters in advance. Discuss the main factors that will affect the number of clusters that the mean shift algorithm converges to?

A

The size of the window (the kernel bandwidth) and the means computed within the windows determine how mean shift converges: a smaller window typically produces more clusters, a larger window fewer.

74
Q

What is the intuition given (in terms of feature extraction) for the success of deep neural networks (i.e. models with a large number of hidden units)?

A

Deep neural networks include implicit feature extraction; the complexity of this feature extraction can increase with the number of neurons and layers in the network, creating increasingly abstract representations of the underlying structure in the data.

75
Q

P(A,B,C,D,E,F,G,H) = P(A) P(B|A) P(C|A) P(D|B) P(E|B) P(F|C) P(G|C) P(H|F)

Draw this Bayesian Network.

A

The variable on the right of the conditioning bar points to the variable on the left, so P(B|A) means A → B. A will be the root node of this Bayesian network.

76
Q

P(A,B,C,D,E,F,G,H) = P(A) P(B|A) P(C|A) P(D|B) P(E|B) P(F|C) P(G|C) P(H|F)

Calculate the number of values that are needed to be stored for the conditional probability tables of this Bayesian network.

A

2 for each P(x|y)

1 for P(z)

So (2 × 7) + 1 = 15

77
Q

P(A,B,C,D,E,F,G,H) = P(A) P(B|A) P(C|A) P(D|B) P(E|B) P(F|C) P(G|C) P(H|F)

Calculate the number of values that are needed to be stored to specify the full joint distribution (i.e. an 8-dimensional Bernoulli distribution).

A

Number of variables = 8 (A–H)

Full joint distribution = 2^8 = 256 values

78
Q

A random forest is an example of an ensemble learning method. (i) What is the base learner in a random forest?

A

Decision tree

79
Q

(ii) Explain briefly how ensemble learning works (training and testing).

A

Ensemble learning works by training a collection of simple ML models on different subsets (resamplings) of the original data. At test time, the predictions of these varying simple models are combined (e.g. by voting or averaging) into one “aggregated” prediction.

80
Q

What is the main potential benefit of ensemble learning, in terms of the bias/variance error decomposition? Explain how ensemble learning achieves this benefit.

A

Bagging: reduces the variance as more models are added.

Boosting: Boosting reduces the bias of the overall model by focusing on previously misclassified data. This is done by increasing the chance that a previously misclassified point is resampled and is appropriately classified.

81
Q

Explain what the curve and shaded region in Figure 6 represent. What does the model give as an output for a test input?

A

The curve is the predictive mean, and the shaded region shows how confident the model is around a particular input x. This is the posterior output, in which every finite subset of inputs is jointly represented by a multivariate Gaussian distribution.

A GP is a method of regression: given a test input, the model outputs a Gaussian distribution for that point (a predictive mean and variance), which gives insight into how certain the prediction at that point is.

82
Q

Explain why the curve and shaded region in Figure 6 have this shape by referring to the data and the covariance function of the Gaussian Process

A

Before Figure 6 the model is represented as a prior over functions; following the observation of some data, Figure 6 shows the resulting posterior.

As more data is seen, the confidence at those points increases, which results in a narrower shaded region there. The covariance function controls how the function space behaves, e.g. through the characteristic length-scale, i.e. whether the functions are smoother or noisier.

83
Q

The pseudocode algorithms in this question is taken from the Alpaydin textbook.

(a) In the pseudocode, what does N represent?

A

The number of data points

84
Q

(b) In the pseudocode, what does d represent?

A

The number of features

85
Q

(c) As stated in the caption, Figure 11.3 is the online version of Figure 10.8. Explain what this means in terms of gradient descent optimisation.

A

Online version: the weights are updated after each individual data point rather than after a full pass over the dataset. In gradient descent this may help escape from local minima, and it often converges faster.

86
Q

What is the learning rate parameter in these algorithms?

If the learning rate is too large, what effects might be observed during training?

A

The η (eta) factor in the weight-update rule.

If the learning rate is too large, the algorithm may overshoot and fail to converge to the local optimum (the error can oscillate or diverge).

87
Q

Figure 11.11 shows pseudocode for backpropagation in a multi-layer perceptron with a single hidden layer. What symbol represents the weights connecting the inputs to hidden units?

A

The w values (in this pseudocode, v denotes the weights from the hidden units to the outputs).

88
Q

If we modify the code in Figure 11.11 (as the caption suggests) to perform two-class classification, which lines need to be modified? Write out the modifications

A

Change lines 6–8 so the output is computed as y_i = sigmoid(v_iᵀ z), i.e. a sigmoid output for the two-class case.

89
Q

Does Figure 11.11 describe online or batch backpropagation? Explain your answer by referring to the pseudocode

A

Online (stochastic, since the data points are processed in random order): the weight updates for the output and hidden layers are done for each data point, instead of once per whole pass over the dataset.

90
Q

(i) What do the “ReLU” rows in this output represent?

A

Using the Rectified Linear Unit Activation function

91
Q

(ii) Which layers contain adaptive weights that are modified during training?

A

The convolution layers (adaptive filters) and the fully connected layers (adaptive weights); no other layers contain weights that are modified during training.

92
Q

(iii) What happens in the “Dropout” layers?

A

A random fraction of the nodes is temporarily discarded (set to zero) during each training update.

93
Q

Bayesian networks make use of a directed acyclic graph to describe a multivariate probability distribution

A

True

94
Q

Boosting can be used to build an ensemble of learners for either classification or regression problems

A

True

95
Q

Support vector machines build a model of the probability distribution of each class in a classification problem

A

False

96
Q

Bayes Rule can be used to produce a classifier from a non-parametric density estimator.

A

True

97
Q

Classifiers with a high predictive variance will always have a low predictive bias

A

True

98
Q

Using the general idea of Bayesian learning, explain the relationship between the hyperparameter values and the prior distribution in a Gaussian Process regression model.

A

The covariance function specifies the prior in a GP. The hyperparameters are the parameters of the covariance function, so the hyperparameters change the prior. This describes the types of functions we expect to see before observing any data.

99
Q

Which of the Gaussian Processes in Figures 9-11 has the highest training set error? Which Gaussian Process do you think will have the highest test set error? Explain why briefly

A

Highest training set error: the GP with the biggest difference between the GP mean (line) and the data.

Highest test set error: probably the last one (it is highly non-smooth and overfitting the data).