Intro Flashcards

1
Q

Why is machine learning popular?

A

-Lots of data is available
-Current control theory methods struggle to solve large-scale, complex problems

2
Q

What are the types of supervised learning?

A

regression and classification

3
Q

what are the types of unsupervised learning?

A

clustering and dimensionality reduction

4
Q

what are the types of reinforcement learning?

A

Value iteration and policy iteration

5
Q

What is supervised learning?

A

Learning a function that maps an input to an output based on labelled example input-output pairs

6
Q

what is unsupervised learning?

A

An algorithm that learns patterns from unlabelled data

7
Q

What is the key difference between regression and classification?

A

In regression the output data is continuous, whereas in classification the output data is discrete

8
Q

How does regression work?

A

find a function that minimises a cost function (most often mean squared error)

9
Q

Describe a nearest neighbour model?

A

Each data point is grouped according to its proximity to its nearest neighbours in the training data

10
Q

Describe a piecewise linear model

A

The data follows different linear trends over different regions of the input space

11
Q

What are some model types?

A

Linear, low order polynomial, high order polynomial, piecewise linear, nearest neighbour

12
Q

When does overfitting occur?

A
  • when a model fits the training data set too closely and is unable to generalise
  • when the density of data is low
13
Q

What is a characteristic of overfitting?

A

oversensitivity to measurement noise

14
Q

How can overfitting be avoided?

A

do not use a model that is more complicated than required (Occam’s razor)

15
Q

What is a white box model?

A

-increased system information
-low model uncertainty

16
Q

What is a black box model?

A

-decreased system information
-high model uncertainty

17
Q

what is inference?

A

The process by which predictions are made from a trained model

18
Q

What does the expected mean square error of the prediction depend on?

A

bias and variance

19
Q

what is meant by high bias?

A

model fails to capture the underlying structure of the data (underfitting)

20
Q

what is meant by high variance?

A

model is sensitive to small fluctuations in the data (overfitting)

21
Q

when is variance high?

A

in complex models

22
Q

what is the bias-variance trade-off?

A

If bias is increased then variance decreases, and vice versa. We therefore need to minimise both bias and variance together.

23
Q

what is meant by error?

A

The difference between the true value and the predicted value

24
Q

what happens in simple linear regression?

A

identify a line of best fit y = a0 + a1 x + err, where a0 and a1 need to be determined

25
Q

How do you find a0 and a1 that minimises the sums of square errors of residuals?

A

-Find the stationary point by taking the partial derivatives of the sum of squared residuals with respect to a0 and a1
-Set the partial derivatives to zero and solve the resulting optimisation problem (see the sketch below)
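
Setting both partial derivatives to zero gives closed-form expressions for a0 and a1. A minimal Python sketch with made-up data (the values are illustrative assumptions, not from the notes):

```python
import numpy as np

# Hypothetical data for y = a0 + a1*x + err
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.9, 3.1, 5.0, 7.2, 8.8])

# Closed-form estimates obtained by setting the partial derivatives of the
# sum of squared residuals to zero
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()
print(a0, a1)  # intercept and slope of the fitted line
```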

26
Q

What is the ordinary least squares (OLS) method?

A

Appropriately construct X, Y and theta, and solve
theta = (X^T X)^-1 (X^T Y)
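
A minimal NumPy sketch of this formula on assumed toy data; the normal equations are solved directly here purely for illustration:

```python
import numpy as np

# Assumed toy data: one feature plus an intercept column.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
X = np.column_stack([np.ones_like(x), x])   # design / regressor matrix

# theta = (X^T X)^-1 (X^T Y), computed without forming the inverse explicitly
theta = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta)  # [intercept, slope]
```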

27
Q

What is the X matrix known as?

A

design matrix
regressor matrix

28
Q

When does the OLS method not work for linear regression?

A
  • If X^T X is not invertible, the normal equations cannot be solved
  • X^T X is not invertible if the OLS problem has non-unique solutions
29
Q

What is meant by collinearity?

A

Two sequences of data are said to be collinear if there exists k ≠ 0 such that x_1i = k x_2i for all i
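
A tiny numerical illustration with assumed data: when one feature is a scalar multiple of another, X^T X becomes singular, so the OLS inverse does not exist.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2.0 * x1                       # collinear: x2_i = k * x1_i with k = 2
X = np.column_stack([x1, x2])

XtX = X.T @ X
print(np.linalg.det(XtX))           # ~0: X^T X is singular
print(np.linalg.matrix_rank(XtX))   # rank 1 < 2, so (X^T X)^-1 does not exist
```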

30
Q

What occurs in the OLS if there is a pair of feature data sequences that are collinear?

A

the associated OLS problem has infinitely many optimal solutions

31
Q

when does collinearity occur?

A

when two feature variables are highly correlated providing redundant information

32
Q

How can you deal with collinearity in data?

A
  • increase the amount of training data
  • find and remove highly correlated data
33
Q

What are some issues with the OLS method?

A
  • Computing the inverse of X^T X can be computationally expensive
  • If the data is close to being collinear, the OLS solution becomes very sensitive to small changes in the training data set
34
Q

Which models can be fit using the OLS method?

A

Models that are linear in their parameters (they may still be non-linear in the features)

35
Q

How can the goodness of fit of a regression model be assessed?

A

Using R^2 coefficient
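
A short sketch of the usual R^2 = 1 - SS_res/SS_tot definition (assumed values, standard formula):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # assumed targets
y_pred = np.array([1.1, 1.9, 3.2, 3.9])   # assumed model predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # sum of squared errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot
print(r2)  # near 1 for a good fit; negative for a very bad model
```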

36
Q

what does an R^2 value approximately equal to 1 indicate?

A

The sum of squared errors is small, indicating a good model

37
Q

What does a negative R^2 value mean?

A

A very bad model (worse than simply predicting the mean of the data)

38
Q

what does a small positive R^2 value indicate?

A

bad model

39
Q

What is the Weierstrass Approximation Theorem?

A

For any function f that is continuous on a closed interval [a, b], and any ε > 0, there exists a polynomial p such that sup_x∈[a,b] |f(x) − p(x)| < ε

40
Q

What is regularization used for?

A
  • Prevent overfitting to training data
  • remove user choice from a model
41
Q

Why do we use regularization?

A
  • Increasing the number of model parameters fits the training data more accurately, but unnecessary terms can cause overfitting.
42
Q

What is classification?

A

Supervised learning where the output data is discrete (class labels)

43
Q

What does a Bayes classifier do differently?

A

It constructs a probability distribution over the classes rather than a deterministic model

44
Q

What is a perceptron?

A

An algorithm for the supervised learning of binary classifiers whose decision boundary is a hyperplane.
The simplest neural network

45
Q

What issues arise in non-linear regressions?

A

-There is no unique solution
-We settle for approximate, iterative solutions (e.g. the Newton-Raphson method)

46
Q

What does gradient descent do?

A

Iteratively identifies a local minimum of the cost function; it can be used to solve the OLS problem

47
Q

What are the issues with high dimension feature space?

A

-Hard to visualise data in large dimensions
-OLS fails

48
Q

How do you find Principal Components?

A
  1. Compute the centred data matrix X~
  2. Compute X~^T X~
  3. Find the orthonormal eigenvector of X~^T X~ with the largest eigenvalue (see the sketch below)
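
A minimal NumPy sketch of these three steps on an assumed small data matrix (rows are samples, columns are features):

```python
import numpy as np

X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])               # assumed data

Xc = X - X.mean(axis=0)                  # 1. centred data matrix X~
C = Xc.T @ Xc                            # 2. X~^T X~
eigvals, eigvecs = np.linalg.eigh(C)     # 3. orthonormal eigenvectors (symmetric matrix)
pc1 = eigvecs[:, np.argmax(eigvals)]     #    eigenvector with the largest eigenvalue
print(pc1)                               # first principal component direction
```
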
49
Q

What are the principal components?

A

-The orthonormal vector directions with the largest variation in the data
-The orthonormal vectors that define a linear manifold giving minimal reconstruction error
-The orthonormal eigenvectors of X~^T X~ with the largest eigenvalues
-The q columns of W corresponding to the q largest squared singular values, where the singular value decomposition of the regressor matrix is X = UΣW^T

50
Q

What is clustering?

A

A class of unsupervised learning methods that separates data into groups by similarity

51
Q

What does the Weierstrass Theorem do? (in simpler terms)

A

It guarantees that a continuous function can be approximated to arbitrarily high accuracy by a finite-degree polynomial, provided the function is defined over a finite (closed) interval

52
Q

when is a matrix non invertible?

A

when the determinant is equal to zero

53
Q

What are the disadvantages of using polynomials in modelling?

A
  • Many coefficients and parameters in high-degree polynomials
  • No guarantee of approximating discontinuous functions, e.g. tan(x)
  • Slow convergence rates
  • Polynomials tend towards infinity, which is unnatural system behaviour
54
Q

How does regularization prevent overfitting?

A

It penalises unnecessary non-zero parameters to help prevent the model becoming oversensitive to noise in the training data

55
Q

How do we select lambda in regularization?

A

Randomly sample candidate values of lambda to find the optimal one
Performance is typically evaluated through cross-validation

56
Q

What is the perceptron equation?

A

f(x)=sgn(w^Tx)
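
A minimal sketch of this rule together with the classic perceptron update, on assumed linearly separable data (the bias is folded into w via a constant feature of 1):

```python
import numpy as np

# Assumed toy data: two features plus a constant 1 for the bias term.
X = np.array([[ 1.0,  2.0, 1.0],
              [ 2.0,  1.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])

w = np.zeros(3)
for _ in range(10):                   # a few passes over the data
    for xi, yi in zip(X, y):
        if np.sign(w @ xi) != yi:     # misclassified point
            w += yi * xi              # perceptron learning rule
print(np.sign(X @ w))                 # predictions f(x) = sgn(w^T x)
```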

57
Q

What is the structure of the perceptron?

A

x, w, sum everything, step, output

58
Q

How do you form a non-linear decision boundary?

A

Add more basis functions

59
Q

What is the equation for a support vector machine?

A

𝑓(𝑥;𝜃)=𝜃0 + ∑ i∈S 𝜃i 𝐾(𝑥,𝑥i)
where 𝑆={indices of support vectors} and K:ℝn×ℝn→ℝ are kernel functions
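
A small sketch of evaluating this decision function with an RBF kernel; the support vectors and coefficients below are made-up assumptions (this is not a trained SVM):

```python
import numpy as np

def rbf_kernel(x, xi, gamma=1.0):
    # K(x, x_i) = exp(-gamma * ||x - x_i||^2)
    return np.exp(-gamma * np.sum((x - xi) ** 2))

support_vectors = np.array([[1.0, 1.0], [-1.0, -1.0]])   # assumed x_i, i in S
theta = np.array([0.8, -0.8])                            # assumed theta_i
theta0 = 0.1                                             # assumed theta_0

def f(x):
    # f(x; theta) = theta_0 + sum over i in S of theta_i * K(x, x_i)
    return theta0 + sum(t * rbf_kernel(x, sv)
                        for t, sv in zip(theta, support_vectors))

print(np.sign(f(np.array([0.9, 1.2]))))   # class prediction for a new point
```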

60
Q

What are the advantages of linear/logistic regression over support vector machines?

A
  • Can adjust the decision threshold to shape the TPR and FPR
  • Gives a probabilistic interpretation
61
Q

What are the advantages of support vector machines over logistic regression?

A
  • Robust to noise far away from the true decision boundary
  • Perfectly separates the data when possible
62
Q

Why is gradient descent a useful algorithm?

A

(X^T X)^-1 does not have to be computed, so it is computationally cheaper
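
A minimal gradient-descent sketch for the least-squares cost that avoids the matrix inverse entirely (learning rate and iteration count are assumptions):

```python
import numpy as np

# Assumed data: design matrix with an intercept column, and targets y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = np.zeros(2)
lr = 0.05                                   # assumed learning rate
for _ in range(5000):
    grad = 2.0 * X.T @ (X @ theta - y)      # gradient of ||X theta - y||^2
    theta -= lr * grad / len(y)             # step in the negative gradient direction
print(theta)                                # converges towards the OLS solution [1, 2]
```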

63
Q

What do convex functions have?

A

A unique global minimum

64
Q

What is the equation for calculating Term Frequency?

A

TF= number of times the term appears in text / Total number of terms in text

65
Q

What is the equation for Inverse Document Frequency?

A

IDF = log10(number of documents / number of documents containing the term)

66
Q

What is the equation for TFIDF?

A

TFIDF = TF x IDF
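
A small sketch combining the three formulas above on an assumed toy corpus:

```python
import math

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
term = "cat"
words = documents[0].split()

tf = words.count(term) / len(words)                      # term frequency
docs_with_term = sum(term in d.split() for d in documents)
idf = math.log10(len(documents) / docs_with_term)        # inverse document frequency
tfidf = tf * idf
print(tf, idf, tfidf)
```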

67
Q

What does a low TFIDF indicate?

A

Common, uninformative words: terms that appear in most documents have an IDF close to zero, so their TFIDF is low

68
Q

Why do we use PCA?

A

because it is computationally expensive to solve OLS for large data sets

69
Q

What is the issue with large dimension data?

A

-Many optimal models with MSE = 0
-X^T X will typically be non-invertible
-So the OLS method cannot be used

70
Q

Why do we use unsupervised learning?

A

Most data sets are unlabelled
it is costly to label data sets

71
Q

What is the Singular Value decomposition of X?

A

X = UΣW^T
U => unitary matrix where U^T U = I
W => unitary matrix where W^T W = I
Σ => diagonal matrix with non-negative elements ordered largest to smallest
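
A quick NumPy check of these properties on an assumed small matrix (np.linalg.svd returns W^T directly):

```python
import numpy as np

X = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])                           # assumed data

U, s, Wt = np.linalg.svd(X, full_matrices=False)     # X = U Sigma W^T
Sigma = np.diag(s)                                   # singular values, largest first

print(np.allclose(X, U @ Sigma @ Wt))                # reconstruction holds
print(np.allclose(U.T @ U, np.eye(2)))               # U^T U = I
print(np.allclose(Wt @ Wt.T, np.eye(2)))             # W^T W = I
```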

72
Q

What are the columns of W in SVD?

A

the orthonormal eigenvectors of the X^T X matrix

73
Q

What is XTX as a SVD?

A

X^T X = WΣ^T U^T U ΣW^T
= WΣ^T ΣW^T (since U^T U = I)

74
Q

If a data point is equidistant from two cluster centres, how do you choose which one to assign it to?

A

The cluster with the lowest index, by convention

75
Q

What is the average dissimilarity in a cluster equation?

A

(1 / number of elements in the cluster) × the sum of squared L2 (Euclidean) norms between the cluster centre and each data point in the cluster

76
Q

What is the K-Means algorithm?

A
  1. Randomly assign a number from 1 to K to each data point
  2. Iterate until the cluster assignments stop changing (see the sketch below):
    • Compute the centroid for each cluster
    • Update each cluster assignment to the closest cluster centre
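
A compact NumPy sketch of these steps with an assumed K and toy data (a fixed number of iterations is used here instead of an explicit convergence check):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)),   # assumed toy data: two blobs
               rng.normal(5.0, 0.5, (20, 2))])
K = 2

labels = rng.integers(0, K, len(X))             # 1. random initial assignment
for _ in range(10):                             # 2. iterate
    # compute the centroid of each cluster
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # update each assignment to the closest cluster centre
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
print(centroids)
```
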
77
Q

What are the advantages of the K-Means algorithm?

A

More computationally efficient than brute force method

78
Q

What are the disadvantages of the K-Means algorithm?

A
  • must select number of clusters
  • Doesn’t necessarily converge to optimal clusters
  • cannot handle non-convex clusters
79
Q

In ARX models how do you obtain an unbiased estimate from the least squares solution?

A

If the regressor matrix Psi does not contain any noise terms, the least squares estimate is unbiased

80
Q

How is a AR model displayed?

A

AR(ny)

81
Q

How is an ARX model displayed?

A

ARX(ny, nu)

82
Q

How is an ARMAX model displayed?

A

ARMAX(ny, ne, nu)

83
Q

What does the Moving Average mean in an ARMAX model?

A

The model is dependent on delayed error/noise.

84
Q

What would applying OLS to an ARMAX model result in?

A

A biased estimate

85
Q

How would you show ARX model is unbiased?

A

Take the expectation of the OLS solution: E[theta_hat] = E[(X^T X)^-1 X^T Y]
Substitute Y = X theta* + e
This gives E[theta_hat] = theta* + (X^T X)^-1 X^T E[e] = theta*, since E[e] = 0

86
Q

What is the OLS solution with L2 regularization?

A

theta = (X^T X + lambda I)^-1 X^T Y
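
A minimal NumPy sketch of this regularized solution (data and lambda are assumed for illustration):

```python
import numpy as np

# Assumed design matrix (with an intercept column), targets, and lambda.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
Y = np.array([1.0, 3.0, 5.0, 7.0])
lam = 0.1

# theta = (X^T X + lambda * I)^-1 X^T Y
theta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
print(theta)
```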