Intro Flashcards
Why is machine learning popular?
-Lots of data available
-current control theory methods struggle to solve large scale complex problems
What are the types of supervised learning?
regression and classification
what are the types of unsupervised learning?
clustering and dimensionality reduction
what are the types of reinforcement learning?
Value iteration and policy iteration
What is supervised learning?
learning a function that maps an input to an output from labelled example input-output pairs
what is unsupervised learning?
an algorithm that learns patterns from unlabelled data
What is the key difference between regression and classification?
in regression the output is continuous, whereas in classification the output is discrete
How does regression work?
find a function that minimises a cost function (most often mean squared error)
Describe a nearest neighbour model?
each data point is assigned a value or class based on the training points closest to it
Describe a piecewise linear model
data follows different linear trends over different regions of the data
What are some model types?
Linear, low order polynomial, high order polynomial, piecewise linear, nearest neighbour
When does overfitting occur?
- when a model fits the data set too well and is unable to generalise
- low density of data
What is a characteristic of overfitting?
oversensitivity to measurement noise
How can overfitting be avoided?
do not use a model that is more complicated than required (Occam’s razor)
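A minimal sketch of the point above, assuming made-up data from a straight line with added noise (the data, seed and polynomial degrees are illustrative choices, not from the course). It fits a simple and a needlessly flexible model to the same points so their training and test errors can be compared.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data (an assumption): a straight line y = 2x plus measurement noise
x_train = np.linspace(0.0, 1.0, 8)
y_train = 2.0 * x_train + rng.normal(scale=0.1, size=x_train.size)
x_test = np.linspace(0.05, 0.95, 50)
y_test = 2.0 * x_test

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

fit_simple = np.polyfit(x_train, y_train, deg=1)     # matches the true structure
fit_flexible = np.polyfit(x_train, y_train, deg=7)   # enough freedom to pass through all 8 noisy points

train_simple = mse(fit_simple, x_train, y_train)
train_flexible = mse(fit_flexible, x_train, y_train)
test_simple = mse(fit_simple, x_test, y_test)
test_flexible = mse(fit_flexible, x_test, y_test)

print(train_simple, train_flexible)  # the degree-7 fit has the lower training error
print(test_simple, test_flexible)    # compare these to see how each model generalises
```

A lower training error on its own does not show that the more complex model is better; the comparison on unseen points is what reveals overfitting.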
What is a white box model?
-increased system information
-low model uncertainty
What is a black box model?
-decreased system information
-high model uncertainty
what is inference?
the process in which prediction is made
What does the expected mean square error of the prediction depend on?
bias and variance
what is meant by high bias?
model fails to capture the underlying structure of the data (underfitting)
what is meant by high variance?
model is sensitive to small fluctuations in the data (overfitting)
when is variance high?
in complex models
what is the bias-variance trade-off?
If bias is increased then variance tends to decrease, and vice versa. Both cannot be minimised at once, so the aim is to balance them so that the total expected error is as small as possible.
what is meant by error?
the error between the true value and the predicted value
what happens in simple linear regression?
identify a line of best fit y=a0+a1x+err, where a0 and a1 need to be determined
How do you find the a0 and a1 that minimise the sum of squared residuals?
-find stationary points by taking partial derivatives of sum of square residuals
-set to zero to solve optimization problem
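Carrying out those two steps gives the standard closed-form expressions for a0 and a1. A minimal sketch, assuming a small made-up data set (the numbers and variable names are illustrative only):

```python
import numpy as np

# Illustrative data (an assumption), roughly following y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Setting the partial derivatives of sum((y - a0 - a1*x)^2)
# with respect to a0 and a1 to zero and solving gives:
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()

print(a0, a1)
```

The result can be cross-checked against `np.polyfit(x, y, 1)`, which returns the slope first and the intercept second.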
What is the ordinary least squares (OLS) method?
appropriately select X, Y and theta, and solve
theta = (X^T X)^-1 X^T Y
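A minimal NumPy sketch of this formula, assuming noise-free made-up data from y = 1 + 2x (the data and variable names are illustrative). `np.linalg.solve` is used so that the inverse never has to be formed explicitly.

```python
import numpy as np

# Illustrative data (an assumption): noise-free samples of y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
Y = 1.0 + 2.0 * x

# Design (regressor) matrix: a column of ones for a0 and a column of x for a1
X = np.column_stack([np.ones_like(x), x])

# theta = (X^T X)^-1 X^T Y, solved as the linear system (X^T X) theta = X^T Y
theta = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta)  # approximately [1. 2.]
```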
What is the X matrix known as?
design matrix
regressor matrix
When does the OLS method not work for linear regression?
- If X^T X is not invertible, it cannot be solved
- X^T X is not invertible if the OLS problem has non-unique solutions
What is meant by collinearity?
two sequences of data x1 and x2 are said to be collinear if there exists k != 0 such that x1_i = k x2_i for every i
What occurs in the OLS if there is a pair of feature data sequences that are collinear?
the associated OLS problem has infinitely many optimal solutions
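A small sketch of why this happens, assuming two made-up feature sequences where one is exactly three times the other. With collinear columns the matrix X^T X loses rank, so it has no inverse.

```python
import numpy as np

# Illustrative features (an assumption): x2 is exactly 3 * x1, so the two are collinear
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 3.0 * x1
X = np.column_stack([x1, x2])

XtX = X.T @ X
rank = np.linalg.matrix_rank(XtX)
print(rank)  # 1, so the 2x2 matrix X^T X is singular
```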
when does collinearity occur?
when two feature variables are highly correlated providing redundant information
How can you deal with collinearity in data?
- increase the amount of training data
- find and remove highly correlated data
What are some issues with the OLS method?
- computing the inverse of X^T X can be computationally expensive
- if the data is close to being collinear then the OLS solution becomes very sensitive to small changes in the training data set
Which models can be fit using the OLS method?
those linear in parameters
How can the goodness of fit of a regression model be assessed?
Using R^2 coefficient
what does a R^2 value approximately equal to 1 indicate?
sum of squared error is small, therefore a good model
What does a negative R^2 value mean?
very bad model
what does a small positive R^2 value indicate?
bad model
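The three cases above can be reproduced numerically. This is a sketch assuming the usual definition R^2 = 1 - SS_res / SS_tot, with made-up target values; `r_squared` is a helper written for this example.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares about the mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])                  # illustrative targets (an assumption)
r2_perfect = r_squared(y, y)                        # 1: predictions match exactly
r2_mean = r_squared(y, np.full_like(y, y.mean()))   # 0: no better than predicting the mean
r2_bad = r_squared(y, y[::-1])                      # negative: worse than predicting the mean
```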
What is the Weierstrass Approximation Theorem?
For any function f that is continuous on a closed interval [a, b] and any E > 0, there exists a polynomial p such that sup over x in [a, b] of |f(x) - p(x)| < E
What is regularization used for?
- Prevent overfitting to training data
- remove user choice from a model
Why do we use regularization?
- Increasing model parameters will fit the training data more accurately, but unnecessary terms can cause overfitting.
What is classification?
Supervised learning with discrete outputs (class labels)
What does a Bayes classifier do differently?
It constructs a probability distribution instead of a model
What is a perceptron?
An algorithm that describes the classification rules for hyperplanes in supervised learning of binary classifiers.
The simplest Neural Network
What issues arise in non-linear regressions?
no unique solution
settle for approximate solutions found iteratively (e.g. the Newton-Raphson method)
What does gradient descent do?
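Iteratively steps in the direction of the negative gradient to find a local minimum of the cost function; it can be used to solve the OLS problem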
What are the issues with high dimension feature space?
-Hard to visualise data in large dimensions
-OLS fails
How do you find Principal Components?
- Compute centred data matrix X~
- Compute X~^T X~
- Find orthonormal eigenvector of X~^T X~ with the largest eigenvalue
What are the principal components?
-The orthonormal vector direction with the largest variation in data
- The orthonormal vectors that define a linear manifold giving minimal reconstruction error
- The orthonormal eigenvectors of X~^T X~ with the largest eigenvalues
- The q columns of W corresponding to the q largest squared singular values, where the singular value decomposition of the regressor matrix is X = U Σ W^T
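The eigenvector and SVD descriptions give the same direction. A minimal sketch of both routes, assuming a tiny made-up data set with two features (values and variable names are illustrative):

```python
import numpy as np

# Illustrative data (an assumption): 5 samples, 2 features, lying close to the line x2 = x1
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]])

X_c = X - X.mean(axis=0)                 # centred data matrix X~

# Route 1: eigenvectors of X~^T X~ (eigh returns eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(X_c.T @ X_c)
pc1_eig = eigvecs[:, -1]                 # eigenvector with the largest eigenvalue

# Route 2: SVD X~ = U S W^T; the rows of Wt are the columns of W
U, S, Wt = np.linalg.svd(X_c, full_matrices=False)
pc1_svd = Wt[0]                          # singular values are sorted largest first
```

The two vectors agree up to sign, and the squared singular values equal the eigenvalues of X~^T X~.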
What is clustering?
A class of unsupervised learning methods that separates data into groups by similarity
What does the Weierstrass Theorem do? (in simpler terms)
provides a guarantee that certain functions can be approximated to arbitrarily high accuracy by a finite degree polynomial
provided the function is defined over a finite interval
when is a matrix non invertible?
when the determinant is equal to zero
What are the disadvantages of using polynomials in modelling?
- Many coefficients and parameters in high degree polynomials
- No guarantee of approximating discontinuous functions, e.g. tan(x)
- Slow convergence rates
- Polynomials tend towards infinity as |x| grows, which is unnatural system behaviour
How does regularization prevent overfitting?
It penalises unnecessary non-zero parameters to help prevent the model becoming oversensitive to noise in the training data
How do we select lambda in regularization?
sample candidate values of lambda (e.g. randomly) and select the one that performs best
performance is typically evaluated through cross-validation
What is the perceptron equation?
f(x)=sgn(w^Tx)
What is the structure of the perceptron?
inputs x, weights w, weighted sum of the inputs, step (sign) function, output
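A minimal sketch of that structure, assuming hand-picked weights rather than trained ones. The function name `perceptron` and the convention of mapping sgn(0) to +1 are choices made for this example.

```python
import numpy as np

def perceptron(x, w):
    """Weighted sum of the inputs followed by a step: f(x) = sgn(w^T x)."""
    return 1 if w @ x >= 0 else -1   # sgn(0) is mapped to +1 here (an assumed convention)

# Hypothetical weights; the first input is a constant 1 so w[0] acts as a bias
w = np.array([-1.0, 1.0, 1.0])
out_pos = perceptron(np.array([1.0, 1.0, 1.0]), w)   # w^T x = 1  -> +1
out_neg = perceptron(np.array([1.0, 0.0, 0.0]), w)   # w^T x = -1 -> -1
```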
How do you form a non-linear decision boundary?
Add more basis functions
What is the equation for a support vector machine?
f(x; θ) = θ_0 + Σ_{i∈S} θ_i K(x, x_i)
where S = {indices of support vectors} and K: ℝ^n × ℝ^n → ℝ is a kernel function
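A sketch of evaluating this function, assuming a Gaussian (RBF) kernel as one common choice of K. The support vectors and coefficients below are hypothetical values picked by hand, not the result of training.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2), one common choice of K."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_decision(x, theta0, thetas, support_vectors, kernel=rbf_kernel):
    """theta0 plus the sum over support vectors of theta_i * K(x, x_i)."""
    return theta0 + sum(t * kernel(x, sv) for t, sv in zip(thetas, support_vectors))

# Hypothetical support vectors and coefficients (assumptions for illustration)
support_vectors = np.array([[0.0, 0.0], [2.0, 2.0]])
thetas = np.array([-1.0, 1.0])

f_at_second = svm_decision(np.array([2.0, 2.0]), 0.0, thetas, support_vectors)  # positive
f_at_first = svm_decision(np.array([0.0, 0.0]), 0.0, thetas, support_vectors)   # negative
```

The sign of the returned value gives the predicted class, so only the support vectors need to be stored to classify new points.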
What are the advantages of linear/logistic regression over support vector machines?
- can adjust threshold and shape TPR and FPR
- Get a probabilistic interpretation
What are the advantages of support vector machines over logistic regression?
- good against noise far away from true decision boundary
- perfectly separates data when possible
Why is gradient descent a useful algorithm?
(X^T X)^-1 doesn’t have to be computed, therefore it is computationally cheaper
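A minimal sketch of gradient descent on the mean squared error of a linear model, assuming noise-free made-up data from y = 1 + 2x. The step size and iteration count are hand-picked for this data and would need tuning elsewhere.

```python
import numpy as np

# Illustrative data (an assumption): noise-free samples of y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
Y = 1.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x])

theta = np.zeros(2)      # initial guess
step = 0.05              # step size (learning rate), chosen by hand for this data
for _ in range(5000):
    # Gradient of the mean squared error (1/n) * ||X theta - Y||^2
    grad = (2.0 / len(Y)) * (X.T @ (X @ theta - Y))
    theta = theta - step * grad

print(theta)  # approximately [1. 2.]
```

Each iteration only needs matrix-vector products, so no matrix is ever inverted.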
What do convex functions have?
Any local minimum is also a global minimum (and it is unique if the function is strictly convex)
What is the equation for calculating Term Frequency?
TF= number of times the term appears in text / Total number of terms in text
What is the equation for Inverse Document Frequency?
IDF = log10(number of documents / number of documents containing the term)
What is the equation for TFIDF?
TFIDF = TF x IDF
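The three formulas can be combined in a few lines. This is a sketch assuming a made-up corpus of three tiny documents; the helper names `tf`, `idf` and `tfidf` are chosen for this example.

```python
import math

# Illustrative corpus (an assumption): three tiny "documents", already split into terms
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the bird flew".split(),
]

def tf(term, doc):
    """Number of times the term appears in the text / total number of terms in the text."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log10(number of documents / number of documents containing the term)."""
    n_with_term = sum(term in d for d in docs)
    return math.log10(len(docs) / n_with_term)   # assumes the term occurs in at least one document

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_the = tfidf("the", docs[0], docs)   # "the" is in every document, so IDF = log10(1) = 0
score_cat = tfidf("cat", docs[0], docs)   # "cat" is in one of three documents, so IDF = log10(3)
```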
What does a low TFIDF indicate?
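words that are common across the documents (or appear rarely in the text); rare, distinctive words have a high TFIDF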
Why do we use PCA?
to reduce the dimension of the feature space, because it is computationally expensive to solve OLS for high-dimensional data sets
What is the issue with large dimension data?
many optimal models with MSE=0.
X^T X will typically be non-invertible
cannot use OLS
Why do we use unsupervised learning?
Most data sets are unlabelled
it is costly to label data sets
What is the Singular Value decomposition of X?
X = U Σ W^T
U => unitary matrix where U^T U = I
W => unitary matrix where W^T W = I
Σ => diagonal matrix with non-negative elements ordered largest to smallest
What are the columns of W in SVD?
the eigenvectors of the X^T X matrix
What is XTX as a SVD?
X^T X = W Σ^T U^T U Σ W^T
= W Σ^T Σ W^T
If a data point is equidistant from two cluster centres, how do you choose which one to assign it to?
The one with the lowest index, by convention
What is the average dissimilarity in a cluster equation?
(1 / number of elements in the cluster) x sum of the squared L2 (Euclidean) norms of the differences between the cluster centre and each data point in the cluster
What is the K-Means algorithm?
- Randomly assign a number from 1 to K to each data point
- Iterate until the cluster assignments stop changing:
  - Compute the centroid of each cluster
  - Update each data point's cluster assignment to the closest cluster centre
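The steps above can be sketched in NumPy as follows. This is an illustrative implementation under some assumptions: clusters are numbered 0 to K-1 rather than 1 to K, an empty cluster is re-seeded with a random data point, and the data set is made up.

```python
import numpy as np

def k_means(X, K, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))      # random initial assignment (clusters 0..K-1)
    for _ in range(max_iter):
        # Centroid of each cluster; an empty cluster is re-seeded with a random data point
        centres = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else X[rng.integers(len(X))]
            for k in range(K)
        ])
        # Squared Euclidean distance from every point to every centre
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)            # ties go to the lowest cluster index
        if np.array_equal(new_labels, labels):    # assignments stopped changing
            break
        labels = new_labels
    return labels, centres

# Illustrative data (an assumption): two well-separated groups of three points
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centres = k_means(X, K=2)
print(labels)
```

Which group receives which label depends on the random initial assignment, so only the grouping itself is meaningful.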
What are the advantages of the K-Means algorithm?
More computationally efficient than brute force method
What are the disadvantages of the K-Means algorithm?
- must select number of clusters
- Doesn’t necessarily converge to optimal clusters
- cannot handle non-convex clusters
In ARX models how do you obtain an unbiased estimate from the least squares solution?
the estimate is unbiased if the regressor matrix Psi is not corrupted by noise
How is an AR model displayed?
AR(ny)
How is an ARX model displayed?
ARX(ny, nu)
How is an ARMAX model displayed?
ARMAX(ny, ne, nu)
What does the Moving Average mean in an ARMAX model?
The model is dependent on delayed error/noise.
What would applying OLS to an ARMAX model result in?
A biased estimation
How would you show ARX model is unbiased?
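take the expectation of the OLS solution: E(theta_hat) = E((X^T X)^-1 X^T Y)
substitute Y = X theta* + e
if e is zero-mean and uncorrelated with X, this reduces to E(theta_hat) = theta*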
What is the OLS solution with L2 regularization?
theta = (X^T X + lambda I)^-1 X^T Y
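A minimal sketch of this solution, assuming made-up data with two exactly collinear features so that plain OLS would fail. The value of lambda is picked arbitrarily for illustration.

```python
import numpy as np

# Illustrative data (an assumption): two exactly collinear features, so X^T X is singular
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([x1, 2.0 * x1])
Y = np.array([1.0, 2.0, 3.0, 4.0])

lam = 0.1   # regularization weight lambda, picked arbitrarily here

# Adding lambda * I makes the matrix invertible even though X^T X alone is not
theta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
print(theta)
```

Of the infinitely many parameter vectors that fit this data, the penalty favours one with a small norm, which is why the system has a unique solution.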