Learning From Data Flashcards
Difference between Classification and Regression
classification is about predicting a label
regression is about predicting a quantity
Classification is the task of predicting a discrete class label. Regression is the task of predicting a continuous quantity.
multi-class classification problem
A problem with more than two classes
multi-label classification problem.
A problem where an example is assigned multiple classes
datum (plural: data)
1) a piece of information
2) a fixed starting point of a scale or operation
k nearest neighbours
1) find the k nearest neighbours to x in the training data
2) assign x to the class that is most common among those k neighbours
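A minimal sketch of this procedure in Python, assuming X_train, y_train and the query point x are NumPy arrays (the names are illustrative, not from the card):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    # 1) find the k nearest neighbours of x in the training data
    distances = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to every training point
    nearest = np.argsort(distances)[:k]               # indices of the k closest points
    # 2) assign x to the class most common among those k neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]
```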
Unsupervised Learning
Unsupervised learning is where you only have input data (X) and no corresponding output variables.
The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.
Given: data
D = {x_n}, n = 1, ..., N
and a parameterised generative model describing how the data might be generated, p(x; w), depending on parameters w.
Supervised Learning
Supervised learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs
Y = f(X)
The goal is to approximate the mapping function so well that, given new input data (x), you can predict the output variables (Y) for that data.
Hyperparameter
a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training
Multivariate data
More than one variable is measured on each individual in a sample
Centroid
The vector whose entries are the means of each variable
Properties of data that has been sphered
for each variable
mean=0
variance=1
all the variables are mutually uncorrelated
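A minimal sketch of sphering (whitening) a data matrix, assuming observations are rows and using NumPy; the function name is an assumption for illustration:

```python
import numpy as np

def sphere(X):
    """Return a sphered copy of X: each variable has mean 0, variance 1, and all variables are uncorrelated."""
    Xc = X - X.mean(axis=0)                  # centre each variable (mean = 0)
    cov = np.cov(Xc, rowvar=False)           # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition of the covariance
    # rotate onto the eigenvectors and rescale each direction to unit variance
    return Xc @ eigvecs / np.sqrt(eigvals)
```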
Disadvantages to Euclidean Distance
Euclidean distance is popular for numerical data, but:
it gives equal weight to all variables
it disregards correlations between variables
Reasons for sphering
Sphering the data puts the variables on an equal footing and removes (linear) correlations
What can i.i.d. stand for?
independent and identically distributed
Examples of Unsupervised Learning methods
Clustering
Gaussian mixture models
Principal Component Analysis
Kohonen maps (SOMs)
Deterministic Model
In deterministic models, the output of the model is fully determined by the parameter values and the initial conditions.
Main aim of classification
Train a machine F to map features to targets
Main aim of regression
Train a machine F to map features to continuous targets
What are the different types/formats of variables?
numerical: continuous or discrete
categorical: nominal or ordinal
binary: presence/absence or 2-state categorical
In a data matrix, what does X_nd refer to?
X_nd is the value of the dth variable for the nth individual,
i.e. observations are rows.
How to measure association between 2 variables?
Covariance, (S_12)^2
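In symbols, one common convention for the sample covariance between variables 1 and 2 (the card writes it as (S_12)^2; the 1/N normaliser is an assumption, as some texts use 1/(N-1)):

$$\mathrm{Cov}(x_1, x_2) = \frac{1}{N}\sum_{n=1}^{N}\bigl(x_{n1} - \bar{x}_1\bigr)\bigl(x_{n2} - \bar{x}_2\bigr)$$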
Mean and variance of a standardised variable
Mean = 0 Variance = 1
‘Standardised measure of association’ between variables
Correlation coefficient, R_12
Does the correlation coefficient lie in a given range?
Yes
[-1,1]
Interpret what the values of the correlation coefficient imply.
R12 > 0 variables increase and decrease together
R12 < 0 one variable decreases as the other increases
R12 ≈ 0 variables not associated (roughly circular scatter diagram)
How to obtain the correlation coefficient?
obtained by dividing the covariance by the product of the standard deviations
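In symbols, with S_1 and S_2 the standard deviations of the two variables:

$$R_{12} = \frac{\mathrm{Cov}(x_1, x_2)}{S_1\, S_2}$$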
What is the main difference between the correlation coefficient and the covariance?
Correlation values are standardized whereas, covariance values are not.
Correlation is a special case of covariance: it is the covariance obtained when the data has been standardised.
Negative of using the covariance matrix?
The value of covariance is affected by the change in scale of the variables
What are the entries of the main diagonal on the correlation matrix?
The entries on the main diagonal of the correlation matrix are all 1, since the variables have been standardised (each variable has correlation 1 with itself)
Covariance of sphered variables?
Identity matrix
Is correlation a linear measure?
Yes; it measures only linear association between variables
What is the squared Mahalanobis distance equal to?
The squared Euclidean distance of the sphered data
Does the Mahalanobis distance use the covariance matrix or the correlation coefficient?
The covariance matrix; more precisely, its inverse.
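For reference, the squared Mahalanobis distance of a point x from the mean µ, using the inverse covariance matrix Σ⁻¹:

$$D_M^2(x) = (x - \mu)^{\top}\,\Sigma^{-1}\,(x - \mu)$$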
What are used as the parameters for multivariate normal density?
Mean VECTOR
COVARIANCE MATRIX
How would you estimate the parameters of a multivariate normal density from a sample?
Parameters µ and Σ estimated by maximum likelihood
OR Bayesian statistics.
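The maximum-likelihood estimates have standard closed forms (not spelled out on the card):

$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad \hat{\Sigma} = \frac{1}{N}\sum_{n=1}^{N} (x_n - \hat{\mu})(x_n - \hat{\mu})^{\top}$$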
2 usual components of supervised learning
Systematic: average response
Random: variability of observations
Likelihood function
The probability that a datum x was generated by the model p is the conditional probability p(x | w); regarded as a function of the parameters w, this is the likelihood.
The overall likelihood for all the data makes what assumption?
Independence of observations
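Under that independence assumption, the overall likelihood factorises into a product over the observations:

$$p(D \mid w) = \prod_{n=1}^{N} p(x_n \mid w)$$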
Error function for Regression with Gaussian noise
Sum-of-squares error (equivalently, mean squared error)
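In symbols, for targets t_n and model outputs f(x_n; w) (this notation is assumed, not from the card):

$$E(w) = \sum_{n=1}^{N}\bigl(t_n - f(x_n; w)\bigr)^2$$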
Error function for Classification
Cross entropy/log loss
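For binary targets t_n ∈ {0, 1} and predicted probabilities y_n = f(x_n; w) (notation assumed), the cross-entropy / log loss is:

$$E(w) = -\sum_{n=1}^{N}\bigl[t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\bigr]$$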
pseudo-inverse of X
X†X = I when X is square and invertible.
It gives the best (least-squares) approximation when X is rectangular or singular.
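When X has full column rank, the Moore-Penrose pseudo-inverse has the closed form below, and w = X†t is then the least-squares solution for linear regression:

$$X^{\dagger} = (X^{\top}X)^{-1}X^{\top}, \qquad w = X^{\dagger}\, t$$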
Negatives of the Linear Regression model
Contribution to least squares error is largest from targets with largest errors.
Susceptible to outliers.
p(t | x) is not always Gaussian
Error function for Regression with Laplacian noise
sum of absolute errors
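In symbols (same notation as the Gaussian-noise card above):

$$E(w) = \sum_{n=1}^{N}\bigl|\,t_n - f(x_n; w)\,\bigr|$$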
Approaches to Non-linear Regression
Transfer function
MLP - multi-layer perceptrons
Basis functions
Negative of MLP approach
May be difficult to learn
Examples of choices for basis functions (In Regression)
Fourier
Radial (Gaussian Radial Basis functions)
Wavelets
General/common properties of basis functions
Local; typically centred on (some of) the training data
General method of using basis functions in Non-linear regression
Apply the non-linear basis functions to the inputs first, then perform ordinary linear regression on the transformed features
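A minimal sketch of this approach with Gaussian radial basis functions centred on the training points, assuming NumPy arrays (the width parameter and function names are assumptions for illustration):

```python
import numpy as np

def rbf_design_matrix(X, centres, width=1.0):
    # phi[n, m] = exp(-||x_n - c_m||^2 / (2 * width^2)): non-linear features
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * width ** 2))

def fit_rbf_regression(X, t, width=1.0):
    # apply the non-linearity first, then linear regression on the features
    Phi = rbf_design_matrix(X, X, width)    # basis functions centred on the training data
    return np.linalg.pinv(Phi) @ t          # least-squares weights via the pseudo-inverse

def predict_rbf(X_new, X_train, w, width=1.0):
    return rbf_design_matrix(X_new, X_train, width) @ w
```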
Define generalisation error
In supervised learning applications.
It is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data
Consequence of too few and too many hidden units in the MLP model?
Too few - inflexible network, poor generalisation
Too many - over-fitting, poor generalisation
How can over-fitting be combatted?
Cross-validation
Outline the steps in Cross-validation and k-fold Cross-validation
- Divide the training data into two sets: training and validation (a surrogate test set)
- Train on training set
- Evaluate “test” error on validation set
- Adjust number of parameters/hidden units for best generalisation on validation set
k-fold Cross-validation
- reshuffle the data
- Randomly partition into k training and validation sets and average the validation error over all k sets.
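A minimal sketch of k-fold cross-validation in NumPy; `fit` and `error` stand in for whatever training routine and error measure are being used (placeholders, not from the cards):

```python
import numpy as np

def k_fold_cv(X, t, k, fit, error):
    """Average validation error over k random folds."""
    idx = np.random.permutation(len(X))          # reshuffle the data
    folds = np.array_split(idx, k)               # partition into k validation sets
    errors = []
    for i in range(k):
        val = folds[i]                           # this fold is the surrogate test set
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], t[train])          # train on the remaining folds
        errors.append(error(model, X[val], t[val]))  # evaluate "test" error on the validation fold
    return np.mean(errors)                       # average the validation error over all k sets
```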
Downside of k-fold cross validation?
k times more expensive than ordinary cross-validation
Not as good on small data sets
Define regularisation
Regularisation is the process of adding information in order to prevent over-fitting,
i.e. penalise overly complicated models via regularisation terms
Examples of regularisation terms
Weight decay regularisation
Minimum description length
Support Vector Machines
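A minimal sketch of weight-decay regularisation for linear regression (the ridge-regression closed form; `lam` is the regularisation strength, an assumed hyperparameter):

```python
import numpy as np

def ridge_fit(X, t, lam):
    # minimise the sum-of-squares error plus a weight-decay penalty lam * ||w||^2
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)
```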