Learning From Data Flashcards
Difference between Classification and Regression
classification is about predicting a label
regression is about predicting a quantity
Classification is the task of predicting a discrete class label. Regression is the task of predicting a continuous quantity.
multi-class classification problem
A problem with more than two classes
multi-label classification problem.
A problem where an example is assigned multiple classes
datum (plural: data)
a piece of information
a fixed starting point of a scale or operation
k nearest neighbours
1) find the k nearest neighbors to x in the training data
2) assign x to the class that is most common among those k nearest neighbours
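A minimal NumPy sketch of this procedure (function and variable names are illustrative, not from the course):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # 1) distances from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # 2) indices of the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # 3) majority vote among their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# example usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))  # -> 1
```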
Unsupervised Learning
Unsupervised learning is where you only have input data (X) and no corresponding output variables.
The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.
Given: data D = {x_n}, n = 1, . . . , N,
and a parameterised generative model describing how the data might be generated, p(x; w), depending on parameters w.
Supervised Learning
Supervised learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs
Y = f(X)
The goal is to approximate the mapping function so well that when you have new input data (x) that you can predict the output variables (Y) for that data.
Hyperparameter
a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training
Multi variate data
More than one variable is measured on each individual in a sample
Centroid
The mean of each variable, collected into a vector
Properties of data that has been sphered
for each variable
mean=0
variance=1
all the variables are mutually uncorrelated
Disadvantages to Euclidean Distance
Euclidean distance is popular for numerical data, but:
it gives equal weight to all variables
it disregards correlations between variables
Reasons for sphering
Sphering the data puts the variables on an equal footing and removes (linear) correlations
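A hedged NumPy sketch of sphering via the eigendecomposition of the sample covariance (an assumed implementation; names are illustrative):

```python
import numpy as np

def sphere(X):
    Xc = X - X.mean(axis=0)                  # mean 0 for each variable
    S = np.cov(Xc, rowvar=False)             # sample covariance matrix
    evals, evecs = np.linalg.eigh(S)         # S = U diag(evals) U^T
    W = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    return Xc @ W                            # sphered data: unit variance, no correlation

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.7]])
Z = sphere(X)
print(np.round(np.cov(Z, rowvar=False), 2))  # approximately the identity matrix
```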
What can i.i.d. stand for
independent and identically distributed
Examples of Unsupervised Learning methods
Clustering
Gaussian distribution / mixture models
Principal Component Analysis
Kohonen maps (SOMs)
Deterministic Model
In deterministic models, the output of the model is fully determined by the parameter values and the initial conditions.
Main aim of classification
Train a machine F to map features to targets
Main aim of regression
Train a machine F to map features to continuous targets
What are the different types/formats of variables?
numerical: continuous or discrete
categorical: nominal or ordinal
binary: presence/absence or 2-state categorical
In a data matrix, what does X_nd refer to?
X_nd is the value of the d-th variable for the n-th individual
i.e. observations are rows.
How to measure association between 2 variables?
Covariance, S_12
Mean and variance of a standardised variable
Mean = 0 Variance = 1
‘Standardised measure of association’ between variables
Correlation coefficient, R_12
Does the correlation coefficient lie in a given range?
Yes
[-1,1]
Interpret what the values of the correlation coefficient imply.
R12 > 0 variables increase and decrease together
R12 < 0 one variable decreases as the other increases
R12 ≈ 0 variables not associated (roughly circular scatter diagram)
How to obtain the correlation coefficient?
obtained by dividing the covariance by the product of the standard deviations
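In the notation above (assuming S_12 is the covariance and S_1, S_2 the standard deviations):
R_12 = S_12 / (S_1 S_2)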
What is the main difference between the correlation coefficient and the covariance?
Correlation values are standardised, whereas covariance values are not.
Correlation is the special case of covariance obtained when the data are standardised.
Negative of using the covariance matrix?
The value of covariance is affected by the change in scale of the variables
What are the entries on the main diagonal of the correlation matrix?
They are all 1, since each variable has unit variance after standardisation
Covariance of sphered variables?
Identity matrix
Is correlation a linear measure?
Yes
What is the squared Mahalanobis distance equal to?
The squared Euclidean distance of the sphered data
Does the Mahalanobis distance use the covariance matrix or the correlation coefficient?
The covariance matrix
(specifically, the inverse of the covariance matrix)
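A standard form, assuming mean vector µ and covariance matrix Σ:
D^2(x) = (x − µ)^T Σ^{-1} (x − µ)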
What are used as the parameters for multivariate normal density?
Mean VECTOR
COVARIANCE MATRIX
How would you estimate the parameters of a multivariate normal density from a sample?
Parameters µ and Σ estimated by maximum likelihood
OR Bayesian statistics.
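The usual maximum-likelihood estimates (standard textbook forms, possibly differing from the course's notation):
µ estimate = (1/N) sum_n x_n
Σ estimate = (1/N) sum_n (x_n − µ)(x_n − µ)^T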
2 usual components of supervised learning
Systematic: average response
Random: variability of observations
Likelihood function
The probability that a datum x was generated by the model, viewed as a function of the parameters w: the conditional probability p(x | w)
The overall likelihood for all the data makes what assumption?
Independence of observations
Error function for Regression with Gaussian noise
Sum-of-squares error
(equivalently, the mean squared error)
Error function for Classification
Cross entropy/log loss
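A standard binary form (assuming targets t_n ∈ {0, 1} and predictions y_n):
E = − sum_n [ t_n log y_n + (1 − t_n) log(1 − y_n) ]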
pseudo-inverse of X
X†X = I when X is square and invertible.
X† gives the best (least-squares) approximate solution when X is rectangular or singular.
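A common closed form when X has full column rank (an assumption about the notation used here):
X† = (X^T X)^{-1} X^T, giving the least-squares weights w = X† t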
Negatives of the Linear Regression model
Contribution to least squares error is largest from targets with largest errors.
Susceptible to outliers.
p(t | x) is not always Gaussian
Error function for Regression with Laplacian noise
sum of absolute errors
Approaches to Non-linear Regressions
Transfer function
MLP - multi layer perceptrons
Basis functions
Negative of MLP approach
May be difficult to learn
Examples of choices for basis functions (In Regression)
Fourier
Radial (Gaussian Radial Basis functions)
Wavelets
General/common properties of basis functions
local; centred on (some of) the training data
General method of using basis functions in Non-linear regression
apply non-linearity before linear regression
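A hedged NumPy sketch of this idea using Gaussian RBF basis functions centred on training points (the width and centre choices are illustrative assumptions):

```python
import numpy as np

def rbf_design_matrix(X, centres, width=1.0):
    # Phi[n, j] = exp(-||x_n - c_j||^2 / (2 * width^2))
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * width ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

centres = X[::10]                              # centre a basis function on every 10th point
Phi = rbf_design_matrix(X, centres)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # linear regression after the non-linear map
print(np.mean((Phi @ w - t) ** 2))             # training mean squared error
```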
Define generalisation error
In supervised learning applications, the generalisation error is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data.
Consequence of too few and too many hidden units in the MLP model?
Too few - inflexible network, poor generalisation
Too many - over-fitting, poor generalisation
How can over-fitting be combatted?
Cross-validation
Outline the steps in Cross-validation and N-fold Cross-validation
- Divide the training data into two sets: training and validation (a surrogate test set)
- Train on training set
- Evaluate “test” error on validation set
- Adjust number of parameters/hidden units for best generalisation on validation set
K-fold validation
- reshuffle the data
- Randomly partition into k training and validation sets and average the validation error over all k sets.
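A minimal NumPy sketch of k-fold cross-validation with a plain least-squares model (all names are illustrative, not the course's code):

```python
import numpy as np

def k_fold_cv_error(X, t, k=5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))            # reshuffle the data
    folds = np.array_split(idx, k)           # partition into k validation sets
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, *_ = np.linalg.lstsq(X[train], t[train], rcond=None)
        errors.append(np.mean((X[val] @ w - t[val]) ** 2))
    return np.mean(errors)                   # validation error averaged over all k folds

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(k_fold_cv_error(X, t, k=5))
```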
Downside of k-fold cross validation?
K times more expensive than ordinary cross-validation
Not as good on small data sets
Define regularisation
Regularisation is the process of adding information in order to prevent overfitting (by penalising a model that would otherwise over-fit)
i.e. penalise overly complicated models via regularisation terms
Examples of regularisation terms
Weight decay regularisation
Minimum description length
Support Vector Machines
How to determine alpha in Weight Decay regularisation?
Cross validation
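A common form of the weight-decay penalised error (the exact form used in the course may differ):
E_reg(w) = E(w) + α ||w||^2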
What types of problems are SVMs designed for: Regression or classification?
Classification
Is the SVM model nonlinear or linear?
Linear
Aim of linear regression
Fitting a line, plane or hyperplane through data
Posterior probability
The probability that a data point belongs to a certain class p( C_k | x )
Prior class probability
p( C_k )
Often the proportion of samples in Ck
Bayes error
The Bayes error rate is the minimum error that can be achieved in a classification problem, obtained by assigning x to the class with the largest posterior probability. It is achieved only if the posterior class probabilities are known exactly.
Confusion matrix, Ĉ
Ĉ_kj is the number of instances of class k that are classified as class j. (square matrix)
Confusion RATE matrix, C
Normalise the confusion matrix, Ĉ, so that each row sums to 1
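A hedged NumPy sketch of building Ĉ and normalising its rows to obtain C (names are illustrative):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    C_hat = np.zeros((n_classes, n_classes), dtype=int)
    for k, j in zip(y_true, y_pred):
        C_hat[k, j] += 1                      # instances of class k classified as class j
    return C_hat

y_true = np.array([0, 0, 1, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2, 2])
C_hat = confusion_matrix(y_true, y_pred, 3)
C = C_hat / C_hat.sum(axis=1, keepdims=True)  # each row now sums to 1
print(C_hat)
print(C)
```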
False positive
A test result which wrongly indicates that a particular condition or attribute is present.
What types of classifiers can be used in ROC analysis?
Soft and Probabilistic
Soft Classifier
Produces a score as an output: yn = F(xn; w) ∈ R, i.e. the output is a continuous value.
Classify xn to C1 if F(xn; w) > λ for some threshold λ; otherwise allocate xn to C2.
Probabilistic Classifier
Classifier produces a probability score [0,1]
yn = F(xn; w) ∈ (0, 1)
Can be interpreted as posterior probability p(C1 | xn)
Maximum accuracy is achieved if xn is assigned to C1 when F(xn; w) > λ = 1/2; otherwise allocate xn to C2
Hard Classifier
Classifier produces an allocation to a class, i.e. the output is a 'class'/category: yn = F(xn; w) ∈ {C1, C2}
In ROC analysis, determine the measurements on the axis of the ROC curve
y-axis ~ true positive rate
x-axis ~ False positive rate
What does ROC stand for?
Receiver Operating Characteristic
How to construct an ROC curve
For each threshold:
1) Evaluate confusion matrix
2) Plot TPR vs FPR
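A hedged NumPy sketch of this construction for a soft classifier, sweeping the threshold over the observed scores (names are illustrative):

```python
import numpy as np

def roc_curve(scores, labels):
    # labels: 1 for the positive class, 0 otherwise; scores: F(x; w)
    tpr, fpr = [], []
    for lam in np.sort(np.unique(scores))[::-1]:
        pred = (scores >= lam).astype(int)
        tp = np.sum((pred == 1) & (labels == 1))
        fp = np.sum((pred == 1) & (labels == 0))
        tpr.append(tp / np.sum(labels == 1))
        fpr.append(fp / np.sum(labels == 0))
    return np.array(fpr), np.array(tpr)

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])
fpr, tpr = roc_curve(scores, labels)
print(np.column_stack([fpr, tpr]))   # points that trace the ROC curve
```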
What does the diagonal line in a ROC curve depict?
The diagonal line shows the performance of the classifier that allocates at random
Another term for discriminating function
Separating surface
How to distinguish which class to allocate to by a discriminant function?
The sign
Discriminant function = 0 gives the decision boundary
Define a function y(x) so that x is classified as class C1 iff y(x) > 0
Is logistic regression a regression model or a classification model?
Classifier
Error function for logistic regression
Cross Entropy function
Define separable classes
If the class conditional density functions do not overlap then the classes are separable.
Give an example of when Bayes error is 0
When classes are separable. The class conditional density functions do not overlap.
Linear separability
Classes that can be separated by a line or (hyper)plane
What are the outputs of the Logistic discriminant?
Approximate posterior probabilities
How to use basis functions for classification
Transform x; then use logistic regression
Non-linear models for classification
MLP
Basis functions
Cover’s theorem
A dataset mapped to a higher dimensional space is more likely to be linearly separable
How to design a hard classifier (from the MLP model)
Change the activation function to a step function
Loss matrix L_kj
Lkj quantifies the penalty of assigning x to Ck when it belongs to Cj.
Costs are defined only up to an additive constant.
Often we count the cost of correct classification as zero, so
Lkk = 0 for all k
Risk of a loss matrix
Average cost of making a classification.
Averaged over all classes.
Activation function in logistic discriminant?
Sigmoidal
g(a) = 1/(1+exp(-a))
Output of logistic discriminant?
0 < y(x) < 1
output may be interpreted as a probability of class membership
outputs approximate posterior probabilities
Types of kernels
linear
RBF
sigmoidal
Properties of the types of basis function used in classification?
local
radially symmetric
centred on (some of) the training data
Role of w0 in Linear Discriminant analysis?
w0 ~ bias
Controls the distance of the (linear) boundary from the origin
Order of partition in clustering - give another term for this and define.
Order of partition = model complexity
How many clusters we are using to model the data
Approaches to clustering
Sequential clustering
Hierarchical algorithms
Algorithms based on optimisation
Spectral clustering
Graph-theoretical methods
Statistical approaches
Properties of clusters in hard clustering
No cluster can be empty
The union of all the clusters is the whole data set
The intersection of 2 distinct clusters is the empty set
Inputs of the Sequential clustering
The n input elements, in a given order
dissimilarity measure d(-,-)
Cluster radius
Max number of clusters Q
Pros and cons of Hierarchical clustering
PROS
Since it generates a hierarchy of partitions with different resolutions, it is flexible and allows choice of the resolution level
Can be applied to many different data types (not only vectors)
CONS
Heavy on the computational side
Name the 2 types of algorithms in Hierarchical clustering
Agglomerative
Divisive
Dendrogram of clustering
A diagram representing the sequence of (nested) clusterings
Goal of clustering (what do we want to optimise?)
MAXIMISE the INTRA-cluster similarity,
while at the same time,
MINIMISE the INTER-cluster similarity
In general, a partition is “good” if related clusters are compact and separated
Initial starting point in agglomerative and divisive clustering approaches.
Agglomerative: start from N singleton clusters
Divisive: start from a single cluster containing all the data
3 main graph clustering methods
Topology based (minimum spanning tree)
Spectral (graph in matrix form)
Random walks
Negatives of graphical clustering methods
demanding in terms of space and time computational complexity
What are you trying to minimise in K-clustering
The sum of the squared Euclidean distances from each data point to all other data points within its cluster
The sum of the squared Euclidean distances from each data point to a cluster representative, within each cluster
Pros and cons of K-means clustering
PROS
It is easy to implement
can be applied to virtually any input domain (i.e., input data that is not defined as vectors of real numbers)
CONS
typically finds sub-optimal solutions
Does not perform well on high-dimensional data and on datasets with clusters that cannot be modeled by covariances
Doesn't work well for clusters with different variances
Inputs of k-means clustering
Data
Order K (no. of clusters)
Max no. of iterations
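A minimal NumPy sketch of the standard (Lloyd's) k-means iteration, given these inputs (illustrative code, with no empty-cluster handling):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]   # initialise from random data points
    for _ in range(max_iter):
        # assign each point to its nearest cluster representative
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # update each representative to the mean of its cluster
        new_centres = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return assign, centres

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
assign, centres = k_means(X, k=2)
print(centres)
```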
What are kernel functions a measure of (in clustering context)?
Similarity
NOT distance
A method of clustering for non-spherical/ellipsoid data/nonlinear patterns.
Kernel k-means
Example of a kernel function commonly used in kernel k-means clustering
RBF
k(x, z) = exp( −γ (||x − z||_2)^2 )
(Gaussian when γ = 1/(2σ^2))
In kernel k-means, is the mapping of ϕ implicit/explicit?
Implicit
Negatives of kernel k-means clustering
More hyper-parameters to define
Since the embedding is implicit, the centroids are not defined explicitly and there is no explicit formulation of the model.
Hence it is difficult to extend to out-of-sample data.
Another term for a reference partition in clustering validation?
Ground-truth
Key difference of Spectral clustering
Uses a similarity MATRIX
Can spectral clustering be applied to data that isn’t in vector format?
Yes, as long as you can quantify similarities between inputs.
Give an example of S in spectral clustering for vector data input
RBF kernel
How to determine the number of connected components in spectral clustering?
The number of zero eigenvalues of L, the Laplacian matrix
What are the range of eigenvalues of the normalised Laplacian matrix in spectral clustering?
[0,2]
Is it guaranteed that there will be at least one zero eigenvalue of L in spectral clustering?
Yes, since the graph always contains at least one connected component
What are indicator vectors in Spectral Clustering?
Binary vectors
That encode which vertices belong to which (connected) component
Complexity of Spectral clustering
n data points (S is nxn)
Then the eigendecomposition has complexity O(n^3)
S_ij in spectral clustering
S is the similarity matrix (symmetric)
S_ij = s( x_i, x_j ) denotes the similarity between x_i and x_j
In spectral clustering, define: W, deg(vi), D, Volume, U, L
W: weighted adjacency matrix
deg(vi): sum of all the weights of the edges attached to vi
D = diag(deg(v1), deg(v2), …, deg(vn))
Volume: if A is a subset of vertices, the volume is the sum of the degrees of all the vertices in A
U: matrix of all eigenvectors, ordered by increasing eigenvalue
L = D - W
Properties of L in spectral clustering
Positive semi-definite, i.e. it has non-negative real eigenvalues
λ1 = 0 is always an eigenvalue of L, with the all-ones vector 1 as the corresponding eigenvector
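A hedged NumPy sketch of unnormalised spectral clustering built from these quantities: RBF similarities, L = D − W, then a simple k-means on the first K eigenvectors (parameter choices are illustrative):

```python
import numpy as np

def spectral_clustering(X, K, gamma=1.0, seed=0):
    # similarity / weighted adjacency matrix (fully connected graph, RBF weights)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-gamma * d2)
    D = np.diag(W.sum(axis=1))            # degree matrix
    L = D - W                             # unnormalised graph Laplacian
    evals, evecs = np.linalg.eigh(L)      # eigenvalues in increasing order
    U = evecs[:, :K]                      # first K eigenvectors as new features
    # simplified k-means on the rows of U (no empty-cluster handling)
    rng = np.random.default_rng(seed)
    centres = U[rng.choice(len(U), K, replace=False)]
    for _ in range(100):
        assign = (((U[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)).argmin(axis=1)
        centres = np.array([U[assign == j].mean(axis=0) for j in range(K)])
    return assign

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(2, 0.2, (30, 2))])
print(spectral_clustering(X, K=2))
```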
Types of similarity graphs:
For fully connected graphs, use RBF.
k-nearest neighbour: connect vertices vi and vj if vj is among the k nearest neighbours of vi, or vice-versa
epsilon-nearest neighbour: Thresholding of weights
Wij = wij if d(vi, vj) ≤ epsilon; 0 otherwise
The curse of dimensionality
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.
if the amount of available training data is fixed, then overfitting occurs if we keep adding dimensions. On the other hand, if we keep adding dimensions, the amount of training data needs to grow exponentially fast to maintain the same coverage and to avoid overfitting.
An orthonormal matrix
A square, real-valued matrix U
its rows and columns are orthonormal vectors
UU^T = I, where I is an identity matrix
U^{-1} = U^T
PCA basis vectors
Principal components
What is the choice of U trying to minimise in PCA?
The approximation error
Are the PCs correlated?
No
When should you use PCA on the correlation matrix instead of the covariance?
When input features are on very different scales/ranges, i.e. the standard deviations of the features are very different
In PCA, what do the corresponding eigenvalues of eigendecomposition tell you?
λk is the k-th eigenvalue
Measures the importance of k-th PC
(variance)
Quantify the mean squared projection (variance) onto each principal component
Generally, how to determine how many PCs you need?
Check the cumulative explained variance and select a number of PCs that explains at least ~ 80%
Cummulative explained variance Graph
First divide each eigenvalue by the sum of all the eigenvalues
Then plot the cumulative sum of these ratios, adding one at a time
This can help determine the cut-off for the number of PCs
Note that each eigenvalue gives the variance covered by its PC
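A hedged NumPy sketch of computing the cumulative explained variance and applying the ~80% rule of thumb (names are illustrative):

```python
import numpy as np

def pca_explained_variance(X):
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)
    evals = np.linalg.eigvalsh(S)[::-1]        # eigenvalues, largest first
    ratios = evals / evals.sum()               # divide each eigenvalue by their sum
    return np.cumsum(ratios)                   # cumulative explained variance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
cum = pca_explained_variance(X)
n_pc = int(np.searchsorted(cum, 0.8) + 1)      # smallest number of PCs explaining >= 80%
print(cum, n_pc)
```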
What is a significant unique (good) feature of PCA?
Allows back-projection into the input space from the space spanned by the PCs
Aims of PCA
PCA minimizes the approximation error when projecting data onto any linear subspace
Provide “natural” coordinates for the data: uncorrelated and compact
As well as dimensionality reduction, what else can PCA be used for?
Noise reduction
2 main types of classifiers
Generative and discriminative
Generative: Gaussian mixture model (linear regression, LDA)
Discriminative: logistic regression
generative classifiers (joint distribution) and discriminative classifiers (conditional distribution or no distribution)
If the observed data are truly sampled from the generative model, then fitting the parameters of the generative model to maximize the data likelihood is a common method.
p( x | C_k )
p( C_k | x )
p( x | C_k ): class-conditional density
p( C_k | x ): posterior probability