Learning From Data Flashcards

1
Q

Difference between Classification and Regression

A

Classification is the task of predicting a discrete class label.
Regression is the task of predicting a continuous quantity.
2
Q

multi-class classification problem

A

A classification problem with more than two classes, where each example belongs to exactly one class

3
Q

multi-label classification problem.

A

A classification problem where an example may be assigned more than one class label at the same time

4
Q

datum

A

The singular of "data":
a piece of information
a fixed starting point of a scale or operation

5
Q

k nearest neighbours

A

1) find the k nearest neighbours to x in the training data

2) assign x to the class most common among those k neighbours (majority vote)
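A minimal sketch of this procedure in Python, assuming numeric feature vectors and Euclidean distance (function and variable names are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # 1) find the k nearest neighbours to x in the training data
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    # 2) assign x to the class most common among those neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```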

6
Q

Unsupervised Learning

A

Unsupervised learning is where you only have input data (X) and no corresponding output variables.

The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.

Given: data D = {x_n}, n = 1, …, N, and a parameterised generative model p(x; w) describing how the data might be generated, depending on parameters w.

7
Q

Supervised Learning

A

Supervised learning is where you have input variables (X) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs:

Y = f(X)

The goal is to approximate the mapping function so well that, given new input data (X), you can predict the output variable (Y) for that data.

8
Q

Hyperparameter

A

a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training

9
Q

Multivariate data

A

More than one variable is measured on each individual in a sample

10
Q

Centroid

A

The vector formed from the mean of each variable

11
Q

Properties of data that has been sphered

A

For each variable:
mean = 0
variance = 1
All the variables are mutually uncorrelated.
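A small sketch of one common way to sphere (whiten) data, via the eigendecomposition of the sample covariance, followed by a check of these properties; this is an assumed construction, not necessarily the one used in the course:

```python
import numpy as np

def sphere(X):
    """Whiten X (rows = observations) so each variable has mean 0, variance 1,
    and all variables are mutually uncorrelated."""
    Xc = X - X.mean(axis=0)                      # centre the data
    S = np.cov(Xc, rowvar=False)                 # sample covariance matrix
    evals, U = np.linalg.eigh(S)                 # eigendecomposition of S
    W = U @ np.diag(1.0 / np.sqrt(evals)) @ U.T  # S^(-1/2)
    return Xc @ W

Z = sphere(np.random.randn(200, 3) @ np.random.randn(3, 3))
print(np.round(Z.mean(axis=0), 6))           # approximately zero means
print(np.round(np.cov(Z, rowvar=False), 6))  # approximately the identity matrix
```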

12
Q

Disadvantages to Euclidean Distance

A

Euclidean distance is popular for numerical data, but:
it gives equal weight to all variables
it disregards correlations between variables

13
Q

Reasons for sphering

A

Sphering the data puts the variables on an equal footing and removes (linear) correlations

14
Q

What can i.i.d. stand for

A

independent and identically distributed

15
Q

Examples of Unsupervised Learning methods

A
Clustering
Gaussian distribution
Mixture model
Principal Component Analysis
Kohonen maps (SOMs)
16
Q

Deterministic Model

A

In deterministic models, the output of the model is fully determined by the parameter values and the initial conditions.

17
Q

Main aim of classification

A

Train a machine F to map features to (discrete) class targets

18
Q

Main aim of regression

A

Train a machine F to map features to continuous targets

19
Q

What are the different types/formats of variables?

A

numerical: continuous or discrete
categorical: nominal or ordinal
binary: presence/absence or 2-state categorical

20
Q

In a data matrix, what does X_nd refer to?

A

X_nd is the value of the d-th variable for the n-th individual,

i.e. observations are rows.

21
Q

How to measure association between 2 variables?

A

Covariance, (S_12)^2

22
Q

Mean and variance of a standardised variable

A
Mean = 0 
Variance = 1
23
Q

‘Standardised measure of association’ between variables

A

Correlation coefficient, R_12

24
Q

Does the correlation coefficient lie in a given range?

A

Yes

[-1,1]

25
Q

Interpret what the values of the correlation coefficient imply.

A

R_12 > 0: variables increase and decrease together
R_12 < 0: one variable decreases as the other increases
R_12 ≈ 0: variables not associated (roughly circular scatter diagram)

26
Q

How to obtain the correlation coefficient?

A

obtained by dividing the covariance by the product of the standard deviations
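A quick numeric check of this relationship, using illustrative random data (ddof=1 matches NumPy's default covariance normalisation):

```python
import numpy as np

x = np.random.randn(500)
y = 2 * x + np.random.randn(500)

cov_xy = np.cov(x, y)[0, 1]                           # covariance of x and y
r_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))   # divide by product of std devs
r_builtin = np.corrcoef(x, y)[0, 1]                   # library correlation coefficient
print(r_manual, r_builtin)                            # the two values agree
```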

27
Q

What is the main difference between the correlation coefficient and the covariance?

A

Correlation values are standardised, whereas covariance values are not.
Correlation is a special case of covariance, obtained when the data have been standardised.

28
Q

Negative of using the covariance matrix?

A

The value of covariance is affected by the change in scale of the variables

29
Q

What are the entries of the main diagonal on the correlation matrix?

A

The entries on the main diagonal of the correlation matrix are all 1, since the variables have been standardised

30
Q

Covariance of sphered variables?

A

Identity matrix

31
Q

Is correlation a linear measure?

A

Yes

32
Q

What is the squared Mahalanobis distance equal to?

A

The squared Euclidean distance of the sphered data
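A small numeric check of this equivalence, assuming the sample mean and covariance are used (names are illustrative):

```python
import numpy as np

X = np.random.randn(300, 4) @ np.random.randn(4, 4)   # correlated data
mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)

x = X[0]
# squared Mahalanobis distance of x from the mean
d2_mahal = (x - mu) @ np.linalg.inv(S) @ (x - mu)

# squared Euclidean distance of the sphered point from the (sphered) mean
evals, U = np.linalg.eigh(S)
W = U @ np.diag(1.0 / np.sqrt(evals)) @ U.T            # S^(-1/2)
z = (x - mu) @ W
d2_euclid = z @ z

print(d2_mahal, d2_euclid)   # the two values agree
```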

33
Q

Does the mahalanobis distance use the Covariance matrix or the Correlation coefficient?

A

The covariance matrix.

More precisely, it is the inverse of the covariance matrix that appears in the distance.

34
Q

What are used as the parameters for multivariate normal density?

A

Mean vector, µ

Covariance matrix, Σ

35
Q

How would you estimate the parameters of a multivariate normal density from a sample?

A

Parameters µ and Σ estimated by maximum likelihood

OR Bayesian statistics.

36
Q

2 usual components of supervised learning

A

Systematic: average response
Random: variability of observations

37
Q

Likelihood function

A

The probability that a datum x was generated by the model p, i.e. the conditional probability p(x|w), viewed as a function of the parameters w

38
Q

The overall likelihood for all the data makes what assumption?

A

Independence of observations

39
Q

Error function for Regression with Gaussian noise

A

Sum of squared errors

(equivalently, mean squared error)

40
Q

Error function for Classification

A

Cross entropy/log loss

41
Q

pseudo-inverse of X

A

X†X = I when X is square and invertible (in that case X† = X^-1).

When X is rectangular or singular, X† gives the best (least-squares) approximation to an inverse.
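A short illustration using NumPy's pseudo-inverse, showing both the least-squares use and the square-invertible case (data here is a placeholder):

```python
import numpy as np

# Over-determined system: more observations than parameters
X = np.random.randn(100, 3)
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true + 0.1 * np.random.randn(100)

X_pinv = np.linalg.pinv(X)       # pseudo-inverse X†
w_hat = X_pinv @ t               # least-squares solution w = X† t
print(w_hat)                     # close to w_true

# For a square, invertible X, X† coincides with the ordinary inverse
A = np.random.randn(3, 3)
print(np.allclose(np.linalg.pinv(A), np.linalg.inv(A)))
```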

42
Q

Negatives of the Linear Regression model

A

Contribution to least squares error is largest from targets with largest errors.
Susceptible to outliers.
p(t | x) is not always Gaussian

43
Q

Error function for Regression with Laplacian noise

A

sum of absolute errors

44
Q

Approaches to Non-linear Regressions

A

Transfer function
MLP - multi layer perceptrons
Basis functions

45
Q

Negative of MLP approach

A

May be difficult to learn

46
Q

Examples of choices for basis functions (In Regression)

A

Fourier
Radial (Gaussian Radial Basis functions)
Wavelets

47
Q

General/common properties of basis functions

A
local
centred on (some of) the training data
48
Q

General method of using basis functions in Non-linear regression

A

Apply a non-linear transformation (the basis functions) to the inputs, then perform linear regression on the transformed features

49
Q

Define generalisation error

A

In supervised learning applications, it is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data.

50
Q

Consequence of too few and too many hidden units in the MLP model?

A

Too few - inflexible network, poor generalisation

Too many - over-fitting, poor generalisation

51
Q

How can over-fitting be combatted?

A

Cross-validation

52
Q

Outline the steps in Cross-validation and N-fold Cross-validation

A
  1. Divide the training data into two sets: training and validation (a surrogate test set)
  2. Train on the training set
  3. Evaluate the "test" error on the validation set
  4. Adjust the number of parameters/hidden units for best generalisation on the validation set

K-fold cross-validation:

  • Reshuffle the data
  • Randomly partition into k training and validation sets and average the validation error over all k sets (see the sketch below)
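A minimal sketch of k-fold cross-validation, here using scikit-learn's KFold with a placeholder model and data:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X, t = np.random.randn(200, 5), np.random.randn(200)    # placeholder data

kf = KFold(n_splits=5, shuffle=True, random_state=0)     # reshuffle, then partition
errors = []
for train_idx, val_idx in kf.split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], t[train_idx])   # train on training set
    pred = model.predict(X[val_idx])                            # evaluate on validation set
    errors.append(mean_squared_error(t[val_idx], pred))

print(np.mean(errors))   # average validation error over all k folds
```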
53
Q

Downside of k-fold cross validation?

A

K times more expensive than ordinary cross-validation

Not as good on small data sets

54
Q

Define regularisation

A

Regularisation is the process of adding information in order to prevent overfitting,
i.e. penalise overly complicated models by adding regularisation terms.

55
Q

Examples of regularisation terms

A

Weight decay regularisation
Minimum description length
Support Vector Machines

56
Q

How to determine alpha in Weight Decay regularisation?

A

Cross validation

57
Q

What types of problems are SVMs designed for: Regression or classification?

A

Classification

58
Q

Is the SVM model nonlinear or linear?

A

Linear

59
Q

Aim of linear regression

A

Fitting a line, plane or hyperplane through data

60
Q

Posterior probability

A
The probability that a data point belongs to a certain class
p( C_k | x )
61
Q

Prior class probability

A

p( C_k )

Often the proportion of samples in Ck

62
Q

Bayes error

A
Bayes error rate is the minimum error that may be achieved in a classification problem
By assigning x to the class with the largest posterior probability
It is achieved if the posterior class probabilities are known exactly
63
Q

Confusion matrix, Ĉ

A
Ĉ_kj is the number of instances of class k that are
classified as class j.
(square matrix)
64
Q

Confusion RATE matrix, C

A

Normalise the confusion matrix, Ĉ, so that each row sums to 1

65
Q

False positive

A

A test result which wrongly indicates that a particular condition or attribute is present.

66
Q

What types of classifiers can be used in ROC analysis?

A

Soft and Probabilistic

67
Q

Soft Classifier

A
Produces a score as an output.
yn = F(xn; w) ∈ R
i.e. the output is a continuous value
Classify xn to C1 if F(xn; w) > λ for some threshold;
otherwise allocate xn to C2
68
Q

Probabilistic Classifier

A

Classifier produces a probability score [0,1]
yn = F(xn; w) ∈ (0, 1)
Can be interpreted as posterior probability p(C1 | xn)
Maximum accuracy if xn is assigned to C1 if
F(xn; w) > λ = 1/2
otherwise allocate xn to C2

69
Q

Hard Classifier

A
Classifier produces an allocation to a class
i.e. the output is a 'class'/category
yn = F(xn; w) ∈ {C1, C2}
70
Q

In ROC analysis, determine the measurements on the axis of the ROC curve

A

y-axis: true positive rate

x-axis: false positive rate

71
Q

What does ROC stand for?

A

Receiver Operating Characteristic

72
Q

How to construct an ROC curve

A

For each threshold value λ:

1) evaluate the confusion matrix (giving the TPR and FPR)
2) plot the point (FPR, TPR)

Sweeping λ traces out the ROC curve.
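A sketch of this construction for a soft classifier's scores, sweeping the threshold by hand; binary 0/1 labels with both classes present are assumed, and names are illustrative:

```python
import numpy as np

def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over the scores."""
    points = []
    for lam in np.sort(np.unique(scores))[::-1]:      # each threshold value
        pred = (scores >= lam).astype(int)
        tp = np.sum((pred == 1) & (labels == 1))
        fp = np.sum((pred == 1) & (labels == 0))
        fn = np.sum((pred == 0) & (labels == 1))
        tn = np.sum((pred == 0) & (labels == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))   # (FPR, TPR)
    return points
```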

73
Q

What does the diagonal line in a ROC curve depict?

A

The diagonal line shows the performance of the classifier that allocates at random

74
Q

Another term for discriminating function

A

Separating surface

75
Q

How to distinguish which class to allocate to by a discriminant function?

A

The sign
Discriminant function = 0 gives the decision boundary

Define a function y(x) so that x is classified as class C1
iff y(x) > 0
76
Q

Is logistic regression a regression model or a classification model?

A

Classifier

77
Q

Error function for logistic regression

A

Cross Entropy function

78
Q

Define separable classes

A

If the class conditional density functions do not overlap then the classes are separable.

79
Q

Give an example of when Bayes error is 0

A
When classes are separable.
The class conditional density functions do not overlap.
80
Q

Linear separability

A

Classes that can be separated by a line or (hyper)plane

81
Q

What are the outputs of the Logistic discriminant?

A

Approximate posterior probabilities

82
Q

How to use basis functions for classification

A

Transform x; then use logistic regression

83
Q

Non-linear models for classification

A

MLP

Basis functions

84
Q

Cover’s theorem

A

A dataset mapped (non-linearly) to a higher-dimensional space is more likely to be linearly separable than in the original space

85
Q

How to design a hard classifier (from the MLP model)

A

Make the output activation function a step function

86
Q

Loss matrix L_kj

A

L_kj quantifies the penalty of assigning x to C_k when it belongs to C_j.
Costs are relative to an additive constant.
Often we count the cost of correct classification as zero, so
L_kk = 0 for all k.

87
Q

Risk of a loss matrix

A

Average cost of making a classification.

Averaged over all classes.

88
Q

Activation function in logistic discriminant?

A

Sigmoidal

g(a) = 1/(1+exp(-a))

89
Q

Output of logistic discriminant?

A

0 < y(x) < 1
output may be interpreted as a probability of class membership
outputs approximate posterior probabilities

90
Q

Types of kernels

A

linear
RBF
sigmoidal

91
Q

Properties of the types of basis function used in classification?

A

local
radially symmetric
centred on (some of) the training data

92
Q

Role of w0 in Linear Discriminant analysis?

A

w0 ~ bias

Controls the distance of the (linear) boundary from the origin

93
Q

Order of partition in clustering - give another term for this and define.

A

Order of partition = model complexity

How many clusters we are using to model the data

94
Q

Approaches to clustering

A
Sequential clustering
Hierarchical algorithms
Algorithms based on optimization
Spectral clustering
Graph-theoretical methods
Statistical approaches
95
Q

Properties of clusters in hard clustering

A

No cluster can be empty
The union of all the clusters is the Partition set (set of all clusters)
The intersection of 2 distinct clusters is the empty set

96
Q

Inputs of the Sequential clustering

A

The n input elements, in order
A dissimilarity measure d(·,·)
The cluster radius
The maximum number of clusters Q

97
Q

Pros and cons of Hierarchical clustering

A

PROS
Since it generates a hierarchy of partitions with different resolutions, it is flexible and allows choice of resolution level
Can be applied to many different data types (not only vectors)
CONS
Computationally expensive

98
Q

Name the 2 types of algorithms in Hierarchical clustering

A

Agglomerative

Divisive

99
Q

Dendrogram of clustering

A

Sequence of clusterings

100
Q

Goal of clustering (what do we want to optimise?)

A

MAXIMISE the INTRA-cluster similarity,
while at the same time,
MINIMISE the INTER-cluster similarity

In general, a partition is “good” if related clusters are compact and separated

101
Q

Initial starting point in agglomerative and divisive clustering approaches.

A

Agglomerative: start from N singleton clusters
Divisive: start from a single cluster containing all the data

102
Q

3 main graph clustering methods

A
Topology based (min spanning tree)
Spectral (graph in matrix form)
Random walks
103
Q

Negatives of graphical clustering methods

A

demanding in terms of space and time computational complexity

104
Q

What are you trying to minimise in K-clustering

A

The sum of the squared Euclidean distances between each data point and its cluster representative (centroid), summed over all clusters.

(Equivalently, up to a scaling factor, the sum of squared distances between all pairs of points within each cluster.)

105
Q

Pros and cons of K-means clustering

A

PROS
It is easy to implement
Can be applied to virtually any input domain (i.e. input data that is not defined as vectors of real numbers)
CONS
Typically finds sub-optimal solutions
Does not perform well on high-dimensional data or on datasets with clusters that cannot be modelled by covariances
Does not work well for clusters with different variances

106
Q

Inputs of k-means clustering

A

Data
Order K (no. of clusters)
Max no. of iterations
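A compact sketch of the standard (Lloyd's) k-means loop using exactly these inputs; real implementations add better initialisation, empty-cluster handling and convergence checks:

```python
import numpy as np

def kmeans(X, K, max_iter=100):
    """Basic k-means: X has observations in rows, K clusters, max_iter iterations."""
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), K, replace=False)]        # initial representatives
    for _ in range(max_iter):
        # assign each point to the nearest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```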

107
Q

What are kernel functions a measure of (in clustering context)?

A

Similarity

NOT distance

108
Q

A method of clustering for non-spherical/ellipsoid data/nonlinear patterns.

A

Kernel k-means

109
Q

Example of a kernel function commonly used in kernel k-means clustering

A

RBF
k(x, z) = exp( −γ (||x − z||_2)^2 )
(Gaussian kernel when γ = 1 / (2σ^2))
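The same kernel written as code, with γ derived from σ under the convention above (values are illustrative):

```python
import numpy as np

def rbf_kernel(x, z, gamma):
    # k(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

sigma = 1.5
gamma = 1.0 / (2 * sigma ** 2)     # Gaussian kernel when gamma = 1 / (2 sigma^2)
print(rbf_kernel(np.array([0.0, 0.0]), np.array([1.0, 1.0]), gamma))
```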

110
Q

In kernel k-means, is the mapping of ϕ implicit/explicit?

A

Implicit

111
Q

Negatives of kernel k-means clustering

A

More hyper-parameters to define
Since the embedding is implicit, the centroids are not defined explicitly, so there is no explicit formulation of the model.
Hence it is difficult to extend to out-of-sample data.

112
Q

Another term for a reference partition in clustering validation?

A

Ground-truth

113
Q

Key difference of Spectral clustering

A

Uses a similarity MATRIX

114
Q

Can spectral clustering be applied to data that isn’t in vector format?

A

Yes, as long as you can quantify similarities between inputs.

115
Q

Give an example of S in spectral clustering for vector data input

A

RBF kernel

116
Q

How to determine the number of connected components in spectral clustering?

A

The number of zero eigenvalues of L, the Laplacian matrix

117
Q

What are the range of eigenvalues of the normalised Laplacian matrix in spectral clustering?

A

[0,2]

118
Q

Is it guaranteed that there will be at least one zero eigenvalue of L in spectral clustering?

A

Yes, since the graph always contains at least one connected component

119
Q

What are indicator vectors in Spectral Clustering?

A

Binary vectors that encode which vertices belong to which (connected) component

120
Q

Complexity of Spectral clustering

A

With n data points, S is n×n.

The eigendecomposition then has complexity O(n^3).

121
Q

S_ij in spectral clustering

A

S is the (symmetric) similarity matrix.
S_ij = s( x_i, x_j ) denotes the similarity between x_i and x_j.

122
Q
In spectral clustering, define:
W
deg(vi)
D
Volume
U
L
A

W: weighted adjacency matrix
deg(v_i): sum of all the weights of the edges attached to v_i
D = diag(deg(v_1), deg(v_2), …, deg(v_n))
Volume: if A is a subset of vertices, the volume is the sum of the degrees of all the vertices in A
U: matrix whose columns are the eigenvectors, ordered by increasing eigenvalue
L = D − W
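A small sketch that builds these quantities from a weight matrix W and counts the zero eigenvalues of L, which equals the number of connected components (the example graph is illustrative):

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalised Laplacian L = D - W from a symmetric weight matrix W."""
    deg = W.sum(axis=1)            # deg(v_i): sum of edge weights at vertex i
    D = np.diag(deg)
    return D - W

# Two disconnected pairs of vertices -> two connected components
W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

L = graph_laplacian(W)
evals = np.linalg.eigvalsh(L)          # real, non-negative (L is positive semi-definite)
print(np.sum(np.isclose(evals, 0)))    # number of zero eigenvalues = 2 components
```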

123
Q

Properties of L in spectral clustering

A

Positive semi-definite, i.e. it has non-negative real eigenvalues.
λ_1 = 0 is always an eigenvalue of L, with the constant vector 1 as the corresponding eigenvector.

124
Q

Types of similarity graphs:

A

Fully connected graph: connect all pairs of vertices, with weights given by e.g. the RBF kernel.

k-nearest neighbour: connect v_i and v_j if v_j is among the k nearest neighbours of v_i, or vice-versa.

ε-nearest neighbour: thresholding of weights,
W_ij = w_ij if d(v_i, v_j) ≤ ε; 0 otherwise.

125
Q

The curse of dimensionality

A

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.

if the amount of available training data is fixed, then overfitting occurs if we keep adding dimensions. On the other hand, if we keep adding dimensions, the amount of training data needs to grow exponentially fast to maintain the same coverage and to avoid overfitting.

126
Q

An orthonormal matrix

A

A square, real-valued matrix U whose rows and columns are orthonormal vectors:
UU^T = I, where I is the identity matrix
U^-1 = U^T

127
Q

PCA basis vectors

A

Principal components

128
Q

What is the choice of U trying to minimise in PCA?

A

The approximation error

129
Q

Are the PCs correlated?

A

No

130
Q

When should you use PCA on the correlation matrix instead of the covariance?

A

When the input features are on very different scales, i.e. the standard deviations of the features are very different.

131
Q

In PCA, what do the corresponding eigenvalues of eigendecomposition tell you?

A

λ_k is the k-th eigenvalue; it measures the importance of the k-th PC (its variance).

The eigenvalues quantify the mean squared projection (variance) onto each principal component.

132
Q

Generally, how to determine how many PCs you need?

A

Check the cumulative explained variance and select a number of PCs that explains at least ~80% of it.

133
Q

Cummulative explained variance Graph

A

First divide each eigenvalue by the sum of all the eigenvalues.
Then plot the running sum of these fractions, adding one PC at a time.
This can help determine where the cut-off for the number of PCs should be.
Note that each eigenvalue gives the variance explained by its PC.
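A minimal sketch of this computation from the eigenvalues of the covariance matrix, using the ~80% rule of thumb mentioned above (data is a placeholder):

```python
import numpy as np

X = np.random.randn(200, 6) @ np.random.randn(6, 6)   # placeholder data
S = np.cov(X, rowvar=False)
evals = np.linalg.eigvalsh(S)[::-1]                    # eigenvalues, largest first

explained = evals / evals.sum()                        # fraction of variance per PC
cumulative = np.cumsum(explained)                      # running sum, one PC at a time
n_pcs = np.argmax(cumulative >= 0.80) + 1              # smallest number of PCs covering 80%
print(cumulative, n_pcs)
```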

134
Q

What is a significant unique (good) feature of PCA?

A

Allows back-projection into the input space from the space spanned by the PCs

135
Q

Aims of PCA

A

PCA minimizes the approximation error when projecting data onto any linear subspace
Provide “natural” coordinates for the data: uncorrelated and compact

136
Q

As well as dimensionality reduction, what else can PCA be used for?

A

Noise reduction

137
Q

2 main types of classifiers

A

Generative and discriminative
Generative: Gaussian mixture model (linear regression, LDA)
Discriminative: logistic regression

Generative classifiers model the joint distribution; discriminative classifiers model the conditional distribution (or no distribution at all).

If the observed data are truly sampled from the generative model, then fitting the parameters of the generative model to maximise the data likelihood is a common method.

138
Q

p( x | C_k )

p( C_k | x )

A
p( x | C_k ) class conditional density
p( C_k | x ) posterior probability