wk56 Flashcards

1
Q

What are some common distance measures

A

Squared Euclidean Distance
Euclidean Distance
Manhattan Distance
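For reference, the standard definitions of these measures for vectors x, y ∈ R^D (the formulas are not written out on the card itself):

\[
d_{\text{sq-Euc}}(x,y)=\sum_{i=1}^{D}(x_i-y_i)^2,\qquad
d_{\text{Euc}}(x,y)=\sqrt{\sum_{i=1}^{D}(x_i-y_i)^2},\qquad
d_{\text{Man}}(x,y)=\sum_{i=1}^{D}\lvert x_i-y_i\rvert
\]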

2
Q

Describe the K-means algorithm

A

-Assign random cluster centroids in the vector space
Repeat:
-Assign each data point to its nearest cluster centroid; these assignments form the new clusters
-Compute the mean of each newly formed cluster (the new centroid)
-Update a point's assignment if it is now closer to a different cluster's centroid
-Stop after a maximum number of iterations or when assignments no longer change
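A minimal NumPy sketch of this loop (illustrative only; the (N, D) data array, the value of k and the random initialisation scheme are assumptions, not from the card):

```python
import numpy as np

def kmeans(Y, k, max_iters=100, seed=0):
    """Basic K-means: Y is an (N, D) NumPy array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct data points at random
    centroids = Y[rng.choice(len(Y), size=k, replace=False)].astype(float)
    assignments = np.full(len(Y), -1)
    for _ in range(max_iters):
        # Assign each point to its nearest centroid (squared Euclidean distance)
        dists = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # no change in assignments -> converged
        assignments = new_assignments
        # Recompute each centroid as the mean of the points assigned to it
        for j in range(k):
            if np.any(assignments == j):
                centroids[j] = Y[assignments == j].mean(axis=0)
    return assignments, centroids

labels, centres = kmeans(np.random.default_rng(1).normal(size=(200, 2)), k=3)
```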

3
Q

What are some limitations of k-means

A

-it will often converge to a local optimum, so multiple runs with different initialisations are needed to find a good solution
-there is no empirical way to choose the number of clusters k
-it struggles with very high-dimensional data and will instead require kernelisation

4
Q

Which distance measure is used in kernelised k-means

A

Squared Euclidean distance

5
Q

How does the kernel trick work

A

-It is used when calculating the distance of a data point to a cluster mean where the data is generated by some higher-dimensional structure that ordinary k-means cannot account for.
-Expanding the squared Euclidean distance measure gives an expression made up of dot products; expanding the mean into its defining equation also removes that variable in the process.
-Every dot product is then replaced with a kernel (e.g. a polynomial or Gaussian kernel). This implicitly projects the data into a higher-dimensional space.
-So the data itself is never projected into the higher-dimensional space, but the distance measures and updates are calculated as if it were.
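Concretely, writing μ_k for the mean of cluster C_k in feature space and replacing every dot product φ(a)·φ(b) with a kernel κ(a, b), the squared distance becomes (the standard kernel k-means expansion, written out here for reference):

\[
\lVert\phi(y)-\mu_k\rVert^2=\kappa(y,y)-\frac{2}{|C_k|}\sum_{n\in C_k}\kappa(y,y_n)+\frac{1}{|C_k|^2}\sum_{m\in C_k}\sum_{n\in C_k}\kappa(y_m,y_n)
\]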

6
Q

What are the advantages and limitations of kernelised k-means

A

Advantages:
-brings flexibility to k-means
-can cluster data with higher-dimensional (non-linear) structure quickly
Limitations:
-computationally expensive
-high variance (sensitive to initialisation)
-suffers from local minima, may need repeated re-runs
-makes hard assignments: each point is 100% in one cluster, with no nuanced probabilities

7
Q

What is hierarchical clustering

A

-each object starts as a singleton cluster
-the two clusters with the shortest distance between them are found and merged into a single cluster
-repeat until all objects are in one cluster, then end
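A minimal sketch of this agglomerative procedure using SciPy (the toy data and the single-link choice are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: 10 points in 2-D (illustrative only)
rng = np.random.default_rng(0)
Y = rng.normal(size=(10, 2))

# Agglomerative clustering: each point starts as its own cluster and the two
# closest clusters are repeatedly merged ('single' = single-link distance)
Z = linkage(Y, method='single')

# Cut the resulting tree into, e.g., 3 flat clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```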

8
Q

What are the 3 different ways to measure the distance between two clusters

A

Min / Single Link:
-select the two closest points, one from each cluster, and compute the distance between them
Max / Complete Link:
-find the two points furthest from one another, one from each cluster, and compute the distance between them
Average Link:
-compute the mean point (centroid) of each cluster and find the distance between the two centroids
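Written out, with d(a, b) a point-to-point distance (e.g. Euclidean), the three criteria for clusters A and B are:

\[
d_{\text{single}}(A,B)=\min_{a\in A,\,b\in B}d(a,b),\qquad
d_{\text{complete}}(A,B)=\max_{a\in A,\,b\in B}d(a,b),\qquad
d_{\text{centroid}}(A,B)=d\Bigl(\tfrac{1}{|A|}\sum_{a\in A}a,\;\tfrac{1}{|B|}\sum_{b\in B}b\Bigr)
\]

(Note: many texts use "average link" for the mean of all pairwise distances between the two clusters; the centroid form above follows the card's description.)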

9
Q

How to decide which clusters to use?

A

Usually by eye or domain knowledge: cut the dendrogram just below the first very long branch.

10
Q

Why do dimensionality reduction?

A

-Very high-dimensional data cannot be plotted or visualised directly (try imagining a 6-dimensional graph)
-Feature extraction: find which features are actually relevant to a given task
-Subspace projection (linked to the first point): projecting high-dimensional data into a lower-dimensional space

11
Q

How do we project data from a high-dimensional space into a lower-dimensional one

A

Find some eigenvectors W such that the original data Y can be projected as WY = X, where W and X are of significantly lower dimension than Y
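With shapes made explicit this reads, under one common convention (the exact layout is an assumption, since the card does not state it):

\[
Y\in\mathbb{R}^{D\times N},\qquad W\in\mathbb{R}^{d\times D},\qquad X=WY\in\mathbb{R}^{d\times N},\qquad d\ll D
\]

where each row of W is one of the chosen eigenvectors.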

12
Q

How do we choose the projection (intuitively, not formally)

A

By eye (see the diagram): choose the directions with the highest variance in the data (calculated as the spread of the points from the mean)

13
Q

How do we compute the variance of each variable in PCA

A

Var(y_i) = 1/N * Σ_{n=1..N} ( y_i^(n) − μ_i )²
where n indexes the sample and i the attribute; variance is calculated per attribute, with μ_i the mean of attribute i

14
Q

define covariance matrix

A

a matrix whose diagonal entries are the variance of each variable and whose off-diagonal entries are the covariance of each variable with respect to the others (i.e. how they vary together)
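As a formula, for data points y^(n) with mean μ (standard definition):

\[
\Sigma=\frac{1}{N}\sum_{n=1}^{N}\bigl(y^{(n)}-\mu\bigr)\bigl(y^{(n)}-\mu\bigr)^{\top}
\]

so Σ_ii is the variance of variable i and Σ_ij (i ≠ j) is the covariance between variables i and j.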

15
Q

What does each column of W represent?

A

Each column w_i represents a new dimension. W looks like W = [w_1, w_2, …, w_d] (each w_i is a column vector)

16
Q

What is the PCA Process (detail)

A

-First, we centre the data on the mean/origin: (y_i − μ_i) for each data point
-Constrain w^T w to equal 1
-We want to find Var(Yw) and argmax it over w
-Expanding out Var(Yw) we get w^T Y^T Y w, which simplifies to w^T Σ w, where Σ is the covariance matrix
-Adding the constraint with a Lagrange multiplier gives:
argmax_w ( w^T Σ w − λ (w^T w − 1) )
-Differentiating and setting to zero gives:
Σ w = λ w
-Multiplying both sides by w^T gives:
w_i^T Σ w_i = λ_i, so λ_i is both the eigenvalue and the variance along w_i
-Select the direction with the highest variance / λ / eigenvalue as PC1, the second highest as PC2, and so on
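A minimal NumPy sketch of this process (the data shape and number of components are assumptions; np.linalg.eigh is used since the covariance matrix is symmetric):

```python
import numpy as np

def pca(Y, d):
    """Project (N, D) data Y onto its top-d principal components."""
    # Centre the data on the mean
    Yc = Y - Y.mean(axis=0)
    # Covariance matrix (D x D)
    cov = (Yc.T @ Yc) / len(Yc)
    # Eigendecomposition; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Sort from largest to smallest eigenvalue (= variance along each direction)
    order = np.argsort(eigvals)[::-1][:d]
    W = eigvecs[:, order]   # columns w_1 ... w_d
    X = Yc @ W              # projected data (N x d)
    return X, W, eigvals[order]
```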

17
Q

What is PCA in simple terms

A

We project the data onto a lower-dimensional subspace, choosing the subspace so that the variance of the newly projected data is maximised
-each principal component's direction is orthogonal to the others, in order to get maximum uniqueness in the features and data

18
Q

How do you choose how many dimensions to use for PCA

A

Similar to k-means: by looking at the results and using domain knowledge; there is no real empirical rule.

19
Q

What are some limitations of principal component analysis

A

-Data must be real-valued
-Cannot handle any gaps (missing values) in the data

20
Q

Explain the process of kernelising PCA

A

Assume the data is mean-centred. Recall the equation:
-Σ w = λ w
-Expanding the covariance shows that each eigenvector can be written as a linear combination of the (mapped) data points y_n
-After further expansion, all the dot products can be kernelised, and solving for the variance reduces to an eigenproblem on the kernel matrix (as in the image)
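The standard form of that argument, shown here for reference since the card's image is not included (with C the feature-space covariance matrix): writing each eigenvector as a linear combination of the mapped data points turns the eigenproblem on C into an eigenproblem on the kernel matrix K:

\[
w=\sum_{n=1}^{N}\alpha_n\,\phi(y_n),\qquad C\,w=\lambda\,w\;\Longrightarrow\;K\,\alpha=N\lambda\,\alpha
\]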

21
Q

We have previously assumed that the data projected into feature space is mean-centred. What is the equation to centre the kernel matrix?

A

K_centred = K − 1_N K − K 1_N + 1_N K 1_N (see image)
where K is the kernel matrix and 1_N is the N×N matrix with every entry equal to 1/N

22
Q

What are the full steps for kernelised PCA

A

1) Select a kernel function
2) Compute the kernel matrix K
3) Centre the kernel matrix
4) Solve the eigenproblem for the centred kernel matrix to obtain the eigenvectors and eigenvalues
5) Sort the eigenvalues and eigenvectors from largest to smallest by eigenvalue
6) Compute the projections (aka scores) for each point (see image)
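A minimal NumPy sketch of these steps (the RBF kernel, its width gamma, and the unit-norm normalisation of the eigenvectors are assumptions for illustration):

```python
import numpy as np

def kernel_pca(Y, d, gamma=1.0):
    """Kernel PCA with an RBF kernel: Y is (N, D); returns (N, d) projections."""
    N = len(Y)
    # 1-2) Kernel matrix: K[m, n] = exp(-gamma * ||y_m - y_n||^2)
    sq_dists = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-gamma * sq_dists)
    # 3) Centre the kernel matrix: K_c = K - 1_N K - K 1_N + 1_N K 1_N
    one_n = np.full((N, N), 1.0 / N)
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # 4-5) Eigendecomposition, sorted from largest to smallest eigenvalue
    eigvals, eigvecs = np.linalg.eigh(K_c)
    order = np.argsort(eigvals)[::-1][:d]
    alphas, lambdas = eigvecs[:, order], eigvals[order]
    # Normalise the coefficients so each component has unit norm in feature space
    alphas = alphas / np.sqrt(np.maximum(lambdas, 1e-12))
    # 6) Projections (scores) of the training points onto the components
    return K_c @ alphas
```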