wk56 Flashcards
What are some common distance measures?
Squared Euclidean Distance
Euclidean Distance
Manhattan Distance
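A minimal NumPy sketch of all three measures (the example vectors are illustrative):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

sq_euclidean = np.sum((x - y) ** 2)        # squared Euclidean: sum of squared differences
euclidean = np.sqrt(sq_euclidean)          # Euclidean: square root of the above
manhattan = np.sum(np.abs(x - y))          # Manhattan: sum of absolute differences
print(sq_euclidean, euclidean, manhattan)  # 13.0 3.605... 5.0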
Describe the K-means algorithm
-Assign random initial cluster centroids in the vector space
Repeat:
-Assign each data point to the nearest cluster centroid; the resulting assignments form a new set of clusters
-Recompute each centroid as the mean of its newly formed cluster
-Reassign any data point that is now closer to a different cluster's centroid
-Repeat until max iterations or until assignments no longer change
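A minimal NumPy sketch of this loop, assuming squared Euclidean distance (the data X and value of k are whatever the caller supplies):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Assign random initial centroids (here: k distinct data points)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignments = np.full(len(X), -1)
    for _ in range(max_iters):
        # Assign each point to its nearest centroid (squared Euclidean)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # no change in assignments
        assignments = new_assignments
        # Each centroid becomes the mean of its newly formed cluster
        for c in range(k):
            if np.any(assignments == c):
                centroids[c] = X[assignments == c].mean(axis=0)
    return centroids, assignments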
What are some limitations of k-means?
-often converges to a local optimum, so multiple runs are needed to find a good solution
-no principled way to choose the number of clusters
-struggles with very high-dimensional data, which instead requires kernelisation
Which distance measure is used in kernelised k-means?
Squared Euclidean distance
How does the kernel trick work?
-Used for calculating the distance from a data point to a cluster mean when the data is generated by some higher-dimensional function that ordinary k-means cannot account for.
-We expand the squared Euclidean distance measure, which produces many dot products in the calculation; we also expand the mean into its defining equation to eliminate that variable in the process.
-Any time there is a dot product, we replace it with a kernel (e.g. a polynomial or Gaussian kernel). This implicitly projects the data into a higher-dimensional space.
-So the data itself is never projected into the higher-dimensional space, but the distance measures and updates are calculated as if it were.
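A sketch of that kernelised distance, with every dot product replaced by a kernel evaluation (the Gaussian kernel and its gamma value are illustrative assumptions):

import numpy as np

def gaussian_kernel_matrix(X, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2) = phi(x_i) . phi(x_j) in the implicit feature space
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def kernel_dist_to_cluster(K, i, cluster_idx):
    # ||phi(x_i) - mu_C||^2 expanded into three dot-product terms, each
    # computed through the kernel matrix K; mu_C itself never appears
    n_c = len(cluster_idx)
    term1 = K[i, i]                                              # phi(x_i) . phi(x_i)
    term2 = 2.0 * K[i, cluster_idx].sum() / n_c                  # phi(x_i) . mu_C
    term3 = K[np.ix_(cluster_idx, cluster_idx)].sum() / n_c**2   # mu_C . mu_C
    return term1 - term2 + term3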
What are the advantages and limitations of kernelised k-means?
Advantages:
-brings flexibility to k-means
-can cluster higher-dimensional data quickly
Limitations:
-computationally expensive
-high variance (sensitive to initialisation)
-suffers from local minima, may need repeated re-runs
-makes hard assignments: each point is 100% in some category, with no nuanced probabilities
What is hierarchical clustering?
-each object starts as a singleton cluster
-the shortest distance between two clusters is found, and those two clusters are merged into a single cluster
-when all objects are in one cluster, stop
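A minimal SciPy sketch of this bottom-up merging (the random data is illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(20, 2))

# Each point starts as a singleton; linkage repeatedly merges the two
# closest clusters until all objects sit in one cluster
Z = linkage(X, method='single')
dendrogram(Z)  # visualise the merge tree
plt.show()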
What are the 3 different ways to measure the distance between two clusters
Mean / Single Link:
-select the two closest points between two clusters and compute the distance between them
Max / Complete Link:
-Find the two furthers points away from one another in the two clusters and compute the distance between those two
Average Link:
-Compute the mean point between all the points of the clusters and find the distance between the two (computing centroids)
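A NumPy sketch of all three measures between two illustrative clusters A and B (the centroid-based version follows this card's definition of average link):

import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise distances between points of A and points of B
pairwise = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

single_link = pairwise.min()    # two closest points across the clusters
complete_link = pairwise.max()  # two furthest points across the clusters
average_link = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # distance between centroids
print(single_link, complete_link, average_link)  # 3.0 6.0 4.5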
How to decide which clusters to use?
Usually by eye or with domain knowledge: cut the dendrogram just below the first noticeably long branch
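SciPy's fcluster can perform that cut at a chosen height; the threshold 2.0 here is an illustrative assumption:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))
Z = linkage(X, method='single')
# Points merged below the threshold share a cluster label; choosing the
# threshold just below the first long branch implements the rule above
labels = fcluster(Z, t=2.0, criterion='distance')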
Why do dimensionality reduction?
-Very high-dimensional data cannot be plotted or visualised directly (try plotting a 6-dimensional graph)
-Feature extraction: find which features are actually relevant to some task
-Subspace projection (linked to the first point): projecting high-dimensional data into a lower-dimensional space
How do we project from a high-dimensional space to a low-dimensional one?
Find some eigenvectors W such that the original data Y can be projected as WY = X, where W and X have significantly lower dimension than Y
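A minimal sketch, assuming the eigenvectors come from the covariance matrix of the centred data; the card's WY = X holds when the top-d eigenvectors form the rows of W:

import numpy as np

Y = np.random.default_rng(0).normal(size=(5, 100))  # 5 attributes, 100 samples (columns)
Yc = Y - Y.mean(axis=1, keepdims=True)              # centre each attribute

eigvals, eigvecs = np.linalg.eigh(np.cov(Yc))  # eigh returns ascending eigenvalues
d = 2
W = eigvecs[:, ::-1][:, :d].T  # top-d eigenvectors as rows: W is d x 5
X = W @ Yc                     # projected data X = WY: now d x 100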
How do we choose projection (intuitively not formally)
look at the diagram, those with the highest variance in data (calculated as the distance from the mean)
How do we compute the variance of each variable in PCA
1/N * Sum over N for each n: ( y^(n) - mean^(n) )^2
Where n denotes the sample of some attribute (I), variance is calculated per attribute
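The same computation in NumPy (the toy matrix Y, with rows as attributes, is illustrative):

import numpy as np

Y = np.array([[1.0, 2.0, 3.0],    # attribute 1 over N = 3 samples
              [2.0, 4.0, 6.0]])   # attribute 2
mu = Y.mean(axis=1, keepdims=True)   # mean of each attribute
var = ((Y - mu) ** 2).mean(axis=1)   # (1/N) * sum over n of (y_i^(n) - mu_i)^2
print(var)  # [0.666..., 2.666...]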
Define the covariance matrix
A matrix holding the variance of each variable on its diagonal, and the covariance of each pair of variables (i.e. their correlation) on the off-diagonals
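NumPy's np.cov builds this matrix directly (the toy Y matches the previous sketch; bias=True uses the same 1/N normalisation):

import numpy as np

Y = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])
C = np.cov(Y, bias=True)
print(C)  # diagonal: per-attribute variances; off-diagonal: covariances
# [[0.666..., 1.333...],
#  [1.333..., 2.666...]]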
What does each column of W represent?
Each column w_i represents a new dimension. W looks like W = [w_1, w_2, ..., w_d] (each w_i is a column vector)
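A tiny check of that shape, with the eigenvectors of an illustrative 2 x 2 covariance matrix stored as the columns of W:

import numpy as np

cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
eigvals, W = np.linalg.eigh(cov)  # columns of W are the eigenvectors
w_1, w_2 = W[:, 0], W[:, 1]
print(w_1.shape, w_2.shape)  # (2,) (2,) -- each w_i is a vector, not a matrix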