week 2 Flashcards
distance metric properties
non-negativity, triangle inequality, symmetry, identity
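A quick way to internalise these four properties is to spot-check them numerically. The sketch below is my own illustration (not from the notes; looks_like_metric is a made-up helper name) that tests a candidate distance function on random points:

import numpy as np

def looks_like_metric(dist, n_points=20, dim=3, seed=0):
    # Numerically spot-check the four metric axioms on random points.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_points, dim))
    for i in range(n_points):
        assert dist(X[i], X[i]) == 0                       # identity: d(x, x) = 0
        for j in range(n_points):
            d_ij = dist(X[i], X[j])
            assert d_ij >= 0                               # non-negativity
            assert np.isclose(d_ij, dist(X[j], X[i]))      # symmetry
            for k in range(n_points):
                # triangle inequality: d(i, k) <= d(i, j) + d(j, k)
                assert dist(X[i], X[k]) <= d_ij + dist(X[j], X[k]) + 1e-9
    return True

euclidean = lambda a, b: np.linalg.norm(a - b)
print(looks_like_metric(euclidean))   # True; squared Euclidean would fail the triangle check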
distance between sets
Jaccard similarity = |intersection| / |union|
Jaccard distance = 1 - Jaccard similarity
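A minimal sketch (my own, not from the notes) of both Jaccard measures on Python sets:

def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    if not a and not b:                      # convention: two empty sets count as identical
        return 1.0
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)

print(jaccard_similarity({"the", "cat", "sat"}, {"the", "cat", "ran"}))   # 2/4 = 0.5
print(jaccard_distance({"the", "cat", "sat"}, {"the", "cat", "ran"}))     # 0.5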
distance between vectors
Euclidean - is a distance metric, used on continuous data, not for extreme
            dimensionality, magnitude matters, the straight-line distance is a
            lower bound on map/route distances (code sketch after this list)
            - avoid when dimensionality is high, data is sparse/discrete, or
              when magnitude should not matter
manhattan - sum of absolute differences per dimension (L1 / city-block distance)
chebyshev - maximum absolute difference over the dimensions (L-infinity), chess / king-move distance
hamming - counts the number of dissimilar bits
match-based - divide each dimension into k bins; a dimension only contributes to
              the similarity when both points fall in the same bin (contribution
              is 0 otherwise); lessens the effect of noise in higher dimensions
cosine - angle between 2 vectors, used in bag of words, scale invariant
but not translate invariant, depends on their direction rather
than length
ISOMAP - build a kNN graph over the data points with edge weights = distances,
         then use shortest-path distances between points in the graph;
         expensive, can leave disconnected components, sensitive to noise
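A minimal sketch (my own, using plain numpy; scipy.spatial.distance provides the same measures) of the vector distances above:

import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))     # L2 norm of the difference

def manhattan(a, b):
    return np.sum(np.abs(a - b))             # L1 / city-block distance

def chebyshev(a, b):
    return np.max(np.abs(a - b))             # L-infinity / chess distance

def hamming(a, b):
    return np.sum(a != b)                    # number of differing positions

def cosine_distance(a, b):
    # 1 - cos(angle): scale invariant, not translation invariant
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(euclidean(a, b), manhattan(a, b), chebyshev(a, b))
print(cosine_distance(a, b))                                      # ~0: same direction, different length
print(hamming(np.array([1, 0, 1, 1]), np.array([1, 1, 1, 0])))    # 2 differing bits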
distance between distributions
KL divergence - D(P || Q) = sum_x P(x) log(P(x) / Q(x)); measures the extra surprise when using Q instead of the true distribution P; not symmetric; use smoothing to avoid zero probabilities when computing it
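A minimal sketch (my own; alpha is an illustrative smoothing constant) of KL divergence with additive smoothing so that Q never assigns zero probability:

import numpy as np

def kl_divergence(p, q, alpha=1e-3):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # additive smoothing + renormalisation avoids log(0) and division by zero
    p = (p + alpha) / (p + alpha).sum()
    q = (q + alpha) / (q + alpha).sum()
    return np.sum(p * np.log(p / q))

p = [0.7, 0.2, 0.1, 0.0]
q = [0.5, 0.3, 0.0, 0.2]
print(kl_divergence(p, q), kl_divergence(q, p))   # different values: not symmetric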
distance between sequences
dynamic time warping - build a matrix of pairwise distances between points of the 2 sequences, take the cheapest path from top left to bottom right (moving only right, down, or diagonally); the total path cost = the distance; computed with dynamic programming; not a distance metric (sketch after the edit-distance card)
edit distance - minimum number of edit operations (insert, delete, substitute) to turn A into B; is a distance metric
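A minimal sketch (my own) of DTW on two 1-D sequences: fill the cost matrix by dynamic programming, allowing only right, down and diagonal steps, and return the cost of the cheapest warping path:

import numpy as np

def dtw_distance(s, t):
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])            # local distance between the two points
            D[i, j] = cost + min(D[i - 1, j],          # advance in one sequence only
                                 D[i, j - 1],          # advance in the other sequence only
                                 D[i - 1, j - 1])      # advance both sequences
    return D[n, m]

print(dtw_distance([1, 2, 3, 4, 3], [1, 1, 2, 3, 4, 4, 3]))   # 0: same shape, just warped in time
print(dtw_distance([1, 2, 3, 4, 3], [5, 5, 5, 5, 5]))         # large: genuinely different sequences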
dimensionality reduction
PCA - assumes linearity; "largest variance = most informative" is not always true; principal components are orthogonal / linearly uncorrelated, but that does not make them statistically independent (sketch after the UMAP card)
Manifold learning - for visualization, assumes data lies on a low-dimensional manifold
t-SNE - pulls neighbouring points together while pushing distant points apart => global distances are not preserved => only for visualization; uses KL divergence and gradient descent
UMAP - k-neighbour-based graph learning algorithm, build graph of high-dimensional data based on user-specified number of neighbours, then find similar graph in lower dimension
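A minimal sketch (my own) of PCA via SVD: centre the data, take the top-k right singular vectors as principal components, and project onto them:

import numpy as np

def pca(X, k):
    X_centred = X - X.mean(axis=0)                   # PCA works on centred data
    U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    components = Vt[:k]                              # orthogonal directions of maximal variance
    explained_variance = (S ** 2) / (len(X) - 1)     # variance captured by each component
    return X_centred @ components.T, components, explained_variance[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])   # correlated 2-D data
Z, components, variance = pca(X, k=1)
print(Z.shape, variance)    # (200, 1) projection plus the variance along the first component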
clustering
k-means - assign each point to its nearest centroid, recompute centroids as cluster means, repeat; minimizes WCV (SSE); needs euclidean distance because the new centroid is the mean
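A minimal sketch (my own) of Lloyd's k-means: alternate between assigning points to the nearest centroid and recomputing each centroid as the mean of its cluster, which is why Euclidean distance is required:

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)   # random initial centroids
    for _ in range(n_iter):
        # assignment step: nearest centroid (Euclidean) for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: new centroid = mean of the points assigned to it
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    sse = np.sum((X - centroids[labels]) ** 2)      # within-cluster variation (WCV / SSE)
    return labels, centroids, sse

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(6, 1, size=(50, 2))])
labels, centroids, sse = k_means(X, k=2)
print(centroids.round(2), round(float(sse), 1))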
hierarchical - agglomerative: start with every point as its own cluster and repeatedly merge the two closest clusters (the linkage defines cluster-to-cluster distance); produces a dendrogram
DBSCAN - density based; given a radius and a threshold: a point is core if the nr of points within the radius reaches the threshold, border if it does not but a core point lies within the radius, else noise
clustering has no objective function to optimise, but 2 heuristics:
maximize inter-cluster distance, minimize intra-cluster distance
external validation - compare clusters against known class labels via a confusion matrix => classification-style metrics can be used
CURE
-run an expensive clustering algorithm on a sample/batch
-choose n (spread out) representatives per cluster and move them towards the cluster centroid
-assign all remaining points to the closest representative
-needs euclidean distance (representatives are moved towards the cluster centroid)
BFR
-choose k random points to initialize the clusters
-process each new batch: assign points that fall within a threshold of an existing cluster to that cluster, update its statistics (new mean => new centroid), then discard those points
-keep only the outliers; at each iteration check whether outliers now belong to a cluster, and if so update that cluster's statistics and delete them
-needs euclidean distance (centroid reassignment)
-saves time and space because it only remembers statistics and deletes assigned points
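A minimal sketch (my own; BFR is usually described with per-cluster N, SUM and SUMSQ) of the summary statistics that let BFR discard assigned points:

import numpy as np

class ClusterSummary:
    """Per-cluster statistics kept by BFR instead of the raw points."""

    def __init__(self, dim):
        self.n = 0
        self.sum = np.zeros(dim)      # per-dimension sum of the assigned points
        self.sumsq = np.zeros(dim)    # per-dimension sum of squares

    def add(self, point):
        # update the statistics, after which the point itself can be discarded
        self.n += 1
        self.sum += point
        self.sumsq += point ** 2

    def centroid(self):
        return self.sum / self.n

    def variance(self):
        return self.sumsq / self.n - (self.sum / self.n) ** 2

rng = np.random.default_rng(0)
summary = ClusterSummary(dim=2)
for point in rng.normal(loc=3.0, scale=0.5, size=(1000, 2)):
    summary.add(point)
print(summary.centroid(), summary.variance())   # roughly [3, 3] and [0.25, 0.25]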