week 2 Flashcards
distance metric properties
non-negativity, triangle inequality, symmetry, identity
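A quick way to internalise these four properties is to spot-check them numerically. The sketch below is my own illustration (not from the notes; looks_like_metric is a made-up helper name) that tests a candidate distance function on random points:

import numpy as np

def looks_like_metric(dist, n_points=20, dim=3, seed=0):
    # Numerically spot-check the four metric axioms on random points.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_points, dim))
    for i in range(n_points):
        assert dist(X[i], X[i]) == 0                       # identity: d(x, x) = 0
        for j in range(n_points):
            d_ij = dist(X[i], X[j])
            assert d_ij >= 0                               # non-negativity
            assert np.isclose(d_ij, dist(X[j], X[i]))      # symmetry
            for k in range(n_points):
                # triangle inequality: d(i, k) <= d(i, j) + d(j, k)
                assert dist(X[i], X[k]) <= d_ij + dist(X[j], X[k]) + 1e-9
    return True

euclidean = lambda a, b: np.linalg.norm(a - b)
print(looks_like_metric(euclidean))   # True; squared Euclidean would fail the triangle check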
distance between sets
Jaccard similarity = |intersection| / |union|
Jaccard distance = 1 - Jaccard similarity
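A minimal sketch (my own, not from the notes) of both Jaccard measures on Python sets:

def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    if not a and not b:                      # convention: two empty sets count as identical
        return 1.0
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)

print(jaccard_similarity({"the", "cat", "sat"}, {"the", "cat", "ran"}))   # 2/4 = 0.5
print(jaccard_distance({"the", "cat", "sat"}, {"the", "cat", "ran"}))     # 0.5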
distance between vectors
Euclidean - is a distance metric, used on continuous data, not for extreme
            dimensionality, magnitude matters, the straight-line distance is a
            lower bound on map/route distances (code sketch after this list)
            - avoid when dimensionality is high, data is sparse/discrete, or
              when magnitude should not matter
manhattan - sum of absolute differences per dimension (L1 / city-block distance)
chebyshev - maximum absolute difference over the dimensions (L-infinity), chess / king-move distance
hamming - counts the number of dissimilar bits
match-based - divide each dimension into k bins; a dimension only contributes to
              the similarity when both points fall in the same bin (contribution
              is 0 otherwise); lessens the effect of noise in higher dimensions
cosine - angle between 2 vectors, used in bag of words, scale invariant
but not translate invariant, depends on their direction rather
than length
ISOMAP - build a kNN graph over the data points with edge weights = distances,
         then use shortest-path distances between points in the graph;
         expensive, can leave disconnected components, sensitive to noise
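A minimal sketch (my own, using plain numpy; scipy.spatial.distance provides the same measures) of the vector distances above:

import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))     # L2 norm of the difference

def manhattan(a, b):
    return np.sum(np.abs(a - b))             # L1 / city-block distance

def chebyshev(a, b):
    return np.max(np.abs(a - b))             # L-infinity / chess distance

def hamming(a, b):
    return np.sum(a != b)                    # number of differing positions

def cosine_distance(a, b):
    # 1 - cos(angle): scale invariant, not translation invariant
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(euclidean(a, b), manhattan(a, b), chebyshev(a, b))
print(cosine_distance(a, b))                                      # ~0: same direction, different length
print(hamming(np.array([1, 0, 1, 1]), np.array([1, 1, 1, 0])))    # 2 differing bits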
distance between distributions
KL divergence - D(P || Q) = sum_x P(x) log(P(x) / Q(x)); measures the extra surprise when using Q instead of the true distribution P; not symmetric; use smoothing to avoid zero probabilities when computing it
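A minimal sketch (my own; alpha is an illustrative smoothing constant) of KL divergence with additive smoothing so that Q never assigns zero probability:

import numpy as np

def kl_divergence(p, q, alpha=1e-3):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # additive smoothing + renormalisation avoids log(0) and division by zero
    p = (p + alpha) / (p + alpha).sum()
    q = (q + alpha) / (q + alpha).sum()
    return np.sum(p * np.log(p / q))

p = [0.7, 0.2, 0.1, 0.0]
q = [0.5, 0.3, 0.0, 0.2]
print(kl_divergence(p, q), kl_divergence(q, p))   # different values: not symmetric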
distance between sequences
dynamic time warping - build a matrix of pairwise distances between points of the 2 sequences, take the cheapest path from top left to bottom right (moving only right, down, or diagonally); the total path cost = the distance; computed with dynamic programming; not a distance metric (sketch after the edit-distance card)
edit distance - minimum number of edit operations (insert, delete, substitute) to turn A into B; is a distance metric
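A minimal sketch (my own) of DTW on two 1-D sequences: fill the cost matrix by dynamic programming, allowing only right, down and diagonal steps, and return the cost of the cheapest warping path:

import numpy as np

def dtw_distance(s, t):
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])            # local distance between the two points
            D[i, j] = cost + min(D[i - 1, j],          # advance in one sequence only
                                 D[i, j - 1],          # advance in the other sequence only
                                 D[i - 1, j - 1])      # advance both sequences
    return D[n, m]

print(dtw_distance([1, 2, 3, 4, 3], [1, 1, 2, 3, 4, 4, 3]))   # 0: same shape, just warped in time
print(dtw_distance([1, 2, 3, 4, 3], [5, 5, 5, 5, 5]))         # large: genuinely different sequences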
dimensionality reduction
PCA - assumes linearity; "largest variance = most informative" is not always true; principal components are orthogonal / linearly uncorrelated, but that does not make them statistically independent (sketch after the UMAP card)
Manifold learning - for visualization, assumes data lies on a low-dimensional manifold
t-SNE - pulls neighbouring points together while pushing distant points apart => global distances are not preserved => only for visualization; uses KL divergence and gradient descent
UMAP - k-neighbour-based graph learning algorithm, build graph of high-dimensional data based on user-specified number of neighbours, then find similar graph in lower dimension
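A minimal sketch (my own) of PCA via SVD: centre the data, take the top-k right singular vectors as principal components, and project onto them:

import numpy as np

def pca(X, k):
    X_centred = X - X.mean(axis=0)                   # PCA works on centred data
    U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    components = Vt[:k]                              # orthogonal directions of maximal variance
    explained_variance = (S ** 2) / (len(X) - 1)     # variance captured by each component
    return X_centred @ components.T, components, explained_variance[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])   # correlated 2-D data
Z, components, variance = pca(X, k=1)
print(Z.shape, variance)    # (200, 1) projection plus the variance along the first component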
clustering
k-means - assign each point to its nearest centroid, recompute centroids as cluster means, repeat; minimizes WCV (SSE); needs euclidean distance because the new centroid is the mean
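A minimal sketch (my own) of Lloyd's k-means: alternate between assigning points to the nearest centroid and recomputing each centroid as the mean of its cluster, which is why Euclidean distance is required:

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)   # random initial centroids
    for _ in range(n_iter):
        # assignment step: nearest centroid (Euclidean) for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: new centroid = mean of the points assigned to it
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    sse = np.sum((X - centroids[labels]) ** 2)      # within-cluster variation (WCV / SSE)
    return labels, centroids, sse

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(6, 1, size=(50, 2))])
labels, centroids, sse = k_means(X, k=2)
print(centroids.round(2), round(float(sse), 1))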
hierarchical - agglomerative: start with every point as its own cluster and repeatedly merge the two closest clusters (the linkage defines cluster-to-cluster distance); produces a dendrogram
DBSCAN - density based; given a radius and a threshold: a point is core if the nr of points within the radius reaches the threshold, border if it does not but a core point lies within the radius, else noise
clustering has no objective function to optimise, but 2 heuristics:
maximize inter-cluster distance, minimize intra-cluster distance
external validation - compare clusters against known class labels via a confusion matrix => classification-style metrics can be used
CURE
-run an expensive clustering algorithm on a sample/batch
-choose n (spread out) representatives per cluster and move them towards the cluster centroid
-assign all remaining points to the closest representative
-needs euclidean distance (representatives are moved towards the cluster centroid)
BFR
-choose k random points to initialize the clusters
-process each new batch: assign points that fall within a threshold of an existing cluster to that cluster, update its statistics (new mean => new centroid), then discard those points
-keep only the outliers; at each iteration check whether outliers now belong to a cluster, and if so update that cluster's statistics and delete them
-needs euclidean distance (centroid reassignment)
-saves time and space because it only remembers statistics and deletes assigned points
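A minimal sketch (my own; BFR is usually described with per-cluster N, SUM and SUMSQ) of the summary statistics that let BFR discard assigned points:

import numpy as np

class ClusterSummary:
    """Per-cluster statistics kept by BFR instead of the raw points."""

    def __init__(self, dim):
        self.n = 0
        self.sum = np.zeros(dim)      # per-dimension sum of the assigned points
        self.sumsq = np.zeros(dim)    # per-dimension sum of squares

    def add(self, point):
        # update the statistics, after which the point itself can be discarded
        self.n += 1
        self.sum += point
        self.sumsq += point ** 2

    def centroid(self):
        return self.sum / self.n

    def variance(self):
        return self.sumsq / self.n - (self.sum / self.n) ** 2

rng = np.random.default_rng(0)
summary = ClusterSummary(dim=2)
for point in rng.normal(loc=3.0, scale=0.5, size=(1000, 2)):
    summary.add(point)
print(summary.centroid(), summary.variance())   # roughly [3, 3] and [0.25, 0.25]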