Anomaly Detection + Distances Flashcards
What is DTW and how does it work?
Dynamic Time Warping aligns two time-series non-linearly in time, stretching or compressing one against the other to find the best match.
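The alignment can be sketched with the standard dynamic-programming recurrence (a minimal pure-Python illustration, not an optimized implementation):

```python
def dtw(a, b):
    """Return the DTW distance between two 1-D sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j] = cost of the best alignment of a[:i] and b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # a step may match both, stretch a, or stretch b
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]
```

Note that a series compared with a time-stretched copy of itself gets distance 0, which is exactly what the non-linear alignment buys you over Euclidean distance.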
How can you create an Isolation Forest?
Repeat N times (one tree per repetition):
* Randomly pick a feature (dimension) f
* Split f at a random value between [min, max]
* Continue until all leaves contain singletons
* The path length to reach a leaf is that tree's isolation score
* Average this length over all trees to get the anomaly score (shorter paths = more anomalous)
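The steps above can be sketched as follows (a toy illustration tracing one random tree path per repetition, not the full scikit-learn `IsolationForest`):

```python
import random

def path_length(x, data, depth=0, max_depth=10):
    """Trace x down one random isolation tree grown on `data`; return the isolation depth."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    f = random.randrange(len(x))                    # randomly pick a feature
    lo = min(p[f] for p in data)
    hi = max(p[f] for p in data)
    if lo == hi:
        return depth
    split = random.uniform(lo, hi)                  # random split in [min, max]
    # keep only the points that fall on the same side of the split as x
    side = [p for p in data if (p[f] < split) == (x[f] < split)]
    return path_length(x, side, depth + 1, max_depth)

def anomaly_score(x, data, n_trees=100):
    """Average isolation depth over many trees; smaller = more anomalous."""
    return sum(path_length(x, data) for _ in range(n_trees)) / n_trees
```

An extreme point leaves a big gap in the feature range, so a random split isolates it after only a few levels; dense inliers need many more splits.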
What are the two main ways to approach anomaly detection?
- Distance-based: a point is anomalous when it is far from other points
- Density-based: a point is anomalous when it is in a low density region
How to compute local reachability density (lrd)?
lrd = k / (sum of reachability distances from the point to its k nearest neighbours)
How to compute local outlier factor (lof)?
lof = (1/k) * (sum of lrd of the neighbours) / (lrd of the point)
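Both formulas can be sketched directly (a small brute-force illustration; real implementations use spatial indexes):

```python
import math

def dist(a, b):
    return math.dist(a, b)

def knn(p, points, k):
    """Indices of the k nearest neighbours of points[p] (excluding itself)."""
    d = sorted((dist(points[p], points[q]), q) for q in range(len(points)) if q != p)
    return [q for _, q in d[:k]]

def k_distance(p, points, k):
    """Distance to the k-th nearest neighbour."""
    return dist(points[p], points[knn(p, points, k)[-1]])

def lrd(p, points, k):
    """Local reachability density: k / sum of reachability distances to the k-NN,
    where reach-dist(p, o) = max(k-distance(o), d(p, o))."""
    total = sum(max(k_distance(o, points, k), dist(points[p], points[o]))
                for o in knn(p, points, k))
    return k / total

def lof(p, points, k):
    """Local outlier factor: mean neighbour lrd divided by own lrd."""
    return sum(lrd(o, points, k) for o in knn(p, points, k)) / (k * lrd(p, points, k))
```

A point inside a uniform region gets lof ≈ 1; an isolated point gets lof well above 1.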
What is a residual?
The difference between the expected (predicted) value and the observed value.
What is PCA
Principal Component Analysis reduces the dimension of data by finding the principal components (orthogonal directions) that capture most of the variance in the data.
Why is PCA useful for anomaly detection?
The components it finds should explain most of the variance, so along the smallest components normal data should be nearly constant. Data points that vary in those unexplained dimensions are anomalous.
How does dimensionality reduction find outliers in general?
Abnormal data maps to an abnormal low-dimensional code, so when reconstructing from that code the reconstruction error is large, giving us a clearer view of the anomalies.
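For PCA this reconstruction-error idea can be sketched with NumPy (a minimal illustration via SVD; `n_components` and the error norm are choices, not the only ones):

```python
import numpy as np

def pca_recon_error(X, n_components):
    """Project onto the top principal components and measure per-point reconstruction error."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # rows of Vt are the principal directions, ordered by explained variance
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components]                      # keep only the top components
    recon = Xc @ V.T @ V + mu                  # project down, then back up
    return np.linalg.norm(X - recon, axis=1)   # large error = likely anomaly
```

Points lying on the main directions reconstruct almost perfectly; the point off that subspace stands out with the largest error.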
What is edit distance?
The minimum number of edits (insertions, deletions, substitutions) required to turn one series into the other.
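The classic dynamic-programming formulation (Levenshtein distance) looks like this, as a plain illustration:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    n, m = len(a), len(b)
    # D[i][j] = edit distance between a[:i] and b[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                            # delete everything
    for j in range(m + 1):
        D[0][j] = j                            # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,     # delete from a
                          D[i][j - 1] + 1,     # insert into a
                          D[i - 1][j - 1] + sub)  # substitute (or free match)
    return D[n][m]
```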
What are the properties of a distance metric?
d(a,b) >= 0 (non-negativity)
d(a,b) = 0 iff a == b (identity of indiscernibles)
d(a,b) = d(b,a) (symmetry)
d(a,c) <= d(a,b) + d(b,c) (triangle inequality)
How does K-means work?
Randomly choose cluster centers, one per desired cluster.
Assign each observation to the cluster with the closest center, then recompute each center as the centroid of its assigned observations.
Iterate until the cluster assignments stop changing.
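The loop above (Lloyd's algorithm) can be sketched with NumPy; the initialization and stopping test are simplifications:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """K-means sketch: assign each point to the nearest center, recompute centroids, repeat."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # assign each observation to its nearest center
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # recompute each center as the centroid of its cluster (keep old center if empty)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):   # centers stopped moving => assignments stable
            break
        centers = new
    return labels, centers
```

Because the result depends on the random initial centers, real use typically restarts several times and keeps the lowest-inertia run.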
How does hierarchical clustering work?
Treat each observation as a cluster.
Find the closest two clusters and merge. Repeat until all points belong to one cluster.
Create dendrogram -> remove links from top until you have as many clusters as you want.
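A naive agglomerative sketch (single linkage, quadratic-time brute force; libraries like SciPy do this far more efficiently):

```python
import math

def single_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]          # every observation starts as its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest pair of points
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)        # merge the closest pair of clusters
    return clusters
```

Stopping the merging at `n_clusters` plays the same role as cutting the dendrogram at the desired number of clusters.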
What are the types of cluster distance?
Single linkage: distance between closest points from the different clusters
Complete linkage: distance between the furthest points in the two clusters
Average linkage: average distance between all pairs of points.
Ward distance: the increase in total within-cluster sum of squares caused by merging, i.e. the merged cluster's within-cluster SS minus the sum of the two clusters' separate within-cluster SS
Does hierarchical clustering need the distance to be a metric?
Single, complete, and average linkage only require the distance to be symmetric and non-negative.
Ward and centroid-based require euclidean distance since they minimize squared error.