Anomaly Detection + Distances Flashcards
What is DTW and how does it work?
Dynamic Time Warping aligns two time series non-linearly in time (stretching or compressing one relative to the other) and returns the cost of the best-matching alignment.
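A minimal sketch of the standard DTW recurrence (the function name and toy sequences are illustrative, not from the cards):

```python
# DTW sketch: D[i][j] = |a_i - b_j| + min(stretch a, stretch b, align both)
def dtw_distance(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch a
                                 D[i][j - 1],      # stretch b
                                 D[i - 1][j - 1])  # align both
    return D[n][m]

print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # 0.0: the sequences align perfectly
```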
How can you create an Isolation Forest?
Repeat N times (one random tree per repetition):
* Randomly pick a feature (dimension) f
* Split on f at a value chosen uniformly at random between [min, max]
* Continue recursively until every leaf contains a single point
* The path length needed to isolate a point is its isolation score
Average this path length over all trees to get the anomaly score: anomalous points are isolated quickly, so a shorter average path means more anomalous.
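A hedged example using scikit-learn's IsolationForest (assuming scikit-learn is available; the data and parameters are only illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 2))           # normal points
X = np.vstack([X, [[8.0, 8.0]]])              # one obvious outlier

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = forest.decision_function(X)          # lower score = shorter average path = more anomalous
print(scores.argmin())                        # 200: the injected outlier
```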
What are the two main ways to approach anomaly detection?
- Distance-based: a point is anomalous when it is far from other points
- Density-based: a point is anomalous when it is in a low density region
How to compute local reachability density (lrd)?
lrd(p) = k / (sum of reachability distances from p to its k nearest neighbours)
How to compute local outlier factor (lof)?
lof(p) = (1/k) * (sum of lrd of the neighbours) / lrd(p); values much larger than 1 indicate an outlier.
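A small usage sketch with scikit-learn's LocalOutlierFactor (assuming scikit-learn; it reports the negated LOF, so the most negative value is the strongest outlier):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[0.0], [0.1], [0.2], [0.3], [5.0]])   # the last point is isolated
lof = LocalOutlierFactor(n_neighbors=2)
lof.fit_predict(X)                                   # -1 marks outliers, 1 marks inliers
print(lof.negative_outlier_factor_)                  # the isolated point has the most negative value
```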
What is a residual?
The difference between the expected and real value
What is PCA?
Principal Component Analysis reduces the dimension of data by finding the principal vectors that capture most of the data’s patterns
Why is PCA useful for anomaly detection?
The principal components it finds should explain most of the variance, so normal points lie close to the subspace they span.
Outliers show variability along the smallest components (which should be nearly constant), i.e. datapoints that vary in the unexplained dimensions are anomalous.
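An illustrative sketch of PCA-based anomaly scoring via reconstruction error (scikit-learn assumed; the synthetic data is only for demonstration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 1)) @ np.array([[1.0, 2.0, 3.0]])   # data lies on a 1-D line in 3-D
X = np.vstack([X, [[0.0, 5.0, -5.0]]])                        # one point off the principal subspace

pca = PCA(n_components=1).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))
residual = np.linalg.norm(X - X_hat, axis=1)                  # reconstruction error per point
print(residual.argmax())                                      # 300: the off-subspace point
```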
How does dimensionality reduction find outliers in general?
Abnormal data maps to an abnormal low-dimensional code, so it reconstructs poorly; the reconstruction error gives a clearer view of the anomalies.
What is edit distance?
The minimum number of changes (insertions, deletions, substitutions) required to turn one series into the other.
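A minimal Levenshtein-style sketch of edit distance (the helper name is illustrative):

```python
# D[i][j] = edits needed to turn s[:i] into t[:j]
def edit_distance(s, t):
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                               # delete everything
    for j in range(m + 1):
        D[0][j] = j                               # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution / match
    return D[n][m]

print(edit_distance("kitten", "sitting"))  # 3
```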
What are the properties of a distance metric?
Non-negativity: d(a,b) >= 0
Identity: d(a,b) = 0 iff a == b
Symmetry: d(a,b) = d(b,a)
Triangle inequality: d(a,c) <= d(a,b) + d(b,c)
How does K-means work?
Randomly choose k initial cluster centers (one per desired cluster).
Assign each observation to the cluster with the closest center, then recompute each center as the centroid (mean) of its assigned observations.
Iterate until the cluster assignments stop changing.
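A bare-bones Lloyd's-algorithm sketch of k-means with NumPy (illustrative only; it assumes every cluster keeps at least one point):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(n_iter):
        # assign each point to its closest center
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                # assignments stopped changing
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers = kmeans(X, k=2)
```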
How does hierarchical clustering work?
Treat each observation as a cluster.
Find the closest two clusters and merge. Repeat until all points belong to one cluster.
Create dendrogram -> remove links from top until you have as many clusters as you want.
What are the types of cluster distance?
Single linkage: distance between closest points from the different clusters
Complete linkage: distance between the furthest points in the clusters
Average linkage: average distance between all pairs of points
Ward distance: the increase in total within-cluster sum of squares caused by merging the two clusters, i.e. the SS of the merged cluster minus the sum of the two clusters' separate within-cluster SS
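A short sketch with SciPy's hierarchical clustering routines, showing where the linkage method plugs in (SciPy assumed; the data is a toy example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]], dtype=float)
Z = linkage(X, method="ward")                      # also: "single", "complete", "average"
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram into 3 clusters
print(labels)
```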
Does hierarchical clustering need the distance to be a metric?
Single, complete, and average linkage only require symmetry and non-negativity.
Ward and centroid-based linkage require Euclidean distance, since they minimise squared error.
How does DBSCAN work?
Classify points as Core, Border, or Noise, based on a radius ε and a minimum number of neighbours t.
Core: the number of points within distance ε is >= t
Border: the number of points within distance ε is < t, but there is a core point within ε
Noise: neither core nor border
DBSCAN only requires the distance to be symmetric and non-negative.
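A small usage sketch with scikit-learn's DBSCAN (assumed available); the toy points are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [9, 9]], dtype=float)
db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)   # cluster ids; -1 marks noise (here the point at [9, 9])
```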
How does graph-based (spectral) clustering work?
Construct a neighbourhood graph G by connecting points whose distance is below a threshold (or each point to its nearest neighbours), and set the edge weights from the distances. Find communities in G using graph mining and return the found communities as clusters.
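A hedged sketch using scikit-learn's SpectralClustering on a k-NN graph (scikit-learn assumed; the two-moons data is a standard toy example):

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)   # two interleaving half-circles
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0)
labels = sc.fit_predict(X)   # communities in the k-NN graph become the clusters
```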
When is a clustering considered good in terms of inter and intra - cluster distances?
When the intra-cluster distances are small relative to the inter-cluster distances, i.e. the intra/inter ratio is small.
What is CURE?
Clustering Using REpresentatives is useful for non-centroid clustering. Since it is very expensive, we select a few points to represent each cluster (prototyping).
What does BFR do?
- aims to reduce the time and space required by k-means (for data too large to fit in memory)
- points can be assigned to three sets:
> discard set - assigned to a pre-existing cluster
> compressed set - mini-clusters
> retained set - outliers, points that do not belong to any cluster
What do we require for a cluster in BFR?
- N - the number of data points in the cluster
- SUM - a vector containing the sums of all feature values
- SUMSQ - a vector with sums of squared feature values
With these we can keep track of the centroid of the cluster, and compute a threshold for cluster membership.
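A tiny sketch of the (N, SUM, SUMSQ) summary and what it recovers; the points are an assumed toy cluster:

```python
import numpy as np

points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # toy cluster

N = len(points)
SUM = points.sum(axis=0)
SUMSQ = (points ** 2).sum(axis=0)

centroid = SUM / N
variance = SUMSQ / N - (SUM / N) ** 2   # per-dimension variance from the summary alone
print(centroid, variance)               # enough to set a membership threshold per dimension
```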
What are the 3 types of anomalies?
Point: a single datapoint is anomalous w.r.t. the rest of the data.
Contextual: detectable only depending on context. Using sliding windows and the assumption that the next datapoint should be close to the previous one, large leaps are considered anomalous (see the sketch after this card).
Collective: the individual instances within the collective anomaly are not anomalous by themselves; they require a relationship among data instances (sequential, spatial, graph). Also detected with sliding windows.
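An illustrative sliding-window sketch for contextual anomalies; the window size, threshold, and helper name are assumptions, not part of the cards:

```python
import numpy as np

# Flag points that leap far from the mean of the preceding window.
def contextual_anomalies(x, window=5, threshold=3.0):
    x = np.asarray(x, dtype=float)
    flags = np.zeros(len(x), dtype=bool)
    for i in range(window, len(x)):
        w = x[i - window:i]
        if abs(x[i] - w.mean()) > threshold * (w.std() + 1e-9):
            flags[i] = True
    return flags

series = [1, 1.1, 0.9, 1.0, 1.05, 0.95, 9.0, 1.0, 1.1]
print(contextual_anomalies(series).nonzero()[0])   # flags the jump to 9.0
```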
What is hamming distance?
The number of positions at which two equal-length bit strings differ.
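A one-line sketch of Hamming distance over equal-length bit strings (helper name illustrative):

```python
def hamming(a, b):
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))   # count differing positions

print(hamming("10110", "10011"))  # 2
```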
What do silhouette values of 1 and -1 indicate?
A value near 1 means the point is well matched to its own cluster (good clustering); a value near -1 means it fits a neighbouring cluster better (mixed/poor clustering).
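A usage sketch computing the silhouette with scikit-learn (assumed available); well-separated blobs give a score near 1:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))   # close to 1: well-separated clusters
```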