Anomoly Detection Flashcards
What are anomalies/outliers?
The set of data points that are very different than the remainder of the data
What is the task for looking for anomolies/outliers?
Find all the data points with anomaly scores greater than threshold that you have defined.
What are some applications for Anomaly/Outlier detection?
Fraud detection
Are outliers different from noise data?
Yes. Noise is random error.
Noise should be removed before outlier detection.
Outliers are interesting.
What’s the difference between outlier detection vs novelty detection?
Novelty is eventually
What is a challenge of anomaly detection?
Anomaly detection is unsupervised (like Clustering).
How do you build an anomaly detection?
Build a profile of what is normal and then detect anything that is different
What kind of outliers are there?
Global Outliers
Contextual Outliers
Collective Outliers
What is a global outlier?
A point that significantly deviates from the rest of the data set.
Issue: You need a measurement of how you measure this
What is a contextual outlier?
An outlier that deviates significantly based on selected context
E.g Is 40 degrees Celsius an outlier? In winter, yes. In summer, no.
What are collective outliers?
Every object doesn’t look like an outlier but when you bring many objects together, it starts to look like an outlier.
Example: Sports/team: A good player Neymar is just like Messi or Ronaldo. But when you put them together with a good team they become an anomaly.
What is a statistical schemes?
The objects are generated by a model.
Identify objects in low probability regions of the model as outliers.
Two types: Parametric/Non-parametric
What is parametric model?
A model that describes the distribution of the data
If something in the model has low probability, then it is an outlier.
Find the mean and the standard deviation.
Check each the difference from the average. If it is greater than a threshold, then it is an anomaly.
What is a limitation of a parametric scheme?
Not always a normal distribution
Can be problematic for high dimensional data
What do you use to model a non-parametric scheme?
A histogram
How do we interpret a histogram for determining anomaly detection?
The ‘long tail’ part of the histogram is considered the anomaly area of the model.
What is a problem with using a histogram for analysis of anomaly detection?
How to set the number of buckets (x-axis) to effectively capture the data
What is a false positive in anomaly detection?
An anomaly that is in fact that an anomaly. (Our histogram is too detailed).
What are the two methods for detecting proximity based outliers?
- Distance-based
2. Density-based
What is distance based outlier?
An object is considered a distance based outlier if it’s neighbourhood doesn’t have enough other points.
What is density based outlier?
An object is considered a density-based outlier if its density is relatively much lower than it’s neighbours
What is LOF method for finding density based outlier?
General idea: For each point, calculate the density of it’s neighbourhood.
Compute: Local Outlier Factor: it’s the average of the ratio of density of the sample p and the density of it’s nearest neighbour
Outliers are the points with low LOF.
How do we measure density?
Density = k / distance to the k-nearest neighbours, or compare with the set of N - nearest neighbours
Can you get a different result for density/distance based outliers?
Yes
If you have clustering, how do you determine if there are outliers?
It doesn’t belong to a cluster.
There is a large distance between an object and it’s cluster.
It belongs to a very small or sparse cluster
What is Case 1: Far from closest cluster way of testing outliers?
Use k-means and build clusters, get an outlier (measure the distance to its closest centre. If it’s distance is higher than average then it is likely an outlier
What is Case 2: Outliers in small clusters way of testing outliers?
Assign a cluster-based local outlier factor.
If p belongs to a large cluster: CBLOF = cluster size * similarity between P and Cluster
If p belongs to a small cluster: CBLOF = cluster size * similarity between p and the closest large cluster
LOW CBLOF scores are suspected outliers`
What is a limitation of a cluster-based method?
High computational cost