Anomaly Detection Flashcards
What are anomalies/outliers?
The set of data points that differ markedly from the rest of the data
What is the task when looking for anomalies/outliers?
Find all the data points whose anomaly scores are greater than a threshold that you have defined.
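The thresholding task above can be sketched in a few lines. This is a minimal illustration: the function name flag_anomalies, the scores, and the threshold value are all hypothetical, and how the scores are computed is left open.

```python
def flag_anomalies(scores, threshold):
    """Return the indices of points whose anomaly score exceeds the threshold."""
    return [i for i, score in enumerate(scores) if score > threshold]


# Hypothetical anomaly scores, one per data point.
scores = [0.1, 0.3, 2.7, 0.2, 3.1]
flagged = flag_anomalies(scores, threshold=2.0)
print(flagged)
```

The threshold is a choice made by the analyst: lowering it flags more points, raising it flags fewer.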
What are some applications for Anomaly/Outlier detection?
Fraud detection
Are outliers different from noise data?
Yes. Noise is random error.
Noise should be removed before outlier detection.
Outliers are interesting.
What’s the difference between outlier detection and novelty detection?
Novelty is eventually
What is a challenge of anomaly detection?
Anomaly detection is usually unsupervised (like clustering), so there are no labels to check the results against.
How do you build an anomaly detector?
Build a profile of what is normal, then flag anything that differs from it.
What kind of outliers are there?
Global Outliers
Contextual Outliers
Collective Outliers
What is a global outlier?
A point that significantly deviates from the rest of the data set.
Issue: You need to define a measure of how far a point deviates.
What is a contextual outlier?
An outlier that deviates significantly within a selected context
E.g., is 40 degrees Celsius an outlier? In winter, yes. In summer, no.
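The temperature example can be sketched by scoring each reading only against other readings from the same context. This is one possible approach, not the only one: the function name contextual_outliers, the sample readings, and the z-score threshold of 2.0 are all assumptions made for illustration.

```python
from statistics import mean, stdev


def contextual_outliers(readings, z_threshold=2.0):
    """Flag (context, value) pairs that deviate strongly within their own context.

    readings is a list of (context, value) pairs, e.g. ("winter", 3).
    Each context needs at least two readings for stdev to be defined.
    """
    by_context = {}
    for context, value in readings:
        by_context.setdefault(context, []).append(value)

    flagged = []
    for context, value in readings:
        values = by_context[context]
        mu, sigma = mean(values), stdev(values)
        if sigma > 0 and abs(value - mu) / sigma > z_threshold:
            flagged.append((context, value))
    return flagged


# Hypothetical temperatures in degrees Celsius.
readings = [("winter", t) for t in [2, 4, 1, 3, 0, 5, 2, 40]]
readings += [("summer", t) for t in [38, 41, 39, 40, 37, 42, 40]]
flagged = contextual_outliers(readings)
print(flagged)
```

With this data only the winter reading of 40 is flagged; the same value among the summer readings is unremarkable.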
What are collective outliers?
No single object looks like an outlier on its own, but when you bring many objects together, the group starts to look like an outlier.
Example (sports): Taken individually, a good player such as Neymar looks much like Messi or Ronaldo. But when you put them together on one good team, the group becomes an anomaly.
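A collective outlier can also be illustrated on a sequence: each event is ordinary on its own, but an unusually dense run of them stands out. The sketch below uses a sliding window; the function name collective_outlier_windows, the event data, the window size, and the limit are hypothetical choices for illustration.

```python
def collective_outlier_windows(events, window, max_count):
    """Return start indices of windows whose event total exceeds max_count."""
    flagged = []
    for start in range(len(events) - window + 1):
        if sum(events[start:start + window]) > max_count:
            flagged.append(start)
    return flagged


# Hypothetical event stream: 1 = event occurred, 0 = no event.
# Isolated 1s are common, so no single event is an outlier.
events = [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0]
flagged = collective_outlier_windows(events, window=4, max_count=2)
print(flagged)
```

Only the windows that overlap the run of four consecutive events are flagged, even though every individual event looks normal.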
What is a statistical scheme?
Assume the objects are generated by a statistical model.
Identify objects in low probability regions of the model as outliers.
Two types: Parametric/Non-parametric
What is a parametric model?
A model that describes the distribution of the data with a fixed set of parameters
If a point has low probability under the model, then it is an outlier.
Example with a normal distribution: find the mean and the standard deviation.
Check each point's difference from the mean. If it is greater than a threshold, then the point is an anomaly.
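The mean and standard deviation check above can be sketched as a z-score test. This is a minimal sketch assuming roughly normal data; the function name zscore_outliers, the sample values, and the threshold of 2.0 standard deviations are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [x for x in values if abs(x - mu) / sigma > threshold]


# Hypothetical measurements with one extreme value.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 30]
outliers = zscore_outliers(values, threshold=2.0)
print(outliers)
```

In small samples a single extreme value inflates the standard deviation it is measured against, so a lower threshold is used here than would be typical for a large data set.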
What is a limitation of a parametric scheme?
The data does not always follow a normal distribution
Can be problematic for high-dimensional data
What do you use to model a non-parametric scheme?
A histogram
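A histogram-based check can be sketched as follows: bin the values, then flag any value that falls in a sparsely populated bin. The function name histogram_outliers, the sample data, the bin width, and the minimum fraction are all hypothetical choices for illustration.

```python
from collections import Counter


def histogram_outliers(values, bin_width, min_fraction):
    """Flag values that fall in bins holding less than min_fraction of all points."""
    def bin_of(v):
        return int(v // bin_width)

    counts = Counter(bin_of(v) for v in values)
    total = len(values)
    return [v for v in values if counts[bin_of(v)] / total < min_fraction]


# Hypothetical one-dimensional data.
values = [12, 14, 15, 15, 16, 17, 18, 21, 22, 23, 24, 58]
outliers = histogram_outliers(values, bin_width=10, min_fraction=0.1)
print(outliers)
```

The result depends on the bin width: bins that are too narrow make normal points look rare, while bins that are too wide can hide outliers among normal points.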