Anomaly Detection Flashcards by Ben Rodgers

What are anomalies/outliers?

The set of data points that are considerably different from the remainder of the data.

What is the task when looking for anomalies/outliers?

Find all the data points whose anomaly score is greater than a threshold that you have defined.
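
The thresholding step can be sketched in a few lines of Python. This is a minimal illustration only: the scores, the threshold value and the function name `flag_anomalies` are assumptions made for the example, and how the scores are produced is left to whichever detection method is used.

```python
def flag_anomalies(scores, threshold):
    """Return the indices of points whose anomaly score exceeds the threshold."""
    return [i for i, score in enumerate(scores) if score > threshold]


# Hypothetical anomaly scores for six data points (higher = more anomalous).
scores = [0.10, 0.35, 0.92, 0.20, 0.88, 0.05]

anomalies = flag_anomalies(scores, threshold=0.8)
print(anomalies)  # [2, 4]
```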

What are some applications for Anomaly/Outlier detection?

Fraud detection

Are outliers different from noise data?

Yes. Noise is random error.
Noise should be removed before outlier detection.
Outliers are interesting.

What’s the difference between outlier detection and novelty detection?

Novelty is eventually

What is a challenge of anomaly detection?

Anomaly detection is usually unsupervised (like clustering), so there are no labels to learn from or validate against.

How do you build an anomaly detector?

Build a profile of what is normal and then flag anything that is different from it.
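
A rough sketch of the "profile, then compare" idea, assuming a very simple profile: the observed range of known-normal values, widened by a margin. The data, the margin and the helper names `build_profile` and `is_different` are hypothetical; real profiles are usually statistical or learned models.

```python
def build_profile(normal_values, margin=0.1):
    """Profile of 'normal': the observed range, padded by a fraction of its width."""
    lo, hi = min(normal_values), max(normal_values)
    pad = (hi - lo) * margin
    return lo - pad, hi + pad


def is_different(value, profile):
    """A value is flagged when it falls outside the normal range."""
    lo, hi = profile
    return value < lo or value > hi


# Hypothetical response times (ms) recorded during normal operation.
normal_response_ms = [102, 98, 110, 95, 105, 99, 101]
profile = build_profile(normal_response_ms)

# New observations to check against the profile.
new_values = [100, 97, 250, 40]
flagged = [v for v in new_values if is_different(v, profile)]
print(flagged)  # [250, 40]
```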

What kind of outliers are there?

Global Outliers
Contextual Outliers
Collective Outliers

What is a global outlier?

A point that significantly deviates from the rest of the data set.

Issue: You need an appropriate way to measure the deviation.

What is a contextual outlier?

An object that deviates significantly with respect to a selected context.

E.g. Is 40 degrees Celsius an outlier? In winter, yes. In summer, no.
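
The temperature example can be sketched by scoring each reading only against other readings from the same context. This is an illustrative sketch: the readings, the z-score rule, the threshold of 2 and the function name `contextual_outliers` are all assumptions, and each context needs at least two readings for the standard deviation to be defined.

```python
from collections import defaultdict
from statistics import mean, stdev


def contextual_outliers(records, z_threshold=2.0):
    """Flag (context, value) pairs that deviate strongly within their own context."""
    groups = defaultdict(list)
    for context, value in records:
        groups[context].append(value)

    # Mean and sample standard deviation per context.
    stats = {c: (mean(v), stdev(v)) for c, v in groups.items()}

    flagged = []
    for context, value in records:
        mu, sigma = stats[context]
        if sigma > 0 and abs(value - mu) / sigma > z_threshold:
            flagged.append((context, value))
    return flagged


# Hypothetical temperature readings in degrees Celsius, tagged with a season.
winter = [("winter", t) for t in [2, 5, 3, 4, 1, 6, 0, 3, 40]]
summer = [("summer", t) for t in [38, 40, 36, 39, 41, 37, 40, 38]]

flagged = contextual_outliers(winter + summer)
print(flagged)  # [('winter', 40)]
```

The same value of 40 appears in both seasons, but only the winter reading is flagged because the comparison is made within each context.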

What are collective outliers?

No single object looks like an outlier on its own, but when you bring many objects together, the group as a whole looks like an outlier.

Example (sports team): A good player such as Neymar is comparable to Messi or Ronaldo, so none of them is an outlier individually. Put them together on one good team, however, and the group becomes an anomaly.

What are statistical schemes?

Assume the objects are generated by a statistical model.

Identify objects in low probability regions of the model as outliers.

Two types: Parametric/Non-parametric

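What is a parametric model?

A model that assumes the data follows a known distribution described by a few parameters (e.g. a normal distribution with a mean and a standard deviation).

If an object has low probability under the model, then it is an outlier.

Find the mean and the standard deviation.
Check each object's difference from the mean. If it is greater than a threshold (e.g. a chosen number of standard deviations), then it is an anomaly.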
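
The mean and standard deviation check can be sketched as follows. This is a minimal example assuming roughly normal, one-dimensional data; the sample values, the threshold of 3 standard deviations and the function name `zscore_outliers` are illustrative choices, not part of the original card.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    return [x for x in values if abs(x - mu) > threshold * sigma]


# Hypothetical measurements: 19 values near 50 and one extreme value.
values = [50, 52, 49, 51, 50, 48, 53, 50, 49, 51,
          50, 52, 48, 51, 50, 49, 50, 51, 52, 95]

print(zscore_outliers(values))  # [95]
```

Note that the extreme value itself inflates the mean and standard deviation, so with very small samples a true outlier may not exceed the threshold.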

What is a limitation of a parametric scheme?

The data does not always follow a normal distribution.

Can be problematic for high dimensional data

What do you use to model a non-parametric scheme?

A histogram

How do we interpret a histogram for anomaly detection?

The ‘long tail’ part of the histogram (the buckets that contain very few points) is considered the anomaly area of the model.
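
One way to turn the long-tail reading into code is to flag points that land in sparsely populated buckets. The sketch below assumes equal-width buckets and a cut-off of 5% of the data per bucket; both choices, the sample data and the function name `histogram_outliers` are assumptions for illustration.

```python
def histogram_outliers(values, n_bins=10, min_fraction=0.05):
    """Flag values that fall into buckets holding less than `min_fraction` of the data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_of(x):
        # The maximum value is placed in the last bucket.
        return min(int((x - lo) / width), n_bins - 1)

    counts = [0] * n_bins
    for x in values:
        counts[bin_of(x)] += 1

    n = len(values)
    return [x for x in values if counts[bin_of(x)] / n < min_fraction]


# Hypothetical data: 26 values between 10 and 13, plus one value far in the tail.
values = [10, 11, 12, 13] * 6 + [11, 12, 40]

print(histogram_outliers(values, n_bins=6))  # [40]
```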

What is a problem with using a histogram for anomaly detection?

Choosing the number of buckets (along the x-axis) so that the histogram captures the data effectively.
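
The sensitivity to the bucket count can be seen by running the same rare-bucket rule with a coarse and a very fine histogram. This is a sketch under assumed settings: equal-width buckets, a 5% rarity cut-off, made-up data and a hypothetical helper `rare_bucket_points`.

```python
from collections import Counter


def rare_bucket_points(values, n_buckets, min_fraction=0.05):
    """Return values whose bucket holds less than `min_fraction` of all points."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0

    def bucket(x):
        return min(int((x - lo) / width), n_buckets - 1)

    counts = Counter(bucket(x) for x in values)
    return [x for x in values if counts[bucket(x)] / len(values) < min_fraction]


# Hypothetical data: 20 evenly spread normal values and one genuine outlier.
values = [10 + 0.5 * i for i in range(20)] + [40]

coarse = rare_bucket_points(values, n_buckets=3)
fine = rare_bucket_points(values, n_buckets=60)

print(len(coarse))  # 1  (only the value 40)
print(len(fine))    # 21 (every point sits alone in its own bucket)
```

With too many buckets every normal point ends up in a sparse bucket and is flagged; with too few, a real anomaly can share a bucket with normal points and be missed.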

What is a false positive in anomaly detection?

A normal point that is flagged as an anomaly. (This can happen when our histogram is too detailed, i.e. has too many buckets.)

What are the two methods for detecting proximity-based outliers?

1. Distance-based
2. Density-based

What is a distance-based outlier?

An object is considered a distance-based outlier if its neighbourhood doesn’t contain enough other points.
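
A brute-force sketch of this rule: count how many other points lie within a given radius and flag the point if the count is too small. The radius, the minimum neighbour count, the sample points and the function name `distance_based_outliers` are assumptions chosen for the example.

```python
from math import dist


def distance_based_outliers(points, radius, min_neighbours):
    """Flag points with fewer than `min_neighbours` other points within `radius`."""
    outliers = []
    for i, p in enumerate(points):
        neighbours = sum(
            1 for j, q in enumerate(points) if j != i and dist(p, q) <= radius
        )
        if neighbours < min_neighbours:
            outliers.append(p)
    return outliers


# Hypothetical 2-D points: a tight group and one isolated point.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5), (8, 8)]

print(distance_based_outliers(points, radius=1.5, min_neighbours=2))  # [(8, 8)]
```

This version compares every pair of points, so it is only practical for small data sets.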

What is a density-based outlier?

An object is considered a density-based outlier if its density is much lower than that of its neighbours.

What is the LOF method for finding density-based outliers?

General idea: For each point, calculate the density of its neighbourhood.

Compute the Local Outlier Factor (LOF): the average ratio of the density of each of the point's nearest neighbours to the density of the point p itself.

Outliers are the points with high LOF (well above 1), because their density is much lower than that of their neighbours.
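
A simplified LOF-style score can be sketched as below. It uses the common convention in which the score is the average neighbour density divided by the point's own density, so values near 1 are normal and values well above 1 suggest an outlier. It is an assumption-laden sketch rather than the full algorithm: density is taken as k divided by the summed distance to the k nearest neighbours, the reachability distance of the original LOF method is omitted, points are assumed distinct, and the names `knn`, `density` and `simple_lof` are made up for the example.

```python
from math import dist


def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i]."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: dist(points[i], points[j]))[:k]


def density(points, i, k):
    """Density = k / (sum of distances to the k nearest neighbours)."""
    total = sum(dist(points[i], points[j]) for j in knn(points, i, k))
    return k / total  # assumes distinct points, so total > 0


def simple_lof(points, k=3):
    """Average neighbour density divided by the point's own density."""
    dens = [density(points, i, k) for i in range(len(points))]
    return [
        sum(dens[j] for j in knn(points, i, k)) / (k * dens[i])
        for i in range(len(points))
    ]


# Hypothetical 2-D points: four clustered points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]

scores = simple_lof(points, k=3)
for p, s in zip(points, scores):
    print(p, round(s, 2))
# The clustered points score about 1; the isolated point scores well above 1.
```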

How do we measure density?

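Density = k / (sum of the distances to the k nearest neighbours), i.e. the inverse of the average distance to them. The density of a point is then compared with the densities of its nearest neighbours.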

Can you get a different result for density/distance based outliers?

Yes

If you have clustering, how do you determine if there are outliers?

An object is a suspected outlier if any of the following holds:
It doesn't belong to any cluster.
There is a large distance between the object and its closest cluster.
It belongs to a very small or sparse cluster.

What is the Case 1 (far from closest cluster) way of testing for outliers?

Use k-means to build clusters, then measure each object's distance to its closest centre. If its distance is much higher than the average, then it is likely an outlier.
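
The Case 1 test can be sketched with a small hand-written k-means followed by a distance check. Several things here are assumptions for illustration: the initial centres are supplied by hand to keep the run deterministic, "much higher than the average" is interpreted as more than twice the average distance, and the data and the names `kmeans` and `far_from_centre` are invented for the example.

```python
from math import dist
from statistics import mean


def kmeans(points, centres, iterations=10):
    """Plain Lloyd's algorithm starting from the given initial centres."""
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda c: dist(p, centres[c]))
            clusters[nearest].append(p)
        # Move each centre to the mean of its members (unchanged if empty).
        centres = [
            tuple(mean(coord) for coord in zip(*members)) if members else centre
            for members, centre in zip(clusters, centres)
        ]
    return centres


def far_from_centre(points, centres, factor=2.0):
    """Flag points whose distance to the closest centre exceeds factor * average."""
    distances = [min(dist(p, c) for c in centres) for p in points]
    average = mean(distances)
    return [p for p, d in zip(points, distances) if d > factor * average]


# Hypothetical 2-D points: two groups and one point far from both.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9), (5, 14)]

centres = kmeans(points, centres=[(1, 1), (9, 9)])
print(far_from_centre(points, centres))  # [(5, 14)]
```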

What is the Case 2 (outliers in small clusters) way of testing for outliers?

Assign a cluster-based local outlier factor (CBLOF) to each object p.
If p belongs to a large cluster: CBLOF = cluster size * similarity between p and its cluster.
If p belongs to a small cluster: CBLOF = cluster size * similarity between p and the closest large cluster.
Objects with low CBLOF scores are suspected outliers.
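
The two CBLOF cases can be sketched as follows, taking the clusters as already computed. The card does not define the similarity measure or what counts as a large cluster, so the sketch assumes similarity = 1 / (1 + distance to the cluster centroid) and a minimum-size cut-off; the data and the names `centroid`, `similarity` and `cblof_scores` are likewise hypothetical. At least one cluster must qualify as large.

```python
from math import dist
from statistics import mean


def centroid(members):
    """Coordinate-wise mean of a cluster's points."""
    return tuple(mean(coord) for coord in zip(*members))


def similarity(p, centre):
    """Assumed similarity: 1.0 at the centre, decreasing with distance."""
    return 1.0 / (1.0 + dist(p, centre))


def cblof_scores(clusters, large_min_size):
    """CBLOF = cluster size * similarity to own cluster (large) or closest large cluster (small)."""
    centres = [centroid(members) for members in clusters]
    large = [i for i, members in enumerate(clusters) if len(members) >= large_min_size]

    scores = {}
    for i, members in enumerate(clusters):
        for p in members:
            if i in large:
                sim = similarity(p, centres[i])
            else:
                # The most similar large cluster is the closest one.
                sim = max(similarity(p, centres[j]) for j in large)
            scores[p] = len(members) * sim
    return scores


# Hypothetical clustering result: two large clusters and one singleton cluster.
clusters = [
    [(1, 1), (1, 2), (2, 1), (2, 2)],
    [(8, 8), (8, 9), (9, 8), (9, 9), (8.5, 8.5)],
    [(5, 14)],
]

scores = cblof_scores(clusters, large_min_size=3)
print(min(scores, key=scores.get))  # (5, 14)
```

The point in the singleton cluster gets the lowest score because both its cluster size and its similarity to the closest large cluster are small.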

What is a limitation of a cluster-based method?

High computational cost

Anomaly Detection Flashcards

(28 cards)