Anomaly Detection Flashcards

1
Q

What are anomalies/outliers?

A

The set of data points that differ markedly from the rest of the data.

2
Q

What is the task when looking for anomalies/outliers?

A

Find all the data points with anomaly scores greater than a threshold that you have defined.
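
A minimal sketch of the thresholding step, assuming each point already has a numeric anomaly score. The function name `flag_anomalies` and the example scores are hypothetical, chosen only for illustration.

```python
def flag_anomalies(scores, threshold):
    """Return the indices of points whose anomaly score exceeds the threshold."""
    return [i for i, score in enumerate(scores) if score > threshold]


# Hypothetical anomaly scores for five data points.
scores = [0.1, 0.3, 2.7, 0.2, 3.1]
print(flag_anomalies(scores, threshold=2.0))  # [2, 4]
```

How the scores are produced and where the threshold sits are the real design decisions; the comparison itself is trivial.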

3
Q

What are some applications for Anomaly/Outlier detection?

A

Fraud detection

4
Q

Are outliers different from noise data?

A

Yes. Noise is random error.
Noise should be removed before outlier detection.
Outliers are interesting.

5
Q

What’s the difference between outlier detection and novelty detection?

A

Novelty is eventually

6
Q

What is a challenge of anomaly detection?

A

Anomaly detection is typically unsupervised (like clustering), so there are usually no labels to validate the results against.

7
Q

How do you build an anomaly detector?

A

Build a profile of what is normal and then detect anything that is different

8
Q

What kind of outliers are there?

A

Global Outliers
Contextual Outliers
Collective Outliers

9
Q

What is a global outlier?

A

A point that significantly deviates from the rest of the data set.

Issue: you need a measure of how much a point deviates.

10
Q

What is a contextual outlier?

A

An object that deviates significantly with respect to a selected context.

E.g., is 40 degrees Celsius an outlier? In winter, yes. In summer, no.
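
The temperature example can be sketched by scoring each reading against its own context only. This is an illustrative sketch, not a standard algorithm: the function name `contextual_outliers`, the z-score rule, the threshold of 2.0, and the readings are all assumptions.

```python
from statistics import mean, pstdev


def contextual_outliers(readings, z_threshold=2.0):
    """readings: list of (context, value) pairs.

    Flag a value when it lies more than z_threshold standard deviations
    from the mean of the values that share its context.
    """
    by_context = {}
    for context, value in readings:
        by_context.setdefault(context, []).append(value)

    flagged = []
    for context, value in readings:
        values = by_context[context]
        mu, sigma = mean(values), pstdev(values)
        if sigma > 0 and abs(value - mu) / sigma > z_threshold:
            flagged.append((context, value))
    return flagged


# Hypothetical temperature readings in degrees Celsius.
readings = [("winter", t) for t in [2, 4, 3, 5, 1, 3, 40]]
readings += [("summer", t) for t in [38, 40, 39, 41, 37]]
print(contextual_outliers(readings))  # [('winter', 40)]
```

The same value of 40 appears in both seasons, but only the winter reading is flagged.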

11
Q

What are collective outliers?

A

No single object looks like an outlier on its own, but when you bring many objects together, the group starts to look like an outlier.

Example (sports team): a good player such as Neymar is, individually, just like Messi or Ronaldo. But when you put them together on one good team, the team becomes an anomaly.

12
Q

What is a statistical scheme?

A

Assume the objects are generated by a statistical model.

Identify objects in low probability regions of the model as outliers.

Two types: Parametric/Non-parametric

13
Q

What is a parametric model?

A

A model that describes the distribution of the data

If a point has low probability under the model, then it is an outlier.

Find the mean and the standard deviation.
Check each point's difference from the mean. If it is greater than a threshold (e.g. a multiple of the standard deviation), then it is an anomaly.
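
A minimal sketch of the mean and standard deviation check, assuming the data are roughly normally distributed. The function name `zscore_outliers`, the threshold of 3 standard deviations, and the sample values are assumptions made for illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [x for x in values if abs(x - mu) / sigma > threshold]


# Hypothetical measurements with one extreme value.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 50]
print(zscore_outliers(data))  # [50]
```

The outlier itself inflates the mean and the standard deviation, so on very small samples a single extreme point may not clear a threshold of 3.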

14
Q

What is a limitation of a parametric scheme?

A

The data do not always follow a normal distribution.

It can be problematic for high-dimensional data.

15
Q

What do you use to model a non-parametric scheme?

A

A histogram

16
Q

How do we interpret a histogram for anomaly detection?

A

The ‘long tail’ part of the histogram is considered the anomaly area of the model.
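
One way to sketch the histogram idea is to flag points that fall into sparsely populated bins. The function name `histogram_outliers`, the equal-width binning, and the `min_fraction` cutoff are assumptions for illustration, not a standard recipe.

```python
def histogram_outliers(values, num_bins=5, min_fraction=0.1):
    """Flag values that land in bins holding less than min_fraction of the data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        return []  # all values identical, a single bin holds everything

    def bin_of(x):
        # The maximum value is clamped into the last bin.
        return min(int((x - lo) / width), num_bins - 1)

    counts = [0] * num_bins
    for x in values:
        counts[bin_of(x)] += 1

    cutoff = min_fraction * len(values)
    return [x for x in values if counts[bin_of(x)] < cutoff]


# Hypothetical data with one value far out in the tail.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 20]
print(histogram_outliers(values))  # [20]
```

The result depends on `num_bins`: with many narrow bins, ordinary points can end up alone in a bin and get flagged as well.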

17
Q

What is a problem with using a histogram for analysis of anomaly detection?

A

Choosing the number of buckets (along the x-axis) so that the histogram captures the data effectively.

18
Q

What is a false positive in anomaly detection?

A

A point flagged as an anomaly that is in fact not an anomaly. (Our histogram is too detailed.)

19
Q

What are the two methods for detecting proximity-based outliers?

A

1. Distance-based
2. Density-based

20
Q

What is a distance-based outlier?

A

An object is considered a distance-based outlier if its neighbourhood doesn’t have enough other points.
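
A minimal sketch of this definition, assuming Euclidean distance and a fixed-radius neighbourhood. The function name and the parameters `radius` and `min_neighbours` are hypothetical and must be chosen by the user.

```python
from math import dist


def distance_based_outliers(points, radius, min_neighbours):
    """Flag points with fewer than min_neighbours other points within radius."""
    outliers = []
    for i, p in enumerate(points):
        neighbours = sum(
            1 for j, q in enumerate(points) if j != i and dist(p, q) <= radius
        )
        if neighbours < min_neighbours:
            outliers.append(p)
    return outliers


# Hypothetical 2-D points: a tight square plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
print(distance_based_outliers(points, radius=1.5, min_neighbours=2))  # [(8, 8)]
```

This brute-force version compares every pair of points, so it is only practical for small data sets.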

21
Q

What is a density-based outlier?

A

An object is considered a density-based outlier if its density is much lower than that of its neighbours.

22
Q

What is the LOF method for finding density-based outliers?

A

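General idea: for each point, calculate the density of its neighbourhood.

Compute the Local Outlier Factor (LOF): the average of the ratios of the density of the sample p to the density of each of its nearest neighbours.

With the ratio defined this way (density of p over density of its neighbours), outliers are the points with low LOF. The original LOF definition inverts the ratio, so there outliers have LOF well above 1.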

23
Q

How do we measure density?

A

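Density = k / (sum of the distances to the k nearest neighbours), i.e. the inverse of the average distance to the k nearest neighbours. Alternatively, compare a point's density with the density of its set of N nearest neighbours.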
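
A sketch of this density measure, plus a relative density that divides a point's density by the average density of its k nearest neighbours. This is a simplified illustration, not the full LOF algorithm (which uses reachability distances); the function names are hypothetical and the code assumes no duplicate points.

```python
from math import dist


def knn_density(points, i, k):
    """Inverse of the average distance from points[i] to its k nearest neighbours."""
    dists = sorted(dist(points[i], q) for j, q in enumerate(points) if j != i)
    return k / sum(dists[:k])


def relative_density(points, i, k):
    """Density of points[i] divided by the average density of its k nearest neighbours."""
    others = [j for j in range(len(points)) if j != i]
    neighbours = sorted(others, key=lambda j: dist(points[i], points[j]))[:k]
    avg_neighbour_density = sum(knn_density(points, j, k) for j in neighbours) / k
    return knn_density(points, i, k) / avg_neighbour_density


# Hypothetical 2-D points: a tight square plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
for i, p in enumerate(points):
    print(p, round(relative_density(points, i, k=2), 3))
```

Points inside the square get a relative density of about 1, while the isolated point gets a value far below 1, which marks it as a density-based outlier under this ratio.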

24
Q

Can you get a different result for density/distance based outliers?

A

Yes

25
Q

If you have clustering, how do you determine if there are outliers?

A

The object doesn’t belong to any cluster.
There is a large distance between the object and its closest cluster.
The object belongs to a very small or sparse cluster.

26
Q

What is the "Case 1: far from closest cluster" way of testing for outliers?

A

Use k-means to build clusters, then measure each point's distance to its closest centre. If its distance is higher than average, then it is likely an outlier.
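
A small sketch of this case using a plain Lloyd's-algorithm k-means written out by hand. The initial centres, the `factor` multiplier on the average distance, and the sample points are assumptions for illustration; in practice a library k-means would normally be used.

```python
from math import dist


def kmeans(points, centres, iterations=10):
    """Plain Lloyd's algorithm starting from the given initial centres."""
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda c: dist(p, centres[c]))
            clusters[nearest].append(p)
        # Move each centre to the mean of its members (keep it if the cluster is empty).
        centres = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centre
            for members, centre in zip(clusters, centres)
        ]
    return centres


def far_from_centre(points, centres, factor=2.0):
    """Flag points whose distance to the closest centre exceeds factor * average."""
    nearest = [min(dist(p, c) for c in centres) for p in points]
    average = sum(nearest) / len(nearest)
    return [p for p, d in zip(points, nearest) if d > factor * average]


# Hypothetical data: two tight groups and one stray point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 20)]
centres = kmeans(points, centres=[(0, 0), (10, 10)])
print(far_from_centre(points, centres))  # [(5, 20)]
```

The stray point also pulls its cluster centre towards itself, which is one reason a multiplier above 1 is used here rather than the bare average.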

27
Q

What is the "Case 2: outliers in small clusters" way of testing for outliers?

A

Assign a cluster-based local outlier factor.

If p belongs to a large cluster: CBLOF = cluster size * similarity between p and its own cluster

If p belongs to a small cluster: CBLOF = cluster size * similarity between p and the closest large cluster

Points with low CBLOF scores are suspected outliers.
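
A sketch of the two CBLOF rules above, assuming the clusters are already known. Several choices here are assumptions not given in the card: similarity is taken as 1 / (1 + distance to the cluster centroid), a cluster counts as large when it has at least `min_large_size` members, and at least one large cluster must exist.

```python
from math import dist


def centroid(members):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coord) / len(members) for coord in zip(*members))


def cblof_scores(clusters, min_large_size):
    """Return a list of (point, score) pairs; low scores are suspected outliers."""
    centres = [centroid(members) for members in clusters]
    large_centres = [
        c for c, members in zip(centres, clusters) if len(members) >= min_large_size
    ]
    scores = []
    for members, centre in zip(clusters, centres):
        for p in members:
            if len(members) >= min_large_size:
                d = dist(p, centre)  # distance to its own cluster
            else:
                d = min(dist(p, c) for c in large_centres)  # closest large cluster
            similarity = 1.0 / (1.0 + d)  # assumed similarity function
            scores.append((p, len(members) * similarity))
    return scores


# Hypothetical clustering: two large clusters and one singleton cluster.
clusters = [
    [(0, 0), (0, 1), (1, 0), (1, 1)],
    [(10, 10), (10, 11), (11, 10), (11, 11)],
    [(5, 20)],
]
scores = cblof_scores(clusters, min_large_size=3)
print(min(scores, key=lambda item: item[1])[0])  # (5, 20)
```

The singleton scores lowest both because its cluster size is 1 and because it is far from every large cluster.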

28
Q

What is a limitation of a cluster-based method?

A

High computational cost