Chapter 24 One-Class Classification Flashcards

1
Q

What’s One Class Classification?

P 306

A

Identifying outliers in data is referred to as outlier or anomaly detection and a subfield of machine learning focused on this problem is referred to as one-class classification.

2
Q

In what type of imbalanced datasets can One Class Classification be useful? Give 2 examples

P 306

A

Although not designed for these types of problems, one-class classification algorithms can be effective for imbalanced classification datasets where:
1- there are no or very few examples of the minority class (such as a few tens of examples or fewer, P 308)
2- there is no coherent structure separating the classes that a supervised algorithm could learn

To be clear, this adaptation of one-class classification algorithms for imbalanced classification is unusual but can be effective on some problems. P 308

3
Q

How does One Class Classification work?

P 307

A

One-Class Classification, or OCC for short, involves fitting a model on the normal data and predicting whether new data is normal or an outlier/anomaly.

Given the nature of the approach, one-class classifiers are best suited to tasks where the positive cases don’t have a consistent pattern or structure in the feature space, making it hard for other classification algorithms to learn a class boundary. Treating the positive cases as outliers instead allows one-class classifiers to ignore the task of discrimination and focus on deviations from normal or what is expected.

4
Q

One must remember that the advantages of one-class classifiers come at a price of discarding all of available information about the minority class. Therefore, this solution should be used carefully and may not fit some specific applications. True/False

P 308

A

True

5
Q

The scikit-learn library provides an implementation of one-class SVM in the ____ class.

P 310

A

OneClassSVM

6
Q

What’s the difference between the standard SVM and OneClassSVM?

P 310

A

The main difference from a standard SVM is that OneClassSVM is fit in an unsupervised manner and does not provide the usual hyperparameters for tuning the margin, such as C. Instead, it provides a hyperparameter nu, a fraction in (0, 1] that acts as an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. It should be tuned to the approximate ratio of outliers in the data, e.g. nu=0.01 for about 1% outliers.
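A minimal sketch of how nu is passed in practice. The synthetic data, the random seed, and the choice of an RBF kernel are assumptions made for illustration and do not come from the book.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic data (assumed for illustration): 990 "normal" points
# around the origin and 10 outliers placed far away.
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(990, 2))
X_outlier = rng.uniform(low=6.0, high=8.0, size=(10, 2))
X = np.vstack([X_normal, X_outlier])

# nu is a fraction, so 0.01 corresponds to roughly 1% outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.01)
model.fit(X)  # unsupervised: no labels are passed

# predict() returns +1 for inliers and -1 for outliers.
pred = model.predict(X)
n_flagged = int((pred == -1).sum())
print(f"flagged {n_flagged} of {len(X)} examples as outliers")
```

The number of flagged examples will generally be close to nu times the dataset size, but it is not guaranteed to match exactly.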

7
Q

How does Isolation Forest work?

P 312

A

It is based on modeling the normal data in such a way as to isolate anomalies that are both few in number and different in the feature space.

A tree structure can be constructed to isolate every single instance. Because anomalies are more susceptible to isolation, they are isolated closer to the root of the tree, whereas normal points are isolated at the deeper end of the tree.
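The isolation idea can be illustrated with a toy, one-dimensional sketch: repeatedly split the data at a random value, keep the side that contains the point of interest, and count the splits until the point is alone. This is a simplification written for this card, not scikit-learn's implementation; the helper names `isolation_depth` and `mean_depth` are hypothetical.

```python
import numpy as np

def isolation_depth(x, data, rng):
    """Count random splits until x is alone in its partition (1-D toy version)."""
    depth = 0
    while len(data) > 1:
        lo, hi = data.min(), data.max()
        if lo == hi:  # only duplicates remain, cannot split further
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains x.
        data = data[data < split] if x < split else data[data >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=200)  # "normal" points
outlier = 10.0                            # one point far from the cluster
data = np.append(cluster, outlier)
inlier = cluster[0]

def mean_depth(x, trials=200):
    """Average isolation depth of x over several random trees."""
    return np.mean([isolation_depth(x, data, rng) for _ in range(trials)])

outlier_depth = mean_depth(outlier)
inlier_depth = mean_depth(inlier)
print(f"outlier: {outlier_depth:.1f} splits, inlier: {inlier_depth:.1f} splits")
```

On average the far-away point needs fewer splits than a point inside the cluster, which is the "closer to the root" behavior described above.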

8
Q

Isolation Forest (iForest) detects anomalies by employing distance or density measures. True/False

P 312

A

False, Isolation Forest (iForest) detects anomalies purely based on the concept of isolation without employing any distance or density measure

9
Q

What are the two most important hyperparameters of the IsolationForest class in scikit-learn?

P 312

A

The scikit-learn library provides an implementation of Isolation Forest in the IsolationForest class. Perhaps the most important hyperparameters of the model are:

  1. the n_estimators argument that sets the number of trees to create
  2. the contamination argument, which is used to help define the number of outliers in the dataset (as a fraction).
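A short sketch of setting both arguments. The synthetic data and the specific values (100 trees, 1% contamination) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (assumed): 990 normal points and 10 distant outliers.
rng = np.random.default_rng(1)
X_normal = rng.normal(0.0, 1.0, size=(990, 2))
X_outlier = rng.uniform(6.0, 8.0, size=(10, 2))
X = np.vstack([X_normal, X_outlier])

# n_estimators: number of trees; contamination: expected fraction of outliers.
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=1)
pred = model.fit_predict(X)      # +1 inlier, -1 outlier
scores = model.score_samples(X)  # lower score means more abnormal

n_flagged = int((pred == -1).sum())
print(f"flagged {n_flagged} of {len(X)} examples as outliers")
```

The contamination value sets the score threshold, so it mainly controls how many examples are flagged rather than how the trees are built.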
10
Q

Is Isolation Forest an unsupervised method?

External

A

Yes. Isolation Forest is an unsupervised machine learning algorithm: the trees are fit without any labels, so there is no pre-determined labeling of “outlier” or “not-outlier” in the dataset. The model is still fit to the data; what it does not use is a target variable.

11
Q

What’s Minimum Covariance Determinant and for what kind of distribution is it used?

P 313

A

If the input variables have a Gaussian distribution, then simple statistical methods can be used to detect outliers. For example, if the dataset has two input variables and both are Gaussian, then the feature space forms a multi-dimensional Gaussian, and knowledge of this distribution can be used to identify values far from the distribution. This approach can be generalized by defining a hyperellipsoid that covers the normal data; data that falls outside this shape is considered an outlier. An efficient implementation of this technique for multivariate data is known as the Minimum Covariance Determinant, or MCD for short.
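One common way to write "far from the distribution" is the squared Mahalanobis distance. The symbols below are assumptions introduced here, not notation from the book: $\hat{\mu}$ and $\hat{\Sigma}$ stand for the robust location and covariance estimates, $p$ for the number of features, and $\alpha$ for the chosen tail probability.

```latex
d^2(x) = (x - \hat{\mu})^{\top} \hat{\Sigma}^{-1} (x - \hat{\mu}),
\qquad
x \text{ is flagged as an outlier if } d^2(x) > \chi^2_{p,\,1-\alpha}
```

For Gaussian data the squared distance approximately follows a chi-squared distribution with $p$ degrees of freedom, which is why a chi-squared quantile is a natural cutoff; in practice the threshold can also be set from an assumed outlier fraction.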

12
Q

The scikit-learn library provides access to Minimum Covariance Determinant method via the ____ class. It provides the ____ argument that defines the expected ratio of outliers to be observed in practice.

P 314

A

EllipticEnvelope, contamination
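A minimal sketch of the class and argument named above. The correlated Gaussian data and the 1% contamination value are assumptions for illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Synthetic data (assumed): correlated Gaussian inliers plus 10 distant outliers.
rng = np.random.default_rng(2)
X_normal = rng.multivariate_normal(
    mean=[0.0, 0.0], cov=[[1.0, 0.6], [0.6, 1.0]], size=990
)
X_outlier = rng.uniform(6.0, 8.0, size=(10, 2))
X = np.vstack([X_normal, X_outlier])

# contamination: expected fraction of outliers in the data.
model = EllipticEnvelope(contamination=0.01, random_state=2)
model.fit(X)

pred = model.predict(X)      # +1 inlier, -1 outlier
dist = model.mahalanobis(X)  # squared Mahalanobis distance to the fitted center
print(f"flagged {int((pred == -1).sum())} of {len(X)} examples as outliers")
```

Because the method assumes roughly Gaussian inputs, results on strongly non-Gaussian features should be treated with caution.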

13
Q

How does LocalOutlierFactor work? What is its weakness?

P 316

A

A simple approach to identifying outliers is to locate those examples that are far from the other examples in the feature space. The local outlier factor (LOF) builds on this idea using nearest neighbors: each example receives a score based on how much lower its local density is than that of its neighbors. This can work well for feature spaces with low dimensionality (few features), but it can become less reliable as the number of features increases, a problem referred to as the curse of dimensionality.

The scikit-learn library provides an implementation of this approach in the LocalOutlierFactor class. When defining the model, the expected fraction of outliers in the dataset is indicated via the contamination argument.
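A brief sketch of the class in use. The synthetic data and the values n_neighbors=20 and contamination=0.01 are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data (assumed): 990 normal points and 10 distant outliers.
rng = np.random.default_rng(3)
X_normal = rng.normal(0.0, 1.0, size=(990, 2))
X_outlier = rng.uniform(6.0, 8.0, size=(10, 2))
X = np.vstack([X_normal, X_outlier])

model = LocalOutlierFactor(n_neighbors=20, contamination=0.01)

# In the default mode, labels are obtained with fit_predict() on the same data.
pred = model.fit_predict(X)           # +1 inlier, -1 outlier
lof = -model.negative_outlier_factor_ # near 1 for inliers, larger for outliers
print(f"flagged {int((pred == -1).sum())} of {len(X)} examples as outliers")
```

In this default mode the model scores the data it was fit on; scoring unseen data requires constructing the model with novelty=True and then calling predict().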

14
Q

OneClassSVM can be fit on all examples in the training dataset or just those examples in the majority class. True/False

P 310

A

True
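A sketch of the second option: fit only on majority-class training examples, then score predictions on a held-out set. The dataset parameters, the split, and the use of F1 on the outlier label are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM

# Synthetic imbalanced dataset (assumed): about 0.1% minority class.
X, y = make_classification(
    n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=2, stratify=y
)

# Fit on majority-class (label 0) training examples only.
model = OneClassSVM(gamma="scale", nu=0.01)
model.fit(X_train[y_train == 0])

# The model predicts +1 (inlier) or -1 (outlier), so remap the true labels:
# majority class 0 -> +1, minority class 1 -> -1.
y_hat = model.predict(X_test)
y_true = np.where(y_test == 1, -1, 1)

score = f1_score(y_true, y_hat, pos_label=-1)
print(f"F1 on the outlier class: {score:.3f}")
```

Labels are used here only to select the training rows and to evaluate; the model itself never sees them.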
