3 ML Models of Dispersion (PCA, Clustering, Anomalies) Flashcards

1
Q

Do visualizations give quantifiable insights?

A

No. Visualizations give qualitative impressions; quantifying dispersion requires explicit measures.

2
Q

What is dispersion?

A

Dispersion indicates how much variation there is in some dataset

3
Q

Number of Distinct Data Points
Radius of Minimum Enclosing Sphere
Average Squared Euclidean Distance from the Dataset Mean m

A

Various possible measures of dispersion
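A minimal NumPy sketch of how these measures could be computed; the toy data and variable names are illustrative, and the enclosing-sphere value is only an upper bound:

    import numpy as np

    X = np.random.randn(200, 3)                       # toy dataset: 200 points, 3 features
    m = X.mean(axis=0)                                # dataset mean m

    n_distinct = len(np.unique(X, axis=0))            # number of distinct data points

    # Rough stand-in for the minimum enclosing sphere: distance from the mean to the
    # farthest point (an upper bound; the exact radius needs a dedicated solver)
    radius_bound = np.linalg.norm(X - m, axis=1).max()

    avg_sq_dist = np.mean(np.sum((X - m) ** 2, axis=1))   # average squared distance from m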

4
Q

Principal Component Analysis

A

Principal Component Analysis (PCA) is a statistical technique that transforms data to identify the directions (principal components) that maximize variance while minimizing reconstruction error, thereby reducing dimensionality while retaining essential information.

5
Q

Why is it important to maximize dispersion in PCA?

A

Dispersion = Variance = Information; we want to retain as much information as possible while reducing dimensionality.

6
Q

The Lagrange Multiplier Method

A

The Lagrange multiplier method is used in PCA to maximize the variance of the data projections onto a direction u, subject to the constraint that u is a unit vector, by incorporating the constraint into the variance objective through the Lagrange function.

In PCA, the constraint is that the direction vector u must be a unit vector, expressed mathematically as ∥u∥^2 = 1 (or equivalently, u⊤u = 1). This ensures that the projections of the data maintain their relative scales and that only the directionality of the data’s variance is captured.
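Written out as a sketch in LaTeX (C denotes the data covariance matrix, a symbol assumed here rather than taken from the card):

    \max_{u}\; u^\top C u \quad \text{s.t.} \quad u^\top u = 1

    \mathcal{L}(u, \lambda) = u^\top C u - \lambda\,(u^\top u - 1)

    \frac{\partial \mathcal{L}}{\partial u} = 2Cu - 2\lambda u = 0
    \quad\Rightarrow\quad Cu = \lambda u

So the optimal u is an eigenvector of the covariance matrix, and the captured variance u⊤Cu equals its eigenvalue λ (which is why components are ranked by λ, as in the next card).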

7
Q

How do I rank principal components?

A

By lambda (the eigenvalue) obtained from the Lagrangian formulation: components are ranked in decreasing order of their eigenvalues.

8
Q

Is PCA able to work with outliers?

A

No. Outliers must be cleaned from the data before doing PCA, because they distort the directions of maximum variance.

9
Q

Data must be centered to capture the true PCA directions. How is this done?

A

Set the mean to zero by subtracting each feature’s mean from the data points, so that every feature has a mean of 0. (This differs from standardization, because the standard deviation after centering is not necessarily 1.)
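A one-line NumPy sketch (the toy data is illustrative):

    import numpy as np

    X = np.random.randn(100, 5)                 # toy data: 100 samples, 5 features
    X_centered = X - X.mean(axis=0)             # subtract each feature's (column's) mean
    print(np.allclose(X_centered.mean(axis=0), 0.0))   # True: every feature now has mean 0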

10
Q

Does PCA work for strongly non-Gaussian data?

A

No. PCA captures global linear directions of variation; strongly non-Gaussian data varies locally in different directions, so a single set of global components does not describe it well.

11
Q

PCA process for multiple components

A

You take a vector x and identify the direction in which the data varies most (the first principal component). The data can then be decomposed into:

  1. What PCA can capture (the projection onto that direction)
  2. The residual (what PCA cannot capture)

From the residual, a second principal component can be found, and so on. Each additional component captures information from the original data that the previous components missed.
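A minimal NumPy sketch of this capture-and-residual (deflation) view; the loop, array names, and the choice of two components are illustrative:

    import numpy as np

    X = np.random.randn(300, 4)
    residual = X - X.mean(axis=0)                    # start from the centered data
    components = []
    for _ in range(2):                               # extract two principal components
        C = residual.T @ residual / len(residual)    # covariance of what is still unexplained
        u = np.linalg.eigh(C)[1][:, -1]              # direction of largest remaining variance
        components.append(u)
        captured = residual @ np.outer(u, u)         # 1. what this component can capture
        residual = residual - captured               # 2. the residual, fed to the next round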

12
Q

Code for directly computing all eigenvectors/eigenvalues and sorting them by decreasing eigenvalues for PCA in Python (the way with Covariance Matrix (classical way))

A

numpy.linalg.eigh
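A minimal sketch of the classical covariance-matrix route (the toy data and variable names are illustrative):

    import numpy as np

    X = np.random.randn(500, 10)
    Xc = X - X.mean(axis=0)                          # PCA assumes centered data

    C = np.cov(Xc, rowvar=False)                     # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)             # eigh handles symmetric matrices; ascending order

    order = np.argsort(eigvals)[::-1]                # re-sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    scores = Xc @ eigvecs                            # data projected onto the principal components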

13
Q

PCA Biplot

A

A visual way to look at data using PCA results.

It uses the 2 principal components (directions) that explain the most variation in the data; these two directions form the x and y axes. Each data point on the graph represents an instance from the original data.

For each feature (column) in the original data, an arrow is drawn in the direction that the feature most influences; this arrow is called a “loading vector”. Alternatively, one feature can be shown with color instead of a loading vector.

14
Q

Generalized Variance in PCA

A

The total dispersion of the data. It is also called the trace.

It is the sum of all diagonal elements (the per-feature variances) of the covariance matrix (which captures the covariances between features), and this sum equals the sum of the covariance matrix’s eigenvalues.

15
Q

? = Sum of all Eigenvalues in Covariance Matrix

A

Trace (total variance captured in dataset)

16
Q

How would a biplot with two PC’s with eigenvalues that equate to 80% and 15% of the total dispersion respectively look?

A

It would look nearly linear, with the points concentrated along a line.

This is because one direction is responsible for most of the information (variance), so the other direction is largely dependent on it.

17
Q

How would a biplot with two PC’s with eigenvalues that equate to 45% and 45% of the total dispersion respectively look?

A

It would look like a scattered, roughly spherical cloud with no visible correlation.

This is because no single direction is responsible for most of the information (variance), so the two components are largely independent.

18
Q

PCA Scree Plot

A

A bar plot with one bar per principal component: the y-axis is the explained variance (eigenvalue) and the x-axis is the component. (It can also be drawn as a cumulative plot.)
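A matplotlib sketch of the per-component version (toy data; plotting details are illustrative):

    import numpy as np
    import matplotlib.pyplot as plt

    X = np.random.randn(200, 6)
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigh(np.cov(Xc, rowvar=False))[0][::-1]   # eigenvalues, decreasing

    plt.bar([f"PC{i + 1}" for i in range(len(eigvals))], eigvals / eigvals.sum())
    plt.ylabel("Explained variance ratio")       # y-axis: share of total dispersion per component
    plt.show()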

19
Q

What is artifact removal?

A

Sometimes the principal component giving off the most information is not relevant to the problem, e.g., eye blinks in EEG recordings when the goal is to extract neuronal activity.

In this case we ignore that principal component. Such an artifact typically has a high eigenvalue in comparison to the total dispersion.

20
Q

What is denoising?

A

Ignoring principal components with small eigenvalues in order to focus on the main components that describe the data variation (i.e., removing noise).
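A sketch of denoising by keeping only the top-k components and reconstructing (k = 3 and the array names are arbitrary choices):

    import numpy as np

    X = np.random.randn(400, 8)
    mean = X.mean(axis=0)
    Xc = X - mean

    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    top = eigvecs[:, np.argsort(eigvals)[::-1][:3]]    # the k = 3 largest components

    X_denoised = Xc @ top @ top.T + mean               # project onto them, map back, re-add the mean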

21
Q

Problem with Covariance Matrix

A

The covariance matrix has dimensions d×d (where d is the number of features/columns in the dataset). For many data types (e.g., images or biological samples), the number of features can exceed 10,000, which makes the covariance matrix very costly to store in memory.

SVD is an efficient alternative that avoids building the covariance matrix, useful when there are fewer rows than columns.

22
Q

Code for directly computing all eigenvectors/eigenvalues and sorting them by decreasing eigenvalues for PCA in Python (the way with Singular Value Decomposition (efficient way))

A

numpy.linalg.svd(M, full_matrices=False)

‘full_matrices=False’ returns the reduced (economy) SVD, keeping only the components that can have non-zero singular values rather than padding with ones whose singular values are necessarily 0.
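A sketch of this route; M is assumed to be the centered data matrix, and the conversion from singular values to eigenvalues uses the standard relation eigenvalue = s^2 / N:

    import numpy as np

    X = np.random.randn(50, 10000)                    # many more features than observations
    M = X - X.mean(axis=0)                            # centered data matrix

    U, s, Vt = np.linalg.svd(M, full_matrices=False)  # never forms the 10000 x 10000 covariance matrix
    components = Vt                                   # rows: principal directions, sorted by decreasing s
    eigvals = s ** 2 / len(M)                         # corresponding eigenvalues (1/N convention)
    scores = U * s                                    # projections of the data onto the components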

23
Q

When to use SVD instead of Covariance Matrix

A

When there are more columns (features) than rows (observations) in the large dataset.

24
Q

Power Iteration Technique

A

POWIT (power iteration) is an iterative technique used to find the dominant eigenvector of a matrix (the one with the largest eigenvalue) by repeatedly multiplying a vector by the matrix and re-normalizing it.
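A minimal power-iteration sketch (the iteration count and the Rayleigh-quotient step are standard choices, not from the deck):

    import numpy as np

    def power_iteration(M, n_iter=1000):
        """Return the dominant eigenvector and eigenvalue of a square symmetric matrix M."""
        v = np.random.randn(M.shape[0])
        for _ in range(n_iter):
            v = M @ v
            v /= np.linalg.norm(v)                   # re-normalize at every step
        return v, v @ M @ v                          # eigenvector and Rayleigh-quotient eigenvalue

    C = np.cov(np.random.randn(200, 5), rowvar=False)
    u1, lam1 = power_iteration(C)                    # leading principal component direction of C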

25
Q

Anomalies

A

Points that escape a model’s prediction capability

26
Q

Classical approach for anomaly detection

A

Z-score = (x - mean) / standard deviation, where x is the data point.

Z between -1 and 1 is not unusual; Z between -3 and -2 or between 2 and 3 is moderately unusual; beyond |Z| = 3 is an outlier.
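A NumPy sketch of the rule applied to toy data (the single extreme value is illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(10, 1, 500), [25.0]])   # 500 typical values plus one extreme point
    z = (x - x.mean()) / x.std()
    outliers = x[np.abs(z) > 3]                            # points beyond |Z| = 3 are flagged as outliers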

27
Q

Mahalanobis Distance

A

Measures the distance between a data point and the center of a distribution when there are multiple correlated variables (e.g., weight and height, which can be correlated).

Z = √( ((x - mean)^T) (Σ^-1) (x - mean) ), where Σ^-1 is the inverse covariance matrix
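A NumPy sketch of the formula (the correlated toy data is illustrative; scipy.spatial.distance.mahalanobis computes the same quantity for a single point):

    import numpy as np

    X = np.random.randn(300, 2) @ np.array([[1.0, 0.6], [0.0, 0.8]])   # two correlated variables
    mean = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))                   # inverse covariance matrix

    diffs = X - mean
    d = np.sqrt(np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs))       # one Mahalanobis distance per point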

28
Q

Limits of Mahalanobis Distance

A
  1. In high dimensions (more features than data points), spurious sample correlations that do not reflect the true population correlations appear in the covariance matrix, and the Mahalanobis distance then gives inflated anomaly scores. (This can be fixed by adding a small diagonal term to the covariance matrix to make the anomaly score more robust.)
  2. When the distribution is non-Gaussian, the Mahalanobis distance does not describe the data well.
29
Q

Boundary Based Anomaly Detection

A

Focused on finding a boundary around most normal data points, such that any points outside the boundary are considered anomalies.

30
Q

Objective function for finding a minimum enclosing shape (boundary based anomaly detection), i.e., one that encloses the majority of the data points.

A

S = size of the shape (the radius if the shape is a sphere)
c = center of the boundary
Ei = slack variables that allow some points to fall outside the boundary (non-negative)
C = parameter that controls the trade-off between a tight boundary and allowing some points to fall outside

We want to minimize S + C·∑Ei
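Written as a full constrained problem in LaTeX, consistent with the symbols above and with the constraint gi given two cards later:

    \min_{S,\,c,\,E}\; S + C \sum_i E_i
    \quad \text{s.t.} \quad h(x_i, c) \le S + E_i, \;\; E_i \ge 0 \;\; \forall i

Here h(x_i, c) is the distance of point x_i from the center c (for a sphere, ∥x_i − c∥^2).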

31
Q

What are Karush-Kuhn-Tucker (KKT) conditions in boundary based anomaly detection?

A

Necessary conditions for a solution to be optimal in a constrained optimization problem.

  1. Stationarity (at the solution, no further reduction of the Lagrangian is possible; its gradient is zero)
  2. Primal Feasibility (the solution meets all of the problem’s original constraints, e.g., it stays within stated limits of weight and volume and respects non-negativity)
  3. Dual Feasibility (the Lagrange multipliers, i.e., the penalty weights in the Lagrange function, are non-negative)
  4. Complementary Slackness (ensures that only the active constraints, those that are tight at the solution, influence it)
32
Q

Constrained optimization problem with boundary based anomaly detection (where KKT framework conditions apply):

A

L(ø, lambda) = f(ø) + ∑( lambda_i * gi(ø) )

f(ø) is the objective function we want to minimize (S + C·∑Ei) and gi(ø) is one constraint function per constrained quantity (e.g., weight, height).

gi(ø) = h(xi, c) - S - Ei
- h = distance between the point and the center (it must be no greater than the size of the shape plus that point’s slack)

Lambda is the Lagrange multiplier (the weight applied to balance the objective function and the constraints).

33
Q

Lagrange Dual approach to performing dual optimization of L(ø, lambda) in boundary based anomaly detection:

A

First, minimize L with respect to ø (finding the minimum shape size for the given multipliers)

Then, maximize with respect to lambda (Lagrange multipliers) to ensure all constraints are fulfilled

34
Q

Lagrange Primal Approach to performing optimization of boundary based anomaly detection:

A

Solve for ø directly, aiming to minimize f(ø) while keeping the constraints (KKT conditions) satisfied along the way.

35
Q

SVDD: Support Vector Data Description

A

A form of boundary based anomaly detection that uses a spherical boundary to encapsulate normal data.

36
Q

Why solve the Dual Optimization Problem instead of the Primal for SVDD?

A

The primal optimization problem has 1 + N + d parameters, whereas the dual only has the alphas (at most N Lagrange multipliers).

N = number of data points
d = number of dimensions (features)
a = the non-zero lambdas (Lagrange multipliers, i.e., weights) that represent the influence of each data point on the boundary (a is non-zero only for points close to the boundary)

More parameters = more computationally expensive

37
Q

What parameters can be recovered from the Dual Optimization problem for SVDD?

A

The center of the hypersphere can be recovered from the KKT stationarity condition dL/dc = 0. This implies c = ∑(ai·xi), which is a weighted average of the points in the dataset.

The radius of the hypersphere (R, where R^2 = S) can be recovered from the KKT complementary slackness condition: R = ||c - xi|| for any support vector xi lying on the boundary.

38
Q

What is better (more robust) for skewed data: Mahalanobis (mean vector) or SVDD (center of hypersphere)?

A

Support vector data description

This is due to the ability of SVDD to focus mainly on the border of the distribution and not the internal distribution within the borders.

39
Q

On a high-dimensional image, which would be more blurry: Mahalanobis (mean vector) or SVDD (center of hypersphere)?

A

SVDD is more blurry.

This is due to focusing on border of distribution, which creates a softer boundary around data points compared to the tighter boundary that Mahalanobis distance based methods establish.

40
Q

Weakness of SVDD besides its slight blurriness (in images)

A

The boundary of SVDD is greatly distorted by strong outliers, so the algorithm must be robustified to prevent such distortions.

41
Q

Why is SVDD not naturally robust (in formula)?

A

Ei = ||xi - c||^2 - S, where Ei is the slack.

The penalty (slack) grows quadratically with the distance of the point xi from the center c. Therefore, the further an outlier is from the center, the more strongly the center is pulled towards that outlier.

42
Q

Formula to Robustify SVDD

A

Min (R + ( C * ∑max( 0, ||xi - c|| - R ) ) )

Now the penalty (Ei = ||xi - c|| - R) grows linearly (rather than quadratically) with the distance of the data point from the center.

43
Q

Cluster Dispersion

A

Data points are grouped closely together in certain areas, with gaps or low-density regions in between.

Pattern often results from factors that lead to groupings or clusters (social behaviors, resource availability, habitat suitability, etc)

44
Q

The Centroid Model (K-Means Clustering)

A

Each cluster is represented by a centroid µ that lies in the same d-dimensional space as the data. The members of the cluster (the data points assigned to it) are typically close to the centroid.

K-Means is the most well-known centroid-based clustering technique.

45
Q

How the K-Means algorithm works

A

Partition the dataset into K clusters by minimizing the variance (dispersion) within each cluster, in the following steps (see the sketch after this list):

  1. Random initialization (choose K initial cluster centroids)
  2. Assign each data point to the nearest centroid’s cluster
  3. Update the centroid of each cluster to be the mean of all points currently assigned to that cluster
  4. Repeat steps 2 and 3 until convergence (the centroids no longer move and the cluster assignments no longer change)
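A from-scratch NumPy sketch of these steps (K, the iteration cap, and the seeding are illustrative; empty clusters are not handled; sklearn.cluster.KMeans is the usual library route):

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), K, replace=False)]          # 1. random initialization
        for _ in range(n_iter):
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)                            # 2. assign to nearest centroid
            new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centroids, centroids):                # 4. stop once centroids no longer move
                break
            centroids = new_centroids                                # 3. update each centroid to its cluster mean
        return centroids, labels

    X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5])
    centroids, labels = kmeans(X, K=2)
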
46
Q

Goal of K-Means Algorithm

A

Minimize the objective function (or cost function), which is the sum of squared distances between each data point and the centroid of its assigned cluster.

47
Q

Calinski-Harabasz Index (K-Means)

A

(Between-Cluster Dispersion / Within-Cluster Dispersion) * ((N - K) / (K - 1)), where N is the number of data points and K the number of clusters.

The index can be used to evaluate the quality of a clustering solution: the higher the index value, the better the clustering.
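A sketch using scikit-learn (the toy blobs and the range of K values tried are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score

    X = np.vstack([np.random.randn(100, 2) + offset for offset in ([0, 0], [6, 0], [0, 6])])

    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, calinski_harabasz_score(X, labels))     # higher index = better clustering quality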

48
Q

Alternative approach to choose the number of clusters (K) in K-Means besides Calinski Harabasz Index:

A

Explained variance, which measures how much of the data’s total variance is captured by the clustering solution.

This involves computing the explained variance for a range of K values and looking for an inflection point in the resulting curve (called the elbow method).

49
Q

Limits of K-Means

A
  • The initialization point matters for the end result (it is advisable to run the K-means algorithm several times with different starting points)
  • It focuses on minimizing the variance within clusters but is not good at maximizing the variance between clusters
50
Q

Spectral Clustering

A
  1. Build a graph of the data points ->
  2. Construct the Laplacian matrix (which describes the graph) from the adjacency matrix and the degree matrix ->
  3. Extract its eigenvalues and eigenvectors ->
  4. Choose the smallest eigenvalues and their eigenvectors to extract the most important information about the graph structure ->
  5. Use the selected eigenvectors as features for the data points (the number of eigenvalues selected is the number of clusters) ->
  6. Run a clustering algorithm (e.g., K-means) on the new features to group the data points into clusters (see the sketch below)
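A from-scratch sketch of those six steps; the RBF affinity, its width gamma, and the unnormalized Laplacian are assumptions (sklearn.cluster.SpectralClustering packages the same idea):

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clustering(X, n_clusters, gamma=1.0):
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
        A = np.exp(-gamma * sq_dists)                    # 1. graph: RBF adjacency (affinity) matrix
        np.fill_diagonal(A, 0)
        D = np.diag(A.sum(axis=1))                       # degree matrix
        L = D - A                                        # 2. (unnormalized) graph Laplacian
        eigvals, eigvecs = np.linalg.eigh(L)             # 3. eigendecomposition, eigenvalues ascending
        features = eigvecs[:, :n_clusters]               # 4-5. eigenvectors of the smallest eigenvalues as features
        return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)   # 6.

    X = np.vstack([np.random.randn(80, 2), np.random.randn(80, 2) + 6])
    labels = spectral_clustering(X, n_clusters=2)
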
51
Q

When is it not true that, for spectral clustering, the number of zero eigenvalues equals the number of clusters?

A

When there is noise in the data. The non-noise scenario is simply the formal, theoretical scenario.

52
Q

Cheeger’s Inequality

A

The conductance of a graph is related to how separate or different its clusters are.

If the second smallest eigenvalue of the graph Laplacian is low, then the conductance is also low, which means the graph is easy to break into clusters.

High conductance means the graph is hard to break into clusters.

53
Q

Spectral Clustering Strength

A

Better focus on the border of the cluster than a direct application of K-means (can be robust to noise)

54
Q

Spectral Clustering Weaknesses

A

There are several parameters to tune (how to build the adjacency matrix, which Laplacian to use, how many eigenvectors to keep).

Clusters do not come with a prototype (centroid or mean), which makes each cluster difficult to interpret.