3 ML Models of Dispersion (PCA, Clustering, Anomalies) Flashcards

1
Q

Do visualizations give quantifiable insights?

A

No. Visualizations give qualitative impressions; quantifying dispersion requires explicit measures.

2
Q

What is dispersion?

A

Dispersion indicates how much variation there is in some dataset

3
Q

Number of Distinct Data Points
Radius of Minimum Enclosing Sphere
Average Squared Euclidean Distance from the Dataset Mean m

A

Various possible measures of dispersion
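A minimal NumPy sketch of how these measures could be computed; the toy data and variable names are illustrative, and the enclosing-sphere value is only an upper bound:

    import numpy as np

    X = np.random.randn(200, 3)                       # toy dataset: 200 points, 3 features
    m = X.mean(axis=0)                                # dataset mean m

    n_distinct = len(np.unique(X, axis=0))            # number of distinct data points

    # Rough stand-in for the minimum enclosing sphere: distance from the mean to the
    # farthest point (an upper bound; the exact radius needs a dedicated solver)
    radius_bound = np.linalg.norm(X - m, axis=1).max()

    avg_sq_dist = np.mean(np.sum((X - m) ** 2, axis=1))   # average squared distance from m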

4
Q

Principal Component Analysis

A

Principal Component Analysis (PCA) is a statistical technique that transforms data to identify the directions (principal components) that maximize variance while minimizing reconstruction error, thereby reducing dimensionality while retaining essential information.

5
Q

Why is it important to maximize dispersion in PCA?

A

Dispersion = Variance = Information; we want to retain as much information as possible while reducing dimensionality.

6
Q

The Lagrange Multiplier Method

A

The Lagrange multiplier method is used in PCA to maximize the variance of the data projections onto a direction u, subject to the constraint that u is a unit vector, by incorporating the constraint into the variance objective through the Lagrange function.

In PCA, the constraint is that the direction vector u must be a unit vector, expressed mathematically as ∥u∥^2 = 1 (or equivalently, u⊤u = 1). This ensures that the projections of the data maintain their relative scales and that only the directionality of the data’s variance is captured.
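Written out as a sketch in LaTeX (C denotes the data covariance matrix, a symbol assumed here rather than taken from the card):

    \max_{u}\; u^\top C u \quad \text{s.t.} \quad u^\top u = 1

    \mathcal{L}(u, \lambda) = u^\top C u - \lambda\,(u^\top u - 1)

    \frac{\partial \mathcal{L}}{\partial u} = 2Cu - 2\lambda u = 0
    \quad\Rightarrow\quad Cu = \lambda u

So the optimal u is an eigenvector of the covariance matrix, and the captured variance u⊤Cu equals its eigenvalue λ (which is why components are ranked by λ, as in the next card).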

7
Q

How do I rank principal components?

A

By lambda (the eigenvalue) obtained from the Lagrangian formulation: components are ranked in decreasing order of their eigenvalues.

8
Q

Is PCA able to work with outliers?

A

No. Outliers must be cleaned from the data before doing PCA, because they distort the directions of maximum variance.

9
Q

Data must be centered to capture the true PCA directions. How is this done?

A

Set the mean to zero by subtracting each feature’s mean from the data points, so that every feature has a mean of 0. (This differs from standardization, because the standard deviation after centering is not necessarily 1.)
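A one-line NumPy sketch (the toy data is illustrative):

    import numpy as np

    X = np.random.randn(100, 5)                 # toy data: 100 samples, 5 features
    X_centered = X - X.mean(axis=0)             # subtract each feature's (column's) mean
    print(np.allclose(X_centered.mean(axis=0), 0.0))   # True: every feature now has mean 0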

10
Q

Does PCA work for strongly non-Gaussian data?

A

No. PCA captures global linear directions of variation; strongly non-Gaussian data varies locally in different directions, so a single set of global components does not describe it well.

11
Q

PCA process for multiple components

A

You take a vector x and identify the direction in which the data varies most (the first principal component). The data can then be decomposed into:

  1. What PCA can capture (the projection onto that direction)
  2. The residual (what PCA cannot capture)

From the residual, a second principal component can be found, and so on. Each additional component captures information from the original data that the previous components missed.
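A minimal NumPy sketch of this capture-and-residual (deflation) view; the loop, array names, and the choice of two components are illustrative:

    import numpy as np

    X = np.random.randn(300, 4)
    residual = X - X.mean(axis=0)                    # start from the centered data
    components = []
    for _ in range(2):                               # extract two principal components
        C = residual.T @ residual / len(residual)    # covariance of what is still unexplained
        u = np.linalg.eigh(C)[1][:, -1]              # direction of largest remaining variance
        components.append(u)
        captured = residual @ np.outer(u, u)         # 1. what this component can capture
        residual = residual - captured               # 2. the residual, fed to the next round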

12
Q

Code for directly computing all eigenvectors/eigenvalues and sorting them by decreasing eigenvalues for PCA in Python (the way with Covariance Matrix (classical way))

A

numpy.linalg.eigh
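A minimal sketch of the classical covariance-matrix route (the toy data and variable names are illustrative):

    import numpy as np

    X = np.random.randn(500, 10)
    Xc = X - X.mean(axis=0)                          # PCA assumes centered data

    C = np.cov(Xc, rowvar=False)                     # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)             # eigh handles symmetric matrices; ascending order

    order = np.argsort(eigvals)[::-1]                # re-sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    scores = Xc @ eigvecs                            # data projected onto the principal components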

13
Q

PCA Biplot

A

A visual way to look at data using PCA results.

It uses the 2 principal components (directions) that explain the most variation in the data; these two directions form the x and y axes. Each data point on the graph represents an instance from the original data.

For each feature (column) in the original data, an arrow is drawn in the direction that the feature most influences; this arrow is called a “loading vector”. Alternatively, one feature can be shown with color instead of a loading vector.

14
Q

Generalized Variance in PCA

A

The total dispersion of the data. It is also called the trace.

It is the sum of all diagonal elements (the per-feature variances) of the covariance matrix (which captures the covariances between features), and this sum equals the sum of the covariance matrix’s eigenvalues.

15
Q

? = Sum of all Eigenvalues in Covariance Matrix

A

Trace (total variance captured in dataset)

16
Q

How would a biplot with two PC’s with eigenvalues that equate to 80% and 15% of the total dispersion respectively look?

A

It would look nearly linear, with the points concentrated along a line.

This is because one direction is responsible for most of the information (variance), so the other direction is largely dependent on it.

17
Q

How would a biplot with two PC’s with eigenvalues that equate to 45% and 45% of the total dispersion respectively look?

A

It would look like a scattered, roughly spherical cloud with no visible correlation.

This is because no single direction is responsible for most of the information (variance), so the two components are largely independent.

18
Q

PCA Scree Plot

A

A bar plot with one bar per principal component: the y-axis is the explained variance (eigenvalue) and the x-axis is the component. (It can also be drawn as a cumulative plot.)
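A matplotlib sketch of the per-component version (toy data; plotting details are illustrative):

    import numpy as np
    import matplotlib.pyplot as plt

    X = np.random.randn(200, 6)
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigh(np.cov(Xc, rowvar=False))[0][::-1]   # eigenvalues, decreasing

    plt.bar([f"PC{i + 1}" for i in range(len(eigvals))], eigvals / eigvals.sum())
    plt.ylabel("Explained variance ratio")       # y-axis: share of total dispersion per component
    plt.show()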

19
Q

What is artifact removal?

A

Sometimes the principal component giving off the most information is not relevant to the problem, e.g., eye blinks in EEG recordings when the goal is to extract neuronal activity.

In this case we ignore that principal component. Such an artifact typically has a high eigenvalue in comparison to the total dispersion.

20
Q

What is denoising?

A

Ignoring principal components with small eigenvalues in order to focus on the main components that describe the data variation (i.e., removing noise).
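A sketch of denoising by keeping only the top-k components and reconstructing (k = 3 and the array names are arbitrary choices):

    import numpy as np

    X = np.random.randn(400, 8)
    mean = X.mean(axis=0)
    Xc = X - mean

    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    top = eigvecs[:, np.argsort(eigvals)[::-1][:3]]    # the k = 3 largest components

    X_denoised = Xc @ top @ top.T + mean               # project onto them, map back, re-add the mean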

21
Q

Problem with Covariance Matrix

A

The covariance matrix has dimensions d×d (where d is the number of features/columns in the dataset). For many data types (e.g., images or biological samples), the number of features can exceed 10,000, which makes the covariance matrix very costly to store in memory.

SVD is an efficient alternative that avoids building the covariance matrix, useful when there are fewer rows than columns.

22
Q

Code for directly computing all eigenvectors/eigenvalues and sorting them by decreasing eigenvalues for PCA in Python (the way with Singular Value Decomposition (efficient way))

A

numpy.linalg.svd(M, full_matrices=False)

‘full_matrices=False’ returns the reduced (economy) SVD, keeping only the components that can have non-zero singular values rather than padding with ones whose singular values are necessarily 0.
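A sketch of this route; M is assumed to be the centered data matrix, and the conversion from singular values to eigenvalues uses the standard relation eigenvalue = s^2 / N:

    import numpy as np

    X = np.random.randn(50, 10000)                    # many more features than observations
    M = X - X.mean(axis=0)                            # centered data matrix

    U, s, Vt = np.linalg.svd(M, full_matrices=False)  # never forms the 10000 x 10000 covariance matrix
    components = Vt                                   # rows: principal directions, sorted by decreasing s
    eigvals = s ** 2 / len(M)                         # corresponding eigenvalues (1/N convention)
    scores = U * s                                    # projections of the data onto the components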

23
Q

When to use SVD instead of Covariance Matrix

A

When there are more columns (features) than rows (observations) in the large dataset.

24
Q

Power Iteration Technique

A

POWIT (power iteration) is an iterative technique used to find the dominant eigenvector of a matrix (the one with the largest eigenvalue) by repeatedly multiplying a vector by the matrix and re-normalizing it.
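A minimal power-iteration sketch (the iteration count and the Rayleigh-quotient step are standard choices, not from the deck):

    import numpy as np

    def power_iteration(M, n_iter=1000):
        """Return the dominant eigenvector and eigenvalue of a square symmetric matrix M."""
        v = np.random.randn(M.shape[0])
        for _ in range(n_iter):
            v = M @ v
            v /= np.linalg.norm(v)                   # re-normalize at every step
        return v, v @ M @ v                          # eigenvector and Rayleigh-quotient eigenvalue

    C = np.cov(np.random.randn(200, 5), rowvar=False)
    u1, lam1 = power_iteration(C)                    # leading principal component direction of C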

25
Q

Anomalies

A

Points that escape a model’s prediction capability

26
Q

Classical approach for anomaly detection

A

Z-score = (x - mean) / standard deviation, where x is the data point.

Z between -1 and 1 is not unusual; Z between -3 and -2 or between 2 and 3 is moderately unusual; beyond |Z| = 3 is an outlier.
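A NumPy sketch of the rule applied to toy data (the single extreme value is illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(10, 1, 500), [25.0]])   # 500 typical values plus one extreme point
    z = (x - x.mean()) / x.std()
    outliers = x[np.abs(z) > 3]                            # points beyond |Z| = 3 are flagged as outliers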

27
Q

Mahalanobis Distance

A

Measures the distance between a data point and the center of a distribution when there are multiple correlated variables (e.g., weight and height, which can be correlated).

Z = √( ((x - mean)^T) (Σ^-1) (x - mean) ), where Σ^-1 is the inverse covariance matrix
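A NumPy sketch of the formula (the correlated toy data is illustrative; scipy.spatial.distance.mahalanobis computes the same quantity for a single point):

    import numpy as np

    X = np.random.randn(300, 2) @ np.array([[1.0, 0.6], [0.0, 0.8]])   # two correlated variables
    mean = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))                   # inverse covariance matrix

    diffs = X - mean
    d = np.sqrt(np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs))       # one Mahalanobis distance per point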

28
Q

Limits of Mahalanobis Distance

A
  1. In high dimensions (more features than data points), spurious sample correlations that do not reflect the true population correlations appear in the covariance matrix, and the Mahalanobis distance then gives inflated anomaly scores. (This can be fixed by adding a small diagonal term to the covariance matrix to make the anomaly score more robust.)
  2. When the distribution is non-Gaussian, the Mahalanobis distance does not describe the data well.
29
Q

Boundary Based Anomaly Detection

A

Focused on finding a boundary around most normal data points, such that any points outside the boundary are considered anomalies.

30
Q

Objective function for finding a minimum enclosing shape (boundary based anomaly detection), i.e., one that encloses the majority of the data points.

A

S = size of the shape (the radius if the shape is a sphere)
c = center of the boundary
Ei = slack variables that allow some points to fall outside the boundary (non-negative)
C = parameter that controls the trade-off between a tight boundary and allowing some points to fall outside

We want to minimize S + C·∑Ei
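Written as a full constrained problem in LaTeX, consistent with the symbols above and with the constraint gi given two cards later:

    \min_{S,\,c,\,E}\; S + C \sum_i E_i
    \quad \text{s.t.} \quad h(x_i, c) \le S + E_i, \;\; E_i \ge 0 \;\; \forall i

Here h(x_i, c) is the distance of point x_i from the center c (for a sphere, ∥x_i − c∥^2).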

31
Q

What are Karush-Kuhn-Tucker (KKT) conditions in boundary based anomaly detection?

A

Necessary conditions for a solution to be optimal in a constrained optimization problem.

  1. Stationarity (at the solution, no further reduction of the Lagrangian is possible; its gradient is zero)
  2. Primal Feasibility (the solution meets all of the problem’s original constraints, e.g., it stays within stated limits of weight and volume and respects non-negativity)
  3. Dual Feasibility (the Lagrange multipliers, i.e., the penalty weights in the Lagrange function, are non-negative)
  4. Complementary Slackness (ensures that only the active constraints, those that are tight at the solution, influence it)
32
Q

Constrained optimization problem with boundary based anomaly detection (where KKT framework conditions apply):

A

L(ø, lambda) = f(ø) + ∑( lambda_i * gi(ø) )

f(ø) is the objective function we want to minimize (S + C·∑Ei) and gi(ø) is one constraint function per constrained quantity (e.g., weight, height).

gi(ø) = h(xi, c) - S - Ei
- h = distance between the point and the center (it must be no greater than the size of the shape plus that point’s slack)

Lambda is the Lagrange multiplier (the weight applied to balance the objective function and the constraints).

33
Q

Lagrange Dual approach to performing dual optimization of L(ø, lambda) in boundary based anomaly detection:

A

First, minimize L with respect to ø (finding the minimum shape size for the given multipliers)

Then, maximize with respect to lambda (Lagrange multipliers) to ensure all constraints are fulfilled

34
Q

Lagrange Primal Approach to performing optimization of boundary based anomaly detection:

A

Solve for ø directly, aiming to minimize f(ø) while keeping the constraints (KKT conditions) satisfied along the way.

35
Q

SVDD: Support Vector Data Description

A

A form of boundary based anomaly detection that uses a spherical boundary to encapsulate normal data.

36
Q

Why solve the Dual Optimization Problem instead of the Primal for SVDD?

A

The primal optimization problem has 1 + N + d parameters, whereas the dual only has the alphas (at most N Lagrange multipliers).

N = number of data points
d = number of dimensions (features)
a = the non-zero lambdas (Lagrange multipliers, i.e., weights) that represent the influence of each data point on the boundary (a is non-zero only for points close to the boundary)

More parameters = more computationally expensive

37
Q

What parameters can be recovered from the Dual Optimization problem for SVDD?

A

The center of the hypersphere can be recovered from the KKT stationarity condition dL/dc = 0. This implies c = ∑(ai·xi), which is a weighted average of the points in the dataset.

The radius of the hypersphere (R, where R^2 = S) can be recovered from the KKT complementary slackness condition: R = ||c - xi|| for any support vector xi lying on the boundary.

38
Q

What is better (more robust) for skewed data: Mahalanobis (mean vector) or SVDD (center of hypersphere)?

A

Support vector data description

This is due to the ability of SVDD to focus mainly on the border of the distribution and not the internal distribution within the borders.

39
Q

On a high-dimensional image, which would be more blurry: Mahalanobis (mean vector) or SVDD (center of hypersphere)?

A

SVDD is more blurry.

This is due to focusing on border of distribution, which creates a softer boundary around data points compared to the tighter boundary that Mahalanobis distance based methods establish.

40
Q

Weakness of SVDD besides its slight blurriness (in images)

A

The boundary of SVDD is greatly distorted by strong outliers, so the algorithm must be robustified to prevent such distortions.

41
Q

Why is SVDD not naturally robust (in formula)?

A

Ei = ||xi - c||^2 - S, where Ei is the slack.

The penalty (slack) grows quadratically with the distance of the point xi from the center c. Therefore, the further an outlier is from the center, the more strongly the center is pulled towards that outlier.

42
Q

Formula to Robustify SVDD

A

Min (R + ( C * ∑max( 0, ||xi - c|| - R ) ) )

Now the penalty (Ei = ||xi - c|| - R) grows linearly (rather than quadratically) with the distance of the data point from the center.

43
Q

Cluster Dispersion

A

Data points are grouped closely together in certain areas, with gaps or low-density regions in between.

Pattern often results from factors that lead to groupings or clusters (social behaviors, resource availability, habitat suitability, etc)

44
Q

The Centroid Model (K-Means Clustering)

A

Each cluster is represented by a centroid µ that lies in the same d-dimensional space as the data. The members of the cluster (the data points assigned to it) are typically close to the centroid.

K-Means is the most well-known centroid-based clustering technique.

45
Q

How the K-Means algorithm works

A

Partition the dataset into K clusters by minimizing the variance (dispersion) within each cluster, in the following steps (see the sketch after this list):

  1. Random initialization (choose K initial cluster centroids)
  2. Assign each data point to the nearest centroid’s cluster
  3. Update the centroid of each cluster to be the mean of all points currently assigned to that cluster
  4. Repeat steps 2 and 3 until convergence (the centroids no longer move and the cluster assignments no longer change)
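A from-scratch NumPy sketch of these steps (K, the iteration cap, and the seeding are illustrative; empty clusters are not handled; sklearn.cluster.KMeans is the usual library route):

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), K, replace=False)]          # 1. random initialization
        for _ in range(n_iter):
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)                            # 2. assign to nearest centroid
            new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centroids, centroids):                # 4. stop once centroids no longer move
                break
            centroids = new_centroids                                # 3. update each centroid to its cluster mean
        return centroids, labels

    X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5])
    centroids, labels = kmeans(X, K=2)
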
46
Q

Goal of K-Means Algorithm

A

Minimize the objective function (or cost function), which is the sum of squared distances between each data point and the centroid of its assigned cluster.

47
Q

Calinski-Harabasz Index (K-Means)

A

(Between-Cluster Dispersion / Within-Cluster Dispersion) * ((N - K) / (K - 1)), where N is the number of data points and K the number of clusters.

The index can be used to evaluate the quality of a clustering solution: the higher the index value, the better the clustering.
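A sketch using scikit-learn (the toy blobs and the range of K values tried are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score

    X = np.vstack([np.random.randn(100, 2) + offset for offset in ([0, 0], [6, 0], [0, 6])])

    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, calinski_harabasz_score(X, labels))     # higher index = better clustering quality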

48
Q

Alternative approach to choose the number of clusters (K) in K-Means besides Calinski Harabasz Index:

A

Explained variance, which measures how much of the data’s total variance is captured by the clustering solution.

This involves computing the explained variance for a range of K values and looking for an inflection point in the resulting curve (called the elbow method).

49
Q

Limits of K-Means

A
  • The initialization point matters for the end result (it is advisable to run the K-means algorithm several times with different starting points)
  • It focuses on minimizing the variance within clusters but is not good at maximizing the variance between clusters
50
Q

Spectral Clustering

A
  1. Build a graph of the data points ->
  2. Construct the Laplacian matrix (which describes the graph) from the adjacency matrix and the degree matrix ->
  3. Extract its eigenvalues and eigenvectors ->
  4. Choose the smallest eigenvalues and their eigenvectors to extract the most important information about the graph structure ->
  5. Use the selected eigenvectors as features for the data points (the number of eigenvalues selected is the number of clusters) ->
  6. Run a clustering algorithm (e.g., K-means) on the new features to group the data points into clusters (see the sketch below)
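A from-scratch sketch of those six steps; the RBF affinity, its width gamma, and the unnormalized Laplacian are assumptions (sklearn.cluster.SpectralClustering packages the same idea):

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clustering(X, n_clusters, gamma=1.0):
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
        A = np.exp(-gamma * sq_dists)                    # 1. graph: RBF adjacency (affinity) matrix
        np.fill_diagonal(A, 0)
        D = np.diag(A.sum(axis=1))                       # degree matrix
        L = D - A                                        # 2. (unnormalized) graph Laplacian
        eigvals, eigvecs = np.linalg.eigh(L)             # 3. eigendecomposition, eigenvalues ascending
        features = eigvecs[:, :n_clusters]               # 4-5. eigenvectors of the smallest eigenvalues as features
        return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)   # 6.

    X = np.vstack([np.random.randn(80, 2), np.random.randn(80, 2) + 6])
    labels = spectral_clustering(X, n_clusters=2)
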
51
Q

When is it not true that, for spectral clustering, the number of zero eigenvalues equals the number of clusters?

A

When there is noise in the data. The non-noise scenario is simply the formal, theoretical scenario.

52
Q

Cheeger’s Inequality

A

The conductance of a graph is related to how separate or different its clusters are.

If the second smallest eigenvalue of the graph Laplacian is low, then the conductance is also low, which means the graph is easy to break into clusters.

High conductance means the graph is hard to break into clusters.

53
Q

Spectral Clustering Strength

A

Better focus on the border of the cluster than a direct application of K-means (can be robust to noise)

54
Q

Spectral Clustering Weaknesses

A

There are several parameters to tune (how to build the adjacency matrix, which Laplacian to use, how many eigenvectors to keep).

Clusters do not come with a prototype (centroid or mean), which makes each cluster difficult to interpret.