4 Models of Correlations (CCA, Regression, Fisher) Flashcards

1
Q

Pearson’s Correlation

A

ρ = Cov(x, y) / (std(x) * std(y)); ρ is always between -1 and 1.

+1 means perfectly positively correlated, -1 perfectly negatively correlated, and 0 means no linear correlation
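
A minimal sketch (assuming NumPy, with synthetic data) that computes the formula directly and compares it with np.corrcoef:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)

# Pearson's correlation from its definition: Cov(x, y) / (std(x) * std(y))
rho = np.cov(x, y, bias=True)[0, 1] / (np.std(x) * np.std(y))
print(rho, np.corrcoef(x, y)[0, 1])   # the two values agree
```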

2
Q

Limit of Pearson’s Correlation

A

Can only detect certain types of correlations (linear correlations)

3
Q

Mutual Probability

A

The probability of two or more events happening at the same time (i.e., their joint probability)

4
Q

Is there any relation between Pearson’s correlation and mutual probability?

A

Yes, there is a direct relation.

min over β of E[(y - βx)^2] = Var(y) * (1 - ρ^2), where ρ is Pearson's correlation
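
A quick numerical check of this identity (a minimal sketch with synthetic, centered data; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10000)
y = 0.7 * x + rng.normal(size=10000)               # correlated target
x, y = x - x.mean(), y - y.mean()                  # center the data

beta = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # optimal beta = Cov(x, y) / Var(x)
rho = np.corrcoef(x, y)[0, 1]                      # Pearson's correlation

mse = np.mean((y - beta * x) ** 2)                 # residual prediction error
print(mse, np.var(y) * (1 - rho ** 2))             # the two values match closely
```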

5
Q

Residual Prediction Error Min E[(y - ßx)^2] meaning

A

We try to minimize the residual prediction error (RPE): the mismatch between y and βx, where we predict y from x using β.

x is the independent variable and y the dependent variable. β is the parameter (or coefficient) that defines the relationship between x and y. A small RPE means β fits the data well.

6
Q

Canonical Correlation Analysis (CCA)

A

CCA is a statistical method used to understand the relationship between two sets of variables. It measures and identifies associations between two multivariate datasets.

(Ex: correlation between text and images. Both are multivariate: composed of many words or visual features)

The eigenvectors wx and wy are the canonical weights for the two sets of variables

The first (largest) eigenvalue represents the correlation, denoted ρ, between the two sets in their new projected space (the space defined by the canonical variables)

7
Q

Solution of CCA

A

The CCA solution is a generalized eigenvalue problem:

[ 0    Cxy ] [wx]        [ Cxx   0  ] [wx]
[ Cyx   0  ] [wy]  =  λ  [  0   Cyy ] [wy]

The eigenvectors wx and wy are the canonical weights for the two sets of variables

The first (largest) eigenvalue represents the correlation, denoted ρ, between the two sets in their new projected space (the space defined by the canonical variables)

Cxx is the covariance matrix of x (variables within set x); Cyy and Cxy are defined analogously.
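
A minimal sketch (assuming NumPy/SciPy, with synthetic data) that builds the block matrices and solves the generalized eigenvalue problem directly:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
N, dx, dy = 500, 3, 2
X = rng.normal(size=(N, dx))
Y = X[:, :dy] + 0.5 * rng.normal(size=(N, dy))    # Y shares structure with X
X, Y = X - X.mean(0), Y - Y.mean(0)               # center both sets

C = np.cov(np.hstack([X, Y]).T)                   # joint covariance of all variables
Cxx, Cyy, Cxy = C[:dx, :dx], C[dx:, dx:], C[:dx, dx:]

# A holds the cross-covariances, B the within-set covariances
A = np.block([[np.zeros((dx, dx)), Cxy], [Cxy.T, np.zeros((dy, dy))]])
B = np.block([[Cxx, np.zeros((dx, dy))], [np.zeros((dy, dx)), Cyy]])
# B += 1e-6 * np.eye(dx + dy)                     # optional regularization (see the later card)

# Generalized eigenvalue problem A w = lambda B w; eigh returns eigenvalues in ascending order
vals, vecs = eigh(A, B)
rho = vals[-1]                                    # largest eigenvalue = canonical correlation
wx, wy = vecs[:dx, -1], vecs[dx:, -1]             # canonical weights
print(rho, np.corrcoef(X @ wx, Y @ wy)[0, 1])     # the two numbers agree
```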

8
Q

What to use to find Nonlinear Correlations

A

CCA can be used not only for linear correlations between two multivariate sets but also for nonlinear ones.

9
Q

Lagrangian Multipliers

A

Method to find the maximum or minimum (optimum) of a function while satisfying a constraint, by ensuring the gradients of the function and the constraint are proportional

(Ex: finding max profit but with a budget constraint)

L(θ, λ) = f(θ) + λ·g(θ); this is "the Lagrangian" (for a constraint written as g(θ) = 0)

To find the optimum (min or max), set the gradient of the Lagrangian to zero: ∇L(θ, λ) = 0
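
A small worked example (a sketch using SymPy; the objective and constraint are illustrative): maximize f(x, y) = x·y subject to x + y = 1.

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam')
f = x * y                       # objective
g = x + y - 1                   # constraint, written as g = 0
L = f + lam * g                 # the Lagrangian

# Set the gradient of L (w.r.t. x, y and lambda) to zero and solve
sols = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(sols)                     # x = 1/2, y = 1/2, lam = -1/2
```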

10
Q

In CCA which eigenvalue is chosen and how?

A

Eigenvalue with the largest value (max)

Eigenvalue = Pearson correlation ρ = max Corr(wx†x, wy†y) =
max (wx†·Cxy·wy) / ( √(wx†·Cxx·wx) · √(wy†·Cyy·wy) )

With the normalization constraint wx†·Cxx·wx = 1 (same for y):

Therefore, max (wx†·Cxy·wy) = eigenvalue = ρ

11
Q

Regularization

A

Fixes the instability that arises when eigenvalues of B (the self-correlation matrix of each dataset) are near zero, by adding a small diagonal term ϵI (acting like added noise). Instability means we can't trust the directions [wx, wy].

B <- B + ϵI

B^-1·A·w = λ·w; here we replace B with B + ϵI to improve stability.

B: self-correlation matrix (within each dataset)
A: cross-correlation matrix (between datasets)
w = [wx; wy]: the directions we want to learn for projecting the datasets (these are the eigenvectors)

12
Q

How to deal with CCA in high dimensions, where d > N (features > instances).

A

Instead of solving it as dxd (features x features), we work in NxN (instances x instances) space

The generalized eigenvalue problem is solved by expressing the directions as weighted combinations of the data:

wx = ∑i ai·(xi - µx) = X·a
wy = ∑i ai·(yi - µy) = Y·a

Optimal (Xa, Ya) = optimal (wx, wy): the optimal directions for modelling the relationship between the datasets are the same.

13
Q

Temporal CCA

A

Temporal CCA (Canonical Correlation Analysis) is a method that extends standard CCA by incorporating time-lagged versions of the datasets to maximize correlations not only between variables in the same time frame but also across different time steps, capturing temporal dependencies and dynamics in the relationships.

14
Q

Least Squares Regression

A

Least Squares Regression is a method to find the best-fit line for predicting a target y from input x, by minimizing the average squared difference between the actual and predicted values. (Minimized square difference between f(x) and y)

f(x) = w†x + b (w is the weight vector, x the input, and b the bias)
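
A minimal sketch (assuming NumPy, with synthetic data) that fits w and b in closed form via np.linalg.lstsq:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.1 * rng.normal(size=200)

# Append a column of ones so the bias b is learned together with the weights w
X1 = np.hstack([x, np.ones((200, 1))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)   # minimizes the mean squared error
w, b = coef[:-1], coef[-1]
print(w, b)                                     # close to [1, -2, 0.5] and 0.3
```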

15
Q

What is bias b in a linear model?

A

Bias b in a linear model is a constant term that allows the model's predictions to shift up or down independently of the input x

16
Q

What is best bias for centered data in linear model?

A

b = 0.

Centered means E[x] = 0 and E[y] = 0.

So f(x) = w†x

17
Q

Mean Squared Error Formula (non optimal)

A

E[(w†x - y)^2] = MSE

18
Q

Optimal weight vector that minimizes the mean squared error (f(x) - y in linear model)

A

w = Cxx^-1 · Cxy

19
Q

Optimal MSE formula (MSE at optimum)

A

Optimal MSE = Cyy - Cyx · Cxx^-1 · Cxy, obtained by plugging in the optimal weight w = Cxx^-1 · Cxy
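
A quick numerical check of both formulas (a sketch with synthetic, centered data; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5000, 3
x = rng.normal(size=(N, d))
y = x @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=N)
x, y = x - x.mean(0), y - y.mean()          # centered data, so no bias term is needed

Cxx = x.T @ x / N                           # covariance of x
Cxy = x.T @ y / N                           # cross-covariance of x and y
Cyy = y @ y / N                             # variance of y

w = np.linalg.solve(Cxx, Cxy)               # optimal weights  w = Cxx^-1 Cxy
mse = np.mean((x @ w - y) ** 2)
print(mse, Cyy - Cxy @ w)                   # optimal MSE = Cyy - Cyx Cxx^-1 Cxy
```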

20
Q

Relationship between CCA and Linear Regression in terms of Variance of Target Variable y

A

Cyy = Cyy·ρ^2 + E

Cyy·ρ^2: the variance explained by the model (maximized by CCA)

E: the mean squared error (minimized by linear regression)

Cyy: the variance of the target variable y (a covariance matrix when y is multidimensional)

21
Q

How to deal with a non-invertible covariance matrix Cxx when finding the optimal MSE in a linear regression model: Optimal MSE = Cyy - Cyx · Cxx^-1 · Cxy

A

A non-invertible Cxx happens when features are highly correlated or when there are fewer samples than features.

The solution is to add a small amount of uncorrelated noise n to the data, which makes Cxx invertible.

So Cxx becomes C(x+n, x+n) (= Cxx + σ²I for noise of variance σ² on each feature)

Use the minimum noise possible (just enough to make Cxx invertible), because more noise means larger error
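
A small sketch (NumPy) of the failure mode and the fix, here written equivalently as adding a small diagonal term to Cxx:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 50                                 # fewer samples than features
x = rng.normal(size=(N, d))
y = x.sum(axis=1) + 0.1 * rng.normal(size=N)
x, y = x - x.mean(0), y - y.mean()

Cxx = x.T @ x / N
Cxy = x.T @ y / N
print(np.linalg.matrix_rank(Cxx), d)          # rank < d, so Cxx is not invertible

Cxx_reg = Cxx + 1e-3 * np.eye(d)              # small diagonal term, like adding tiny uncorrelated noise
w = np.linalg.solve(Cxx_reg, Cxy)             # the system can now be solved
```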

22
Q

What to do for calculation of least squared error for high dimensional data

A

For high-dimensional data, represent w as w=Xα, where X is the centered data matrix, and reformulate the objective using N×N matrices (Q^2x and Qx ) to avoid computing or inverting the large d×d covariance matrix Cxx . This reduces computational complexity when N≪d

23
Q

How to reduce linear regression model sensitivity to outliers

A

Reduced sensitivity to outliers can be achieved by using absolute deviations instead of squared errors, and invariance to small noise can be maintained by introducing a small slack ϵ.

Instead of min E[(w†x - y)^2], we use min E[max(0, |w†x + b - y| - ϵ)]. We must reintroduce the bias b because we are no longer guaranteed that the solution without a bias is optimal.

If there are potentially multiple solutions that solve the problem exactly (especially when d > N), we introduce a small penalty term λ·||w||^2 that favors the flattest solution:

min E[max(0, |w†x + b - y| - ϵ)] + λ·||w||^2
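
A minimal sketch (NumPy) of this objective for a given w and b (the data and parameter values are illustrative):

```python
import numpy as np

def objective(w, b, x, y, eps=0.1, lam=0.01):
    """Mean epsilon-insensitive loss plus the flatness penalty lambda * ||w||^2."""
    residual = np.abs(x @ w + b - y)
    loss = np.maximum(0.0, residual - eps)    # errors inside the epsilon tube cost nothing
    return loss.mean() + lam * np.dot(w, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = x @ np.array([1.0, -1.0]) + 0.05 * rng.normal(size=100)
print(objective(np.array([1.0, -1.0]), 0.0, x, y))
```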

24
Q

Support Vector Regression (SVR)

A

Support Vector Regression (SVR) aims to find a function f(x)=w⊤x+b that predicts y with minimal error while being robust to noise. The objective is to keep prediction errors within a margin ϵ, allowing some slack for points outside the margin.

Primal solves directly for weights w and bias b (low dimensional). Dual solves for Lagrange multipliers a, a* (high dimensional).
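
In practice a library solver can be used; a minimal sketch assuming scikit-learn (the data and parameter values are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 0.5, -2.0]) + 0.1 * rng.normal(size=200)

# epsilon is the width of the error margin, C controls how strongly slack is penalized
model = SVR(kernel="linear", epsilon=0.1, C=1.0).fit(X, y)
print(model.coef_, model.intercept_)          # the learned w and b
```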

25
Q

Discriminant Learning (Difference of Means)

A

Find a unit-norm vector w that maximizes the difference between two groups of data points (G1, G2) when the data is projected onto w (on average).

max w†(µ1 - µ2)

Optimal w = (µ1 - µ2) / ||µ1 - µ2|| (the difference of means, normalized to unit length)

26
Q

Fisher Discriminant

A

The Fisher Discriminant is a method used to find a line (or a direction in higher dimensions) that best separates two groups of data by maximizing the distance between the groups while minimizing the spread (variance) within each group.

Fisher maximizes the ratio: J(w) = between-class separation / within-class spread = (w†(µ1 - µ2))^2 / (w†·S·w)

Optimal w = S^-1·(µ1 - µ2)

S = S1 + S2 = within-class scatter (variance) of G1 + that of G2

Very sensitive to outliers (they skew the means and inflate the within-class scatter)
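
A minimal sketch (NumPy, with synthetic 2-D groups) computing the Fisher direction in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
G1 = rng.normal(loc=[0.0, 0.0], size=(100, 2))
G2 = rng.normal(loc=[2.0, 1.0], size=(100, 2))

mu1, mu2 = G1.mean(0), G2.mean(0)
S1 = np.cov(G1.T, bias=True) * len(G1)        # within-class scatter of G1
S2 = np.cov(G2.T, bias=True) * len(G2)        # within-class scatter of G2
S = S1 + S2

w = np.linalg.solve(S, mu1 - mu2)             # optimal w = S^-1 (mu1 - mu2)
print(w / np.linalg.norm(w))                  # direction that best separates the groups
```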

27
Q

Connection to CCA from Fisher Discriminant

A

The vectors (CCA’s wx and Fisher’s w) produced by Fisher Discriminant and CCA are the same in binary classification problems because both methods ultimately rely on the difference between group means and within-group variance to find the optimal direction for separating the two groups.

Binary means two possible classes (e.g., 1/0 or positive/negative)

28
Q

Support Vector Machine

A

A Support Vector Machine (SVM) is a machine learning algorithm used for classification (and sometimes regression) that works by finding the best boundary (hyperplane) that separates data into different classes.

SVM finds a hyperplane that separates the two groups, but it does so by maximizing the margin between them.

It allows for some misclassification (outliers) through slack variables ξi, and it penalizes absolute deviations rather than square errors.

The optimization problem aims to find the hyperplane with the largest margin while allowing some misclassification based on the regularization parameter C.
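
A minimal sketch assuming scikit-learn (the data and the value of C are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0.0, 0.0], size=(100, 2)),
               rng.normal(loc=[2.0, 2.0], size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# C is the regularization parameter: small C tolerates more slack (misclassification),
# large C penalizes it heavily
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)              # the separating hyperplane w and b
```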

29
Q

SVM vs Fisher (when to use)

A

SVM deals with outliers better. However, Fisher is better if we care about within-group homogeneity. Fisher is also connected to CCA, which can be interpreted as maximizing correlation (SVM isn't).

Within-group homogeneity refers to the degree of similarity or consistency within each group in terms of the data points. In other words, it measures how tightly the data points in each group cluster around the group’s mean.