4 Models of Correlations (CCA, Regression, Fisher) Flashcards

1
Q

Pearson’s Correlation

A

ρ = Cov(x, y) / (std(x) * std(y)); ρ is always between -1 and 1.

+1 means perfectly positively correlated, -1 perfectly negatively correlated, and 0 means no linear correlation
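
A minimal sketch (assuming NumPy, with synthetic data) that computes the formula directly and compares it with np.corrcoef:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)

# Pearson's correlation from its definition: Cov(x, y) / (std(x) * std(y))
rho = np.cov(x, y, bias=True)[0, 1] / (np.std(x) * np.std(y))
print(rho, np.corrcoef(x, y)[0, 1])   # the two values agree
```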

2
Q

Limit of Pearson’s Correlation

A

Can only detect certain types of correlations (linear correlations)

3
Q

Mutual Probability

A

The probability of two or more events happening at the same time (i.e., their joint probability)

4
Q

Is there any relation between Pearson’s correlation and mutual probability?

A

Yes, there is a direct relation.

min over β of E[(y - βx)^2] = Var(y) * (1 - ρ^2), where ρ is Pearson's correlation
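
A quick numerical check of this identity (a minimal sketch with synthetic, centered data; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10000)
y = 0.7 * x + rng.normal(size=10000)               # correlated target
x, y = x - x.mean(), y - y.mean()                  # center the data

beta = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # optimal beta = Cov(x, y) / Var(x)
rho = np.corrcoef(x, y)[0, 1]                      # Pearson's correlation

mse = np.mean((y - beta * x) ** 2)                 # residual prediction error
print(mse, np.var(y) * (1 - rho ** 2))             # the two values match closely
```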

5
Q

Residual Prediction Error Min E[(y - ßx)^2] meaning

A

We try to minimize the residual prediction error (RPE): the mismatch between y and βx, where we predict y from x using β.

x is the independent variable and y the dependent variable. β is the parameter (or coefficient) that defines the relationship between x and y. A small RPE means β fits the data well.

6
Q

Canonical Correlation Analysis (CCA)

A

CCA is a statistical method used to understand the relationship between two sets of variables. It measures and identifies associations between two multivariate datasets.

(Ex: correlation between text and images. Both are multivariate: composed of many words or visual features)

The eigenvectors wx and wy are the canonical weights for the two sets of variables

The first (largest) eigenvalue represents the correlation, denoted ρ, between the two sets in their new projected space (the space defined by the canonical variables)

7
Q

Solution of CCA

A

The CCA solution is a generalized eigenvalue problem:

[ 0    Cxy ] [wx]        [ Cxx   0  ] [wx]
[ Cyx   0  ] [wy]  =  λ  [  0   Cyy ] [wy]

The eigenvectors wx and wy are the canonical weights for the two sets of variables

The first (largest) eigenvalue represents the correlation, denoted ρ, between the two sets in their new projected space (the space defined by the canonical variables)

Cxx is the covariance matrix of x (variables within set x); Cyy and Cxy are defined analogously.
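
A minimal sketch (assuming NumPy/SciPy, with synthetic data) that builds the block matrices and solves the generalized eigenvalue problem directly:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
N, dx, dy = 500, 3, 2
X = rng.normal(size=(N, dx))
Y = X[:, :dy] + 0.5 * rng.normal(size=(N, dy))    # Y shares structure with X
X, Y = X - X.mean(0), Y - Y.mean(0)               # center both sets

C = np.cov(np.hstack([X, Y]).T)                   # joint covariance of all variables
Cxx, Cyy, Cxy = C[:dx, :dx], C[dx:, dx:], C[:dx, dx:]

# A holds the cross-covariances, B the within-set covariances
A = np.block([[np.zeros((dx, dx)), Cxy], [Cxy.T, np.zeros((dy, dy))]])
B = np.block([[Cxx, np.zeros((dx, dy))], [np.zeros((dy, dx)), Cyy]])
# B += 1e-6 * np.eye(dx + dy)                     # optional regularization (see the later card)

# Generalized eigenvalue problem A w = lambda B w; eigh returns eigenvalues in ascending order
vals, vecs = eigh(A, B)
rho = vals[-1]                                    # largest eigenvalue = canonical correlation
wx, wy = vecs[:dx, -1], vecs[dx:, -1]             # canonical weights
print(rho, np.corrcoef(X @ wx, Y @ wy)[0, 1])     # the two numbers agree
```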

8
Q

What to use to find Nonlinear Correlations

A

CCA can be used not only for linear correlations between two multivariate sets but also for nonlinear ones.

9
Q

Lagrangian Multipliers

A

Method to find the maximum or minimum (optimum) of a function while satisfying a constraint, by ensuring the gradients of the function and the constraint are proportional

(Ex: finding max profit but with a budget constraint)

L(θ, λ) = f(θ) + λ·g(θ); this is "the Lagrangian" (for a constraint written as g(θ) = 0)

To find the optimum (min or max), set the gradient of the Lagrangian to zero: ∇L(θ, λ) = 0
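
A small worked example (a sketch using SymPy; the objective and constraint are illustrative): maximize f(x, y) = x·y subject to x + y = 1.

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam')
f = x * y                       # objective
g = x + y - 1                   # constraint, written as g = 0
L = f + lam * g                 # the Lagrangian

# Set the gradient of L (w.r.t. x, y and lambda) to zero and solve
sols = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(sols)                     # x = 1/2, y = 1/2, lam = -1/2
```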

10
Q

In CCA which eigenvalue is chosen and how?

A

Eigenvalue with the largest value (max)

Eigenvalue = Pearson correlation ρ = max Corr(wx†x, wy†y) =
max (wx†·Cxy·wy) / ( √(wx†·Cxx·wx) · √(wy†·Cyy·wy) )

With the normalization constraint wx†·Cxx·wx = 1 (same for y):

Therefore, max (wx†·Cxy·wy) = eigenvalue = ρ

11
Q

Regularization

A

Fixes the instability that arises when eigenvalues of B (the self-correlation matrix of each dataset) are near zero, by adding a small diagonal term ϵI (acting like added noise). Instability means we can't trust the directions [wx, wy].

B <- B + ϵI

B^-1·A·w = λ·w; here we replace B with B + ϵI to improve stability.

B: self-correlation matrix (within each dataset)
A: cross-correlation matrix (between datasets)
w = [wx; wy]: the directions we want to learn for projecting the datasets (these are the eigenvectors)

12
Q

How to deal with CCA in high dimensions, where d > N (features > instances).

A

Instead of solving it as dxd (features x features), we work in NxN (instances x instances) space

The generalized eigenvalue problem is solved by expressing the directions as weighted combinations of the data:

wx = ∑i ai·(xi - µx) = X·a
wy = ∑i ai·(yi - µy) = Y·a

Optimal (Xa, Ya) = optimal (wx, wy): the optimal directions for modelling the relationship between the datasets are the same.

13
Q

Temporal CCA

A

Temporal CCA (Canonical Correlation Analysis) is a method that extends standard CCA by incorporating time-lagged versions of the datasets to maximize correlations not only between variables in the same time frame but also across different time steps, capturing temporal dependencies and dynamics in the relationships.

14
Q

Least Squares Regression

A

Least Squares Regression is a method to find the best-fit line for predicting a target y from input x, by minimizing the average squared difference between the actual and predicted values. (Minimized square difference between f(x) and y)

f(x) = w†x + b (w is the weight vector, x the input, and b the bias)
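
A minimal sketch (assuming NumPy, with synthetic data) that fits w and b in closed form via np.linalg.lstsq:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.1 * rng.normal(size=200)

# Append a column of ones so the bias b is learned together with the weights w
X1 = np.hstack([x, np.ones((200, 1))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)   # minimizes the mean squared error
w, b = coef[:-1], coef[-1]
print(w, b)                                     # close to [1, -2, 0.5] and 0.3
```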

15
Q

What is bias b in a linear model?

A

Bias b in a linear model is a constant term that allows the model's predictions to shift up or down independently of the input x

16
Q

What is best bias for centered data in linear model?

A

b = 0.

Centered means E[x] = 0 and E[y] = 0.

So f(x) = w†x

17
Q

Mean Squared Error Formula (non optimal)

A

E[(w†x - y)^2] = MSE

18
Q

Optimal weight vector that minimizes the mean squared error (f(x) - y in linear model)

A

w = Cxx^-1 · Cxy

19
Q

Optimal MSE formula (MSE at optimum)

A

Optimal MSE = Cyy - Cyx · Cxx^-1 · Cxy, obtained by plugging in the optimal weight w = Cxx^-1 · Cxy
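
A quick numerical check of both formulas (a sketch with synthetic, centered data; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5000, 3
x = rng.normal(size=(N, d))
y = x @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=N)
x, y = x - x.mean(0), y - y.mean()          # centered data, so no bias term is needed

Cxx = x.T @ x / N                           # covariance of x
Cxy = x.T @ y / N                           # cross-covariance of x and y
Cyy = y @ y / N                             # variance of y

w = np.linalg.solve(Cxx, Cxy)               # optimal weights  w = Cxx^-1 Cxy
mse = np.mean((x @ w - y) ** 2)
print(mse, Cyy - Cxy @ w)                   # optimal MSE = Cyy - Cyx Cxx^-1 Cxy
```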

20
Q

Relationship between CCA and Linear Regression in terms of Variance of Target Variable y

A

Cyy = Cyy·ρ^2 + E

Cyy·ρ^2: the variance explained by the model (maximized by CCA)

E: the mean squared error (minimized by linear regression)

Cyy: the variance of the target variable y (a covariance matrix when y is multidimensional)

21
Q

How to deal with a non-invertible covariance matrix Cxx when finding the optimal MSE in a linear regression model: Optimal MSE = Cyy - Cyx · Cxx^-1 · Cxy

A

A non-invertible Cxx happens when features are highly correlated or when there are fewer samples than features.

The solution is to add a small amount of uncorrelated noise n to the data, which makes Cxx invertible.

So Cxx becomes C(x+n, x+n) (= Cxx + σ²I for noise of variance σ² on each feature)

Use the minimum noise possible (just enough to make Cxx invertible), because more noise means larger error
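
A small sketch (NumPy) of the failure mode and the fix, here written equivalently as adding a small diagonal term to Cxx:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 50                                 # fewer samples than features
x = rng.normal(size=(N, d))
y = x.sum(axis=1) + 0.1 * rng.normal(size=N)
x, y = x - x.mean(0), y - y.mean()

Cxx = x.T @ x / N
Cxy = x.T @ y / N
print(np.linalg.matrix_rank(Cxx), d)          # rank < d, so Cxx is not invertible

Cxx_reg = Cxx + 1e-3 * np.eye(d)              # small diagonal term, like adding tiny uncorrelated noise
w = np.linalg.solve(Cxx_reg, Cxy)             # the system can now be solved
```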

22
Q

What to do for calculation of least squared error for high dimensional data

A

For high-dimensional data, represent w as w=Xα, where X is the centered data matrix, and reformulate the objective using N×N matrices (Q^2x and Qx ) to avoid computing or inverting the large d×d covariance matrix Cxx . This reduces computational complexity when N≪d

23
Q

How to reduce linear regression model sensitivity to outliers

A

Reduced sensitivity to outliers can be achieved by using absolute deviations instead of squared errors, and invariance to small noise can be maintained by introducing a small slack ϵ.

Instead of min E[(w†x - y)^2], we use min E[max(0, |w†x + b - y| - ϵ)]. We must reintroduce the bias b because we are no longer guaranteed that the solution without a bias is optimal.

If there are potentially multiple solutions that solve the problem exactly (especially when d > N), we introduce a small penalty term λ·||w||^2 that favors the flattest solution:

min E[max(0, |w†x + b - y| - ϵ)] + λ·||w||^2
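
A minimal sketch (NumPy) of this objective for a given w and b (the data and parameter values are illustrative):

```python
import numpy as np

def objective(w, b, x, y, eps=0.1, lam=0.01):
    """Mean epsilon-insensitive loss plus the flatness penalty lambda * ||w||^2."""
    residual = np.abs(x @ w + b - y)
    loss = np.maximum(0.0, residual - eps)    # errors inside the epsilon tube cost nothing
    return loss.mean() + lam * np.dot(w, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = x @ np.array([1.0, -1.0]) + 0.05 * rng.normal(size=100)
print(objective(np.array([1.0, -1.0]), 0.0, x, y))
```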

24
Q

Support Vector Regression (SVR)

A

Support Vector Regression (SVR) aims to find a function f(x)=w⊤x+b that predicts y with minimal error while being robust to noise. The objective is to keep prediction errors within a margin ϵ, allowing some slack for points outside the margin.

Primal solves directly for weights w and bias b (low dimensional). Dual solves for Lagrange multipliers a, a* (high dimensional).
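
In practice a library solver can be used; a minimal sketch assuming scikit-learn (the data and parameter values are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 0.5, -2.0]) + 0.1 * rng.normal(size=200)

# epsilon is the width of the error margin, C controls how strongly slack is penalized
model = SVR(kernel="linear", epsilon=0.1, C=1.0).fit(X, y)
print(model.coef_, model.intercept_)          # the learned w and b
```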

25
Q

Discriminant Learning (Difference of Means)

A

Find a unit-norm vector w that maximizes the difference between two groups of data points (G1, G2) when the data is projected onto w (on average).

max w†(µ1 - µ2)

Optimal w = (µ1 - µ2) / ||µ1 - µ2|| (the difference of means, normalized to unit length)

26
Q

Fisher Discriminant

A

The Fisher Discriminant is a method used to find a line (or a direction in higher dimensions) that best separates two groups of data by maximizing the distance between the groups while minimizing the spread (variance) within each group.

Fisher maximizes the ratio: J(w) = between-class separation / within-class spread = (w†(µ1 - µ2))^2 / (w†·S·w)

Optimal w = S^-1·(µ1 - µ2)

S = S1 + S2 = within-class scatter (variance) of G1 + that of G2

Very sensitive to outliers (they skew the means and inflate the within-class scatter)
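
A minimal sketch (NumPy, with synthetic 2-D groups) computing the Fisher direction in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
G1 = rng.normal(loc=[0.0, 0.0], size=(100, 2))
G2 = rng.normal(loc=[2.0, 1.0], size=(100, 2))

mu1, mu2 = G1.mean(0), G2.mean(0)
S1 = np.cov(G1.T, bias=True) * len(G1)        # within-class scatter of G1
S2 = np.cov(G2.T, bias=True) * len(G2)        # within-class scatter of G2
S = S1 + S2

w = np.linalg.solve(S, mu1 - mu2)             # optimal w = S^-1 (mu1 - mu2)
print(w / np.linalg.norm(w))                  # direction that best separates the groups
```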

27
Q

Connection to CCA from Fisher Discriminant

A

The vectors (CCA’s wx and Fisher’s w) produced by Fisher Discriminant and CCA are the same in binary classification problems because both methods ultimately rely on the difference between group means and within-group variance to find the optimal direction for separating the two groups.

Binary means two possible classes (e.g., 1/0 or positive/negative)

28
Q

Support Vector Machine

A

A Support Vector Machine (SVM) is a machine learning algorithm used for classification (and sometimes regression) that works by finding the best boundary (hyperplane) that separates data into different classes.

SVM finds a hyperplane that separates the two groups, but it does so by maximizing the margin between them.

It allows for some misclassification (outliers) through slack variables ξi, and it penalizes absolute deviations rather than square errors.

The optimization problem aims to find the hyperplane with the largest margin while allowing some misclassification based on the regularization parameter C.
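
A minimal sketch assuming scikit-learn (the data and the value of C are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0.0, 0.0], size=(100, 2)),
               rng.normal(loc=[2.0, 2.0], size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# C is the regularization parameter: small C tolerates more slack (misclassification),
# large C penalizes it heavily
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)              # the separating hyperplane w and b
```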

29
Q

SVM vs Fisher (when to use)

A

SVM deals with outliers better. However, Fisher is better if we care about within-group homogeneity. Fisher is also connected to CCA, which can be interpreted as maximizing correlation (SVM isn't).

Within-group homogeneity refers to the degree of similarity or consistency within each group in terms of the data points. In other words, it measures how tightly the data points in each group cluster around the group’s mean.