11 | DW-1 | PCA Flashcards

1
Q

(QUIZ 6)
PCA was invented by ______ in ______.

A

Karl Pearson, 1901

2
Q

(QUIZ 4)
The goal of PCA is to replace a ______ number of ______ variables with a ______ number of ______ variables while capturing as much information in the ______ variables as possible. Principal components are ______ combinations of the ______ variables. PC1 is the ______ combination of the k observed variables that accounts for most of the variance in the original set of variables. PC2 is ______ to PC1.

A

The goal of PCA is to replace a large number of correlated variables with a small number of uncorrelated variables while capturing as much information in the original variables as possible. Principal components are linear combinations of the observed variables. PC1 is the weighted combination of the k observed variables that accounts for most of the variance in the original set of variables. PC2 is orthogonal to PC1.
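A minimal R sketch of this point, using the iris data that appears later in the deck (prcomp is assumed as the PCA routine):
~~~
# PC scores are linear combinations of the centred original variables
data(iris)
pca <- prcomp(iris[, 1:4])                          # unscaled PCA, as used later in this deck
Xc  <- scale(iris[, 1:4], center = TRUE, scale = FALSE)
pc1 <- Xc %*% pca$rotation[, 1]                     # weighted (linear) combination = PC1 scores
all.equal(as.numeric(pc1), as.numeric(pca$x[, 1]))  # TRUE
~~~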

3
Q

(QUIZ 4)
The amount of variance kept in the PCs can be visualized in a ______. To decide how many components to investigate, we usually look at the ______ variance. A common threshold is the 90% limit: we use as many components as are needed to reach this limit. The positions of the samples in this new coordinate system are visualized in a so-called ______. If we would like to show both the positions of the samples and the ______ of the ______ variables, we can use a ______.

A

The amount of variance kept in the PCs can be visualized in a scree plot. To decide how many components to investigate, we usually look at the cumulative variance. A common threshold is the 90% limit: we use as many components as are needed to reach this limit. The positions of the samples in this new coordinate system are visualized in a so-called score plot. If we would like to show both the positions of the samples and the correlations of the original variables, we can use a biplot.
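A minimal R sketch of these plots (base R function names; the 90% cut-off is read off the cumulative variance):
~~~
data(iris)
pca <- prcomp(iris[, 1:4])
screeplot(pca, type = "lines")               # variance kept per PC
cumsum(pca$sdev^2) / sum(pca$sdev^2)         # cumulative variance -> keep PCs up to ~90%
plot(pca$x[, 1], pca$x[, 2],                 # score plot: samples in the new coordinate system
     xlab = "PC1", ylab = "PC2")
biplot(pca)                                  # samples plus loadings of the original variables
~~~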

4
Q

(QUIZ 4)
Please interpret the results:
~~~
> data(iris)
> pca=prcomp(iris[,1:4])
> summary(pca)
Importance of components:
                          PC1     PC2    PC3     PC4
Standard deviation     2.0563 0.49262 0.2797 0.15439
Proportion of Variance 0.9246 0.05307 0.0171 0.00521
Cumulative Proportion  0.9246 0.97769 0.9948 1.00000
> pca$rotation
                     PC1         PC2         PC3        PC4
Sepal.Length  0.36138659 -0.65658877  0.58202985  0.3154872
Sepal.Width  -0.08452251 -0.73016143 -0.59791083 -0.3197231
Petal.Length  0.85667061  0.17337266 -0.07623608 -0.4798390
Petal.Width   0.35828920  0.07548102 -0.54583143  0.7536574
~~~
The major variance is in the first ______ component(s). The variable contributing most to the first component is ______; the second component is the ______ component.
The third component is ______ as it contributes ______ to the total variance.

A

The major variance is in the first (one) component. The variable contributing most to the first component is Petal.Length; the second component is the Sepal component.
The third component is not important, as it contributes less (Detlef said more) than 5% to the total variance.
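A small check of this interpretation, continuing from the prcomp call in the question:
~~~
summary(pca)$importance["Cumulative Proportion", ]  # PC1 alone explains ~92% of the variance
which.max(abs(pca$rotation[, "PC1"]))               # Petal.Length: largest loading on PC1
pca$rotation[, "PC2"]                               # PC2 is dominated by the Sepal variables
~~~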

5
Q

(QUIZ 4)
PCA is a ______ projection method. PCA will fail if ______ data are to be processed. In that case, ______ may be the better choice. To determine deviation from Gaussianity, ______ can be applied. An advantageous property is its application of a ______ distance metric.

A

PCA is a linear projection method. PCA will fail if non-Gaussian data are to be processed. In that case, ICA may be the better choice. To determine deviation from Gaussianity, kurtosis can be applied. An advantageous property is its application of a density-aware distance metric.
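A minimal base-R sketch of using excess kurtosis as a Gaussianity check; the formula here is the usual definition and is not taken from the slides:
~~~
# excess kurtosis: ~0 for Gaussian data, clearly non-zero otherwise
excess_kurtosis <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^4) - 3
}
excess_kurtosis(rnorm(1e5))   # close to 0   -> Gaussian, PCA is fine
excess_kurtosis(runif(1e5))   # around -1.2  -> non-Gaussian, ICA may be the better choice
~~~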

6
Q

Motivation for PCA / dimensionality reduction

A
  • How do the different samples group together?
  • Which molecules (genes, metabolites, … variables) are important with regard to sample separation, and which ones are noise only?
  • Which molecules show a correlated behavior and can thus be treated as one?
  • Is there a way to "view" the data in a meaningful way?
7
Q

What assumption is used for finding the (first) principal component?

A

  • The direction of greatest variance (σ²) captures the most relevant information about the system
  • This direction is called the first principal component

8
Q

Signal to Noise Ratio (SNR)

A
  • SNR = σ²(signal) / σ²(noise)
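A one-line illustration in R (the signal and noise vectors are made up for the example):
~~~
signal <- sin(seq(0, 10, length.out = 1000))   # illustrative signal
noise  <- rnorm(1000, sd = 0.1)                # illustrative noise
snr    <- var(signal) / var(noise)             # SNR = variance of signal / variance of noise
snr
~~~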
9
Q

PCA Qualitatively – first step?

A
  • Find the centroid (mean along all coordinates) = origin of the new basis
10
Q

PCA Qualitatively – second step?

A
  • Find the direction d along which the variance is maximal
11
Q

PCA Qualitatively – third step?

A
  • Find the direction of greatest variance in the plane that is perpendicular to d
12
Q

PCA Qualitatively – fourth step?

A
  • Repeat n times, where n is the number of original dimensions (here 3). (The last vector is determined by the orthogonality criterion.)
13
Q

PCA Qualitatively - How to express coordinate of every point using new basis

A
  • The new coordinates correspond to projections of the old coordinates onto the PCs. This is equivalent to rotating the old coordinate system around the centroid so that the directions of the old and new base vectors line up.
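A compact R sketch of the four steps plus the re-expression in the new basis; this mirrors what prcomp does internally, up to the sign of each eigenvector:
~~~
data(iris)
X        <- as.matrix(iris[, 1:4])
centroid <- colMeans(X)                    # step 1: centroid = origin of the new basis
Xc       <- sweep(X, 2, centroid)          # centre the data on the centroid
e        <- eigen(cov(Xc))                 # steps 2-4: directions of maximal variance
scores   <- Xc %*% e$vectors               # projection onto the PCs = rotated coordinates
all.equal(abs(scores), abs(prcomp(X)$x),   # same as prcomp, up to sign
          check.attributes = FALSE, tolerance = 1e-6)
~~~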
14
Q

What is a score plot?

A
  • Score Plot = plot of data points in new coordinate system
15
Q

Correlation between variables is _______

A
  • Redundancy. We don’t need both variables to know the position, just one
16
Q

Covariance is

A
  • The same construct as variance, but for two variables
  • Includes variance as the special case of a variable with itself
  • Not scaled, i.e. scale-dependent
17
Q

Variance is

A

  • The mean squared deviation of a single variable from its mean, i.e. its spread along one dimension

18
Q

Correlation is

A
  • Normalised covariance = the covariance scaled by the standard deviations (square roots of the variances)
  • This gives it the property of lying between -1 and 1
19
Q

Relation of cov and cor

A
  • Both measures capture redundancy
  • Correlation is what you would get if you standardise your original observations (Z-transformation = standardisation) and measuring the covariance
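A quick R check of this relationship (any two numeric variables would do):
~~~
x <- iris$Sepal.Length
y <- iris$Petal.Length
cor(x, y)                  # correlation
cov(scale(x), scale(y))    # covariance of the z-transformed variables: same value
~~~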
20
Q

Why the n-1 in the denominator of var / cov?

A

  • Bessel's correction: the sample mean is estimated from the same data, which costs one degree of freedom, so dividing by n-1 instead of n makes the variance/covariance estimate unbiased

21
Q

How do we decide if dimensions are redundant?

A
  • If the covariance / correlation is high, one of the dimensions is redundant
  • If there is real dispersion between the two variables (low correlation), we need both dimensions
22
Q

The covariance matrix?

A
  • The complete covariance matrix in 3D, C, is symmetric: the diagonal holds the variances of x, y and z, and the off-diagonal entries hold the pairwise covariances (see the cheat sheet for the annotated matrix)
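Written out (a reconstruction of the matrix the colours referred to: variances on the diagonal, pairwise covariances off the diagonal):
~~~
C = \begin{pmatrix}
      \mathrm{Var}(x)   & \mathrm{Cov}(x,y) & \mathrm{Cov}(x,z) \\
      \mathrm{Cov}(x,y) & \mathrm{Var}(y)   & \mathrm{Cov}(y,z) \\
      \mathrm{Cov}(x,z) & \mathrm{Cov}(y,z) & \mathrm{Var}(z)
    \end{pmatrix}
~~~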

23
Q

How do we use the covariance matrix to produce a more formal formulation of PCA?

A
  • We want to find a transformation P of the original coordinates, and thus a new coordinate system, for which the new covariance matrix C' is diagonal (see the cheat sheet)
  • Cov(X', Y') = Cor(X', Y') = 0 → redundancy is removed!
  • Var(X') >= 0, Var(Y') >= 0
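A quick R check that PCA indeed delivers such a transformation: the covariance matrix of the scores is diagonal:
~~~
pca <- prcomp(iris[, 1:4])
round(cov(pca$x), 10)   # off-diagonal entries are 0: Cov(PCi, PCj) = 0, redundancy removed
~~~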
24
Q

What do we know from linear algebra about a symmetrical matrix A and a matrix of eigenvectors of A?

A
  • A = E D E^T
  • A – any symmetric matrix
  • E – matrix of eigenvectors of A
  • D – diagonal matrix (all off-diagonal elements are zero!)
  • E^T – transpose of E (rows and columns swapped)
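A small R sketch of this decomposition, using a covariance matrix as the symmetric matrix A:
~~~
A  <- cov(iris[, 1:4])            # any symmetric matrix
ed <- eigen(A)
E  <- ed$vectors                  # eigenvectors of A (columns)
D  <- diag(ed$values)             # diagonal matrix of eigenvalues
all.equal(A, E %*% D %*% t(E), check.attributes = FALSE)   # A = E D E^T
~~~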
25
Q

Eigenvector, eigenvalue problem

A
  • An important equation in math and physics!
  • In words: "matrix times a vector equals a scalar times this vector"
  • C Ei = λi Ei   (C is the covariance matrix, Ei an eigenvector)
  • It has at most min(m, n-1) meaningful solutions (λi > 0)
  • The λi are the eigenvalues associated with the eigenvectors Ei
  • http://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf
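The same equation checked numerically in R for the first eigenpair of a covariance matrix:
~~~
C  <- cov(iris[, 1:4])
ev <- eigen(C)
all.equal(as.numeric(C %*% ev$vectors[, 1]),   # C * E1
          ev$values[1] * ev$vectors[, 1])      # lambda1 * E1
~~~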
26
Q

λ are ______ along the PCs. This means that ______.

A
  • Variances
  • Eigenvector with largest eigenvalue (variance) explains most of the total variance (=signal)
27
Q

Components of eigenvectors are the (unit-scaled) ______

A
  • Loadings
  • i.e. contributions of original variables to PC
28
Q

Length of eigenvectors =

A

  • 1 (eigenvectors are normalised to unit length)

29
Q

PCs/eigenvectors are ______ _________ of the original variables

A
  • Linear combinations
30
Q

What is explained variance?

A
  • = how much of the total variance is captured by a particular PC
31
Q

How many PCs to consider?

A
  • V_T = Σj λj = total variance
  • Keep as many PCs as are needed to explain a chosen fraction of V_T (e.g. the 90% limit mentioned above)
32
Q

Explained variance by PCi

A
  • = λi / V_T × 100%
  • If the PCA is based on the correlation matrix: PCs with λ > 1 are significant ("Kaiser-Harris criterion"), or PCs whose λ exceeds the λ obtained for randomized data ("parallel analysis")
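In R (the λ are the squared standard deviations returned by prcomp; scale. = TRUE corresponds to the correlation-matrix case):
~~~
pca    <- prcomp(iris[, 1:4], scale. = TRUE)   # PCA on the correlation matrix
lambda <- pca$sdev^2                           # eigenvalues = variances along the PCs
lambda / sum(lambda) * 100                     # explained variance per PC in %
which(lambda > 1)                              # Kaiser-Harris criterion: keep PCs with lambda > 1
~~~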
33
Q

Dimensionality reduction:

A
  • It is sufficient to consider PC1 coordinates only; i.e. projection of original points onto PC1
34
Q

General Goal of PCA:

A
  • Replace a large number of correlated variables with a smaller number of uncorrelated variables while capturing as much information in the original variables as possible.
35
Q

Scree plot?

A
  • Plot showing how much of the variance in the original data is explained by the different PCs
  • A Scree Plot is a simple line segment plot that shows the eigenvalues for each individual PC
36
Q

R Stuff
Where are loadings stored?

A

In pca_name$rotation

37
Q

R:
We have 300 genes and 80 samples
m=matrix(nrow=300,ncol=80)
>p=prcomp(m) # yields _____ in coordinates of _____

A

> p=prcomp(m) # yields genes in coordinates of samples

38
Q

R:
We have 300 genes and 80 samples
m=matrix(nrow=300,ncol=80)
>p=prcomp(t(m)) # yields _____ in coordinates of _____

A

> p=prcomp(t(m)) # yields samples in coordinates of genes
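
A hedged sketch checking the two cases, using random numbers instead of the empty matrix (prcomp cannot handle the NA-filled matrix above):
~~~
set.seed(1)
m <- matrix(rnorm(300 * 80), nrow = 300, ncol = 80)   # 300 genes x 80 samples
dim(prcomp(m)$x)      # 300 x 80: one score row per gene   (genes in sample-derived coordinates)
dim(prcomp(t(m))$x)   #  80 x 80: one score row per sample (samples in gene-derived coordinates)
~~~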