Metabolomics 5 - Basic statistics Flashcards
Omics Data Analysis
DATA PROCESSING & QC
Omics data
- NMR
- Mass spectrometry
STATISTICAL ANALYSIS AND VISUALIZATION
- Comparison, clustering, classification
FUNCTIONAL INTERPRETATION
Omics-specific
- enrichment analysis
- pathway analysis
UNIQUE FUNCTIONS
Field specific
- dose response
- biomarker analysis
Types of Metabolomics Data
Raw data (fingerprinting)
* No information on metabolites
* Use raw NMR spectra or MS data
* Long-time standard in NMR
* Goal: Derive classes and identify markers
* STOCSY for correlation tests
Metabolite concentrations (lists of compounds with concentration values)
* MS and NMR analyses now produce lists of metabolite concentrations
* Concentrations can be used for univariate tests
* Concentrations can also be used in very specific profiling
* Correlations and covariances are commonly used
Common Terms
Dimension
* The number of variables (metabolites, peaks)
Univariate:
* Analysing one variable per subject
Multivariate
- Analysing many variables per subject
- Omics data are usually high-dimensional data
Basic statistical terms
Mean
- synonyms: average
Median
- the value that one-half of the data lies above and below
- synonyms: 50th percentile
Variance
- the sum of squared deviations from the mean divided by n-1 where n is the number of data values
- synonyms: mean squared deviation
Order statistics
* Metrics based on the data values sorted from smallest to largest.
* Synonyms: ranks
Percentile
* The value such that P percent of the values take on this value or less and (100–P) percent take on this value or more.
* Synonyms: quantile
Interquartile range
- The difference between the 75th percentile and the 25th percentile.
- Synonyms: IQR
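A minimal sketch of these terms in Python (NumPy), using hypothetical concentration values:

```python
import numpy as np

# Hypothetical metabolite concentrations for one compound across 8 samples
x = np.array([4.1, 4.7, 5.0, 5.2, 5.5, 5.9, 6.1, 9.8])

mean = x.mean()                      # average
median = np.median(x)                # 50th percentile
variance = x.var(ddof=1)             # sum of squared deviations / (n - 1)
stddev = x.std(ddof=1)               # same units as the original data
q1, q3 = np.percentile(x, [25, 75])  # order statistics
iqr = q3 - q1                        # interquartile range (IQR)
```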
Variance
Why square?
- Eliminates negatives
- Parabolic behaviour: contribution increases the further a value lies from the mean
Standard Deviation:
- stddev = sqrt(variance)
- Shows variation about the mean, in the same units as the original data
The standard deviation is much easier to interpret than the variance since it is on the same scale as the original data. Still, with its more complicated and less intuitive formula, it might seem peculiar that the standard deviation is preferred in statistics over the mean absolute deviation. It owes its preeminence to statistical theory: mathematically, working with squared values is much more convenient than working with absolute values.
Why divide by (n-1), not n?
- If you knew the sample mean and n-1 of the data values, the last value would be fixed: only n-1 values are free to vary
=> n-1 degrees of freedom
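A short sketch of the difference, with a hypothetical sample:

```python
import numpy as np

x = np.array([4.1, 4.7, 5.0, 5.2, 5.5])  # hypothetical sample
n = len(x)

# Deviations are taken from the sample mean, which itself was estimated
# from the data, so only n - 1 values are free to vary.
biased = ((x - x.mean()) ** 2).sum() / n          # np.var(x, ddof=0)
unbiased = ((x - x.mean()) ** 2).sum() / (n - 1)  # np.var(x, ddof=1)
```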
Box-and-whisker plot
- The 1st quartile Q1 is the value for which 25% of the observations are smaller and 75% are larger
- Q2 is the same as median (50% are smaller and 50% larger)
- Q3 only 25% of the observations are larger
- Inter Quartile Range (IQR) is Q3-Q1. It covers 50% of the observations
Percentiles
In general the nth percentile is a value such that n% of the observations fall at or below it
Q1 = 25th percentile
Median (Q2) = 50th percentile
Q3 = 75th percentile
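A sketch computing the quartiles behind a box-and-whisker plot, on hypothetical data:

```python
import numpy as np

x = np.array([3.9, 4.4, 4.8, 5.1, 5.3, 5.8, 6.2, 11.0])  # hypothetical data

q1, q2, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1

# A common box-plot convention (assumed here, not stated on the slide):
# whiskers reach the last points within 1.5 * IQR of the box;
# values beyond the fences are drawn as outliers.
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower_fence) | (x > upper_fence)]
```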
Other common distributions
- unimodal
- bimodal
- skewed
Mean vs median - which is best?
- Mean is best for symmetric distributions without outliers
- Median is useful for skewed distributions of data with outliers
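A quick illustration with hypothetical values, showing how one outlier drags the mean but barely moves the median:

```python
import numpy as np

x = np.array([5.0, 5.2, 5.4, 5.6, 5.8])  # symmetric, no outliers
x_out = np.append(x, 50.0)               # add one extreme outlier

print(x.mean(), np.median(x))            # 5.4 5.4  -> mean and median agree
print(x_out.mean(), np.median(x_out))    # ~12.8 5.5 -> mean dragged by the outlier
```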
From samples to populations
So how do we know whether the effect observed in our sample was genuine?
- We don’t
Instead we use p values to indicate our level of uncertainty that our results represent a genuine effect present in the whole population
p-Values
- P-value: the probability that the observed result (or a more extreme one) was obtained by chance alone, i.e. assuming the null hypothesis H0 is true
- If that probability (p-value) is small, the observed result cannot easily be explained by chance
- Because the p-value assumes the null hypothesis is correct, a small value lets you reject the null hypothesis in favour of the alternative hypothesis
- A large p-value means the observed data align with the null hypothesis, making it the more likely explanation
Hypothesis Testing
The null hypothesis H0
- No statistical significance between an observed result and the data set to which it belongs
- There is no difference between the case and control groups
- H0: μ1-μ2 = 0
The alternative hypothesis
- Opposite of the null hypothesis
- Hypothesis with statistical significance
- Generally the hypothesis that is believed by the researcher
- HA: μ1-μ2 ≠ 0
p Values and level of significance
- P-value: probability of the observed result (or a more extreme one) occurring, assuming that the null hypothesis is true
- Level of significance, ɑ: specified threshold that defines the rejection region
- Rejection region: all values for which H0 will be rejected (outside the red lines in the figure)
- Between the lines: 95% probability that a value is > the left line and < the right line
How to calculate p-values:
- Add up the areas under the curve in the tails (outside the lines) and divide by the total area under the curve
-> In other words, there is a 95% probability that each time we measure a Brazilian woman, her height will be between 142 and 169 cm.
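A sketch of this tail-area calculation, assuming a normal model with hypothetical parameters chosen to match the slide's 142-169 cm interval:

```python
from scipy.stats import norm

# Assumed parameters: mean 155.5 cm and sd ~6.9 cm put ~95% of the
# distribution between 142 and 169 cm, as on the slide.
mu, sigma = 155.5, 6.9

# Two-sided p-value for an observed height of 170 cm:
# add up the areas under the curve in both tails.
x = 170.0
p = norm.sf(x, mu, sigma) + norm.cdf(mu - (x - mu), mu, sigma)
```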
Empirical p-values
- Parametric: p-values are based on well-defined models, e.g. Gaussian or Poisson distributions
- What if we don't know the distribution?
-> The only thing we know is that the data do not follow a normal distribution
- We can derive the null distribution from the data itself, then calculate the p-value
-> Also known as empirical p-values
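One common way to build a null distribution from the data itself is a permutation (label-shuffling) test; a minimal sketch with hypothetical non-normal groups:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.lognormal(1.0, 0.5, size=20)  # hypothetical case group (non-normal)
b = rng.lognormal(1.3, 0.5, size=20)  # hypothetical control group
observed = a.mean() - b.mean()

# Build the null distribution by shuffling the group labels many times.
pooled = np.concatenate([a, b])
null = np.empty(10_000)
for i in range(null.size):
    rng.shuffle(pooled)
    null[i] = pooled[:20].mean() - pooled[20:].mean()

# Empirical two-sided p-value: fraction of label-shuffled differences
# at least as extreme as the observed difference.
p = (np.abs(null) >= abs(observed)).mean()
```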
One sample t-test
- The one-sample t-test is used to compare the mean m of one sample to a known standard (or theoretical/hypothetical) mean (μ)
- t = (m − μ) / (s / √n), where m = sample mean, n = sample size, μ = theoretical mean, s = standard deviation
Research question:
- Is the mean (m) of the sample equal to the theoretical mean (μ)?
- Is the mean (m) of the sample less than or greater than the theoretical mean (μ)?
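A sketch of the test on a hypothetical sample, computing the t-statistic by hand and via SciPy:

```python
import numpy as np
from scipy import stats

x = np.array([4.8, 5.1, 5.3, 5.6, 5.9, 6.2])  # hypothetical sample
mu = 5.0                                       # theoretical mean

# Manual t-statistic, matching t = (m - mu) / (s / sqrt(n))
m, s, n = x.mean(), x.std(ddof=1), len(x)
t_manual = (m - mu) / (s / np.sqrt(n))

# SciPy returns the same t plus a two-sided p-value
t, p = stats.ttest_1samp(x, popmean=mu)
```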
Two samples t-test -> unpaired t-test
- The unpaired two-samples t-test is used to compare the means of two independent groups.
- Example: Measured weight of 100 individuals: 50 women (group A) and 50 men (group B). We want to know if the mean weight of women (mA) is significantly different from that of men (mB).
-> Two unrelated (i.e., independent or unpaired) groups of samples. Therefore, it's possible to use an independent t-test to evaluate whether the means are different (see the sketch below).
- Research question:
-> Is the mean of group A (mA) equal to the mean of group B (mB)?
-> Is the mean of group A (mA) less than or greater than the mean of group B (mB)?
- Classical t-test:
-> If the variances of the two groups are equivalent (homoscedasticity)
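A minimal sketch with hypothetical weights (small groups for brevity, not the slide's 50 + 50):

```python
import numpy as np
from scipy import stats

weights_a = np.array([62.0, 65.5, 58.3, 70.1, 64.8, 61.2])  # hypothetical women (group A)
weights_b = np.array([78.2, 82.5, 75.0, 88.1, 80.4, 84.3])  # hypothetical men (group B)

# Unpaired two-samples t-test; equal_var=True gives the classical
# Student version with a pooled variance.
t, p = stats.ttest_ind(weights_a, weights_b, equal_var=True)
```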
Two samples t-test -> Classical t-test
- If the variances of the two groups are equivalent (homoscedasticity):
t = (mA − mB) / √(S² (1/nA + 1/nB))
- mA and mB represent the mean values of groups A and B, respectively; nA and nB represent the sizes of groups A and B, respectively.
- S² is an estimator of the pooled variance of the two groups:
S² = (Σ(x − mA)² + Σ(x − mB)²) / (nA + nB − 2)
Two samples t-test -> Welch t-statistics
- If the variances of the two groups being compared are different (heteroscedasticity), it's possible to use the Welch t-test, an adaptation of the Student t-test (no pooled variance S):
t = (mA − mB) / √(sA²/nA + sB²/nB)
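A sketch contrasting the two variants on hypothetical groups with unequal spread; the pooled variance S² is computed exactly as defined above:

```python
import numpy as np
from scipy import stats

a = np.array([4.9, 5.2, 5.4, 5.7, 6.0])         # hypothetical group A
b = np.array([5.8, 6.9, 7.4, 8.8, 10.1, 11.5])  # hypothetical group B, larger spread

nA, nB, mA, mB = len(a), len(b), a.mean(), b.mean()

# Classical t-test: pooled variance S^2
s2 = (((a - mA) ** 2).sum() + ((b - mB) ** 2).sum()) / (nA + nB - 2)
t_classic = (mA - mB) / np.sqrt(s2 * (1 / nA + 1 / nB))

# Welch t-test: no pooled variance (heteroscedasticity)
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)
```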
ANOVA test
- The one-way analysis of variance (ANOVA), also known as one-factor ANOVA, is an extension of the independent two-sample t-test for comparing means in a situation where there are more than two groups.
- In one-way ANOVA, the data is organised into several groups based on one single grouping variable (also called factor variable).
- ANOVA test hypotheses:
-> Null hypothesis: the means of the different groups are the same
-> Alternative hypothesis: At least one sample mean is not equal to the others.
ANOVA test -> What is calculated?
Assume there are 3 groups (A, B, C) to compare:
- Compute the common variance, called the variance within samples (S²within) or residual variance
- Compute the variance between sample means as follows:
-> Compute the mean of each group
-> Compute the variance between sample means (S²between)
- Produce the F-statistic as the ratio S²between / S²within
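A sketch of these steps on three hypothetical groups, checked against SciPy's one-way ANOVA:

```python
import numpy as np
from scipy import stats

a = np.array([5.1, 5.4, 5.0, 5.6])  # hypothetical group A
b = np.array([6.2, 6.0, 6.5, 6.3])  # hypothetical group B
c = np.array([5.8, 5.5, 6.0, 5.7])  # hypothetical group C

groups = [a, b, c]
grand_mean = np.concatenate(groups).mean()
k, n = len(groups), sum(len(g) for g in groups)

# Variance between sample means and variance within samples
s2_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
s2_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)
f_manual = s2_between / s2_within

# SciPy's one-way ANOVA produces the same F plus its p-value
f, p = stats.f_oneway(a, b, c)
```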