HC 3 - Metabolomics Data Analysis 1: Exploration and Discrimination Flashcards
Lecture 3
Goal of predictive multivariate analysis
Is someone sick or healthy based on metabolite profile?
What is the input data for explorative multivariate analysis?
Multivariate data matrix
Components multivariate data matrix
Rows: individuals, samples, countries
Columns: variables metabolites, genes, qualitative/quantitative
Which function is the goal of predictive multivariate analysis?
Y = f(X)
with Y being the class (1 or 0 for sick or healthy, a variable of the individuals in the data matrix), expressed as a function of the X variables
Univariate analysis of metabolomics data: question to answer (goal)
Which variables, each tested on its own, show significant differences between groups/classes?
For univariate analysis, a t-test/ANOVA or a Wilcoxon test can be used. Describe the difference between these approaches
-T-test or ANOVA > parametric test: assumes the data follow a distribution that can be described by a mathematical formula (typically the normal distribution)
-Wilcoxon > nonparametric test: assumes no knowledge about the distribution (it works on the ranks of the values)
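The difference between the two kinds of test can be sketched in a few lines of Python. This is only an illustration, not a full test procedure: the metabolite values are made up, the helper names `welch_t` and `rank_sum` are hypothetical, and only the test statistics are computed (no p-values). The t statistic uses means and variances, so one extreme value changes it strongly; the rank-sum statistic uses only the ordering, so it stays the same.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch t statistic: built from means and variances, so it relies on
    the shape of the distribution (assumed roughly normal)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def rank_sum(a, b):
    """Wilcoxon rank-sum statistic for group a: uses only the ordering of
    the pooled values (this sketch assumes there are no tied values)."""
    pooled = sorted(a + b)
    return sum(pooled.index(x) + 1 for x in a)

# Hypothetical levels of one metabolite in two groups.
healthy = [1.0, 1.2, 1.4, 1.6, 1.8]
sick = [2.0, 2.2, 2.4, 2.6, 2.8]
sick_outlier = [2.0, 2.2, 2.4, 2.6, 28.0]  # same ordering, one extreme value

t_clean = welch_t(healthy, sick)
t_outlier = welch_t(healthy, sick_outlier)
w_clean = rank_sum(healthy, sick)
w_outlier = rank_sum(healthy, sick_outlier)

print("t statistic:", round(t_clean, 2), "vs with outlier:", round(t_outlier, 2))
print("rank sum:   ", w_clean, "vs with outlier:", w_outlier)
```

In this toy data the outlier inflates the variance and shrinks the t statistic, while the rank sum is unchanged because the ordering of the samples did not change.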
Problem with univariate analysis
It cannot detect multivariate discrimination
If neither of two variables can separate the groups on its own, but a clear separation is visible when they are plotted against each other: what needs to be done (multivariate discrimination)?
A new variable should be made that describes the variation (that maximizes the between-group difference). The formula for the multivariate solution is VARclass = 1*Var1 + (-1)*Var2, so that when
-Var1 > Var2: VARclass = positive
-Var1 < Var2: VARclass = negative
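A small sketch of the VARclass idea, using made-up (Var1, Var2) pairs. The sample values and the helper names `var_class` and `ranges_overlap` are assumptions for illustration only. Each variable on its own overlaps between the two groups, but the new variable Var1 - Var2 separates them by sign.

```python
# Hypothetical samples as (Var1, Var2). The two variables are correlated,
# and the ranges of each single variable overlap between the groups.
group_a = [(2.0, 1.0), (4.0, 3.0), (6.0, 5.0)]  # Var1 > Var2
group_b = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]  # Var1 < Var2

def var_class(var1, var2):
    """New variable from the flashcard: 1*Var1 + (-1)*Var2."""
    return 1 * var1 + (-1) * var2

def ranges_overlap(xs, ys):
    """True if the value ranges of two groups overlap on one variable."""
    return max(min(xs), min(ys)) <= min(max(xs), max(ys))

var1_overlaps = ranges_overlap([s[0] for s in group_a], [s[0] for s in group_b])
var2_overlaps = ranges_overlap([s[1] for s in group_a], [s[1] for s in group_b])

scores_a = [var_class(v1, v2) for v1, v2 in group_a]
scores_b = [var_class(v1, v2) for v1, v2 in group_b]

print("Var1 alone overlaps:", var1_overlaps)
print("Var2 alone overlaps:", var2_overlaps)
print("VARclass group a:", scores_a)
print("VARclass group b:", scores_b)
```

A univariate test on Var1 or Var2 alone would miss this separation, which is why both variables are needed together.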
Why do correlated metabolites occur?
Because of pathways and feedback mechanisms
> metabolites are dependent on each other
> both variables are needed for good discrimination
Explorative multivariate analysis: Principal component analysis (PCA): which purposes?
-Data reduction: PCA reduces a large data matrix into two smaller matrices which are easier to plot and interpret
-Data exploration: PCA extracts the most important factors (Principal Components) from the data to describe multivariate interactions between measured variables.
-Data understanding: Use PCs to classify samples, identify compound spectra, determine biomarkers, etc.
What is PC1?
The component which describes the most variation in the data
Basic equation PCA
X = t1 p1^T + t2 p2^T + … + tR pR^T + E
= T P^T + E
X > (I x J) a data matrix
T > (I x R) are the scores (per sample)
P > (J x R) are the loadings (per variable)
E > (I x J) are the residuals
R is the number of Principal Components used to describe X.
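The shapes in this equation can be checked with a short NumPy sketch. This is one possible way to obtain T and P (via the singular value decomposition), not necessarily the method used in the lecture. The random data, the sizes I, J and R, and the column centering are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, R = 6, 4, 2                    # samples, variables, components kept

X = rng.normal(size=(I, J))          # made-up data matrix
X = X - X.mean(axis=0)               # column-center (assumed preprocessing)

# SVD: X = U * diag(s) * Vt, with components ordered by explained variation.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

T = U[:, :R] * s[:R]                 # scores,    shape (I, R), one row per sample
P = Vt[:R].T                         # loadings,  shape (J, R), one row per variable
E = X - T @ P.T                      # residuals, shape (I, J)

# The matrix product T P^T equals the sum of the rank-one terms t_r p_r^T.
rank_one_sum = sum(np.outer(T[:, r], P[:, r]) for r in range(R))

print(T.shape, P.shape, E.shape)
print("X == T P^T + E:", np.allclose(X, T @ P.T + E))
```

With R equal to the rank of X the residual E would be zero; keeping fewer components leaves the remaining variation in E.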
What are the scores?
The new values of the samples on each PC (for example PC1 and PC2) > they give information about the position of the samples
Why are loadings written as p^T in the equation?
The T means transpose: the loading vector p (the direction of the PC across the variables) is turned into a row, so that the column of scores t times the row p^T gives a matrix of the same size as X
What are loadings and why are they needed?
Weights are needed for each variable to determine how the line is drawn (the line is the new PC axis through all dimensions)
> P are the loadings (for each variable)
> 2 PCs means 2 loadings needed per variable.
Each PC can be written as a distinct matrix. How can the scores and loadings be used to describe data? Say there are two samples measured on five variables: sample1 has the values -3, -3, 0, 3, 3 and sample2 has the values -2, -2, 3, 2, 2
PC1
Scores: -6 for sample1 and -4 for sample2
Loading: 0.5,0.5,0,-0.5,-0.5
This describes:
-3 -3 0 3 3
-2 -2 0 2 2
PC2
Scores: 0 for sample1 (already fully described by PC1) and 3 for sample2
Loading: 0,0,1,0,0
PC1 and PC2 describe
-3 -3 0 3 3
-2 -2 3 2 2
Done!
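The worked example above can be reproduced in plain Python: each PC contributes the matrix "scores column times loadings row", and the contributions are added up. The helper names `outer` and `add` are hypothetical; the scores and loadings are the ones from the flashcard.

```python
def outer(t, p):
    """Matrix t * p^T: one row per sample (score), one column per variable."""
    return [[ti * pj for pj in p] for ti in t]

def add(a, b):
    """Element-wise sum of two matrices of the same size."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# Scores (one per sample) and loadings (one per variable) for each PC.
t1, p1 = [-6, -4], [0.5, 0.5, 0, -0.5, -0.5]
t2, p2 = [0, 3], [0, 0, 1, 0, 0]

pc1_part = outer(t1, p1)               # what PC1 alone describes
x_hat = add(pc1_part, outer(t2, p2))   # PC1 and PC2 together

for row in pc1_part:
    print(row)
for row in x_hat:
    print(row)
```

After PC1 only the third variable of sample2 is still unexplained, which is exactly what PC2 picks up.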
Why is it important to leave a residual E instead of adding more PCs?
Extra PCs are not insightful: PCs are added only until what remains looks like noise, and that remainder is left in E (we do not want to describe noise)
> we only want to describe systematic variation