HC 3 - Metabolomics Data Analysis 1: Exploration and Discrimination Flashcards
Lecture 3
Goal of predictive multivariate analysis
Is someone sick or healthy based on metabolite profile?
What is the input data for explorative multivariate analysis?
Multivariate data matrix
Components multivariate data matrix
Rows: individuals, samples, countries
Columns: variables metabolites, genes, qualitative/quantitative
Which function is the goal of predictive multivariate analysis?
Y = f(X)
with Y being the class (1 or 0 for sick or healthy, a variable of the individuals) expressed as a function of the X variables in the data matrix
Univariate analysis of metabolomics data: question to answer (goal)
Which variables, each tested on its own, show a significant difference between groups/classes?
For univariate analysis, a t-test/ANOVA or a Wilcoxon test can be used. Describe the difference between these approaches
-T-test or ANOVA > parametric test: assumes a mathematical formula that describes the distribution of the data (e.g. a normal distribution)
-Wilcoxon > nonparametric test: no assumption about the form of the distribution
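As a rough illustration of the difference, the sketch below computes only the two test statistics in plain Python, on made-up values for one metabolite. It assumes two independent groups (so the rank-sum variant of the Wilcoxon test), no tied values, and a pooled variance for the t statistic; p-values are left out. The function names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(a, b):
    """Two-sample t statistic with pooled variance (parametric)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))

def rank_sum(a, b):
    """Wilcoxon rank-sum statistic of group a (assumes no tied values)."""
    ranked = sorted(a + b)
    return sum(ranked.index(v) + 1 for v in a)

# Made-up intensities of one metabolite (illustrative only).
sick = [5.1, 5.4, 5.8, 6.0]
healthy = [4.0, 4.2, 4.5, 4.9]
sick_with_outlier = [5.1, 5.4, 5.8, 60.0]

# The rank-based statistic only uses the ordering of the values,
# so the outlier does not change it; the t statistic uses means
# and variances, so it does change.
print(rank_sum(sick, healthy), rank_sum(sick_with_outlier, healthy))
print(t_statistic(sick, healthy), t_statistic(sick_with_outlier, healthy))
```

The rank-based statistic is the same with and without the extreme value, which is why the nonparametric test needs no formula for the distribution.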
Problem with univariate analysis
It cannot detect multivariate discrimination
If neither of two variables separates the groups on its own, but a clear separation is visible when they are plotted against each other: what needs to be done (multivariate discrimination)?
A new variable should be made which maximizes the between-group difference. The multivariate solution is VARclass = 1*Var1 + (-1)*Var2, so that when
-Var1 > Var2: VARclass = positive
-Var1 < Var2: VARclass = negative
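The idea can be checked with a small sketch. The values are made up for illustration, and the helper names `var_class` and `ranges_overlap` are hypothetical; the only thing taken from the card is the formula VARclass = 1*Var1 + (-1)*Var2.

```python
# Made-up (Var1, Var2) values for two correlated metabolites (illustrative only).
healthy = [(1.0, 2.0), (3.0, 4.1), (5.0, 6.2), (7.0, 8.0)]
sick = [(2.0, 1.0), (4.1, 3.0), (6.2, 5.0), (8.0, 7.0)]

def var_class(var1, var2):
    """New variable from the card: VARclass = 1*Var1 + (-1)*Var2."""
    return 1 * var1 + (-1) * var2

def ranges_overlap(a, b):
    """True if the value ranges of two groups overlap for a single variable."""
    return max(min(a), min(b)) <= min(max(a), max(b))

var1_overlap = ranges_overlap([v1 for v1, _ in healthy], [v1 for v1, _ in sick])
var2_overlap = ranges_overlap([v2 for _, v2 in healthy], [v2 for _, v2 in sick])

healthy_class = [var_class(v1, v2) for v1, v2 in healthy]
sick_class = [var_class(v1, v2) for v1, v2 in sick]

print(var1_overlap, var2_overlap)  # each variable alone: the groups overlap
print(healthy_class, sick_class)   # the combined variable separates them by sign
```

In this toy data each variable alone overlaps between the groups, while the sign of VARclass separates them completely.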
Why do correlated metabolites occur?
Because of pathways and feedback mechanisms
> metabolites depend on each other
> both variables are needed for good discrimination
Explorative multivariate analysis: Principal component analysis (PCA): which purposes?
-Data reduction: PCA reduces a large data matrix into two smaller matrices which are easier to plot and interpret
-Data exploration: PCA extracts the most important factors (Principal Components) from the data to describe multivariate interactions between measured variables.
-Data understanding: Use PCs to classify samples, identify compound spectra, determine biomarkers etc.
What is PC1?
The component which describes the most variation in the data
Basic equation PCA
X = t1*p1^T + t2*p2^T + … + tR*pR^T + E
= T*P^T + E
X > (I x J) a data matrix
T > (I x R) are the scores (per sample)
P > (J x R) are the loadings (per variable)
E > (I x J) are the residuals
R is the number of Principal Components used to describe X.
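A minimal sketch of the decomposition for R = 1, in plain Python. It estimates PC1 by power iteration, which is only one of several ways to obtain a principal component and is not necessarily the algorithm behind the lecture. The data matrix is made up and already column-centered; `first_pc` and `sum_sq` are hypothetical helper names.

```python
from math import sqrt

def first_pc(X, n_iter=100):
    """Estimate PC1 of a column-centered matrix X by power iteration.

    Returns (t, p): one score per sample and unit-length loadings per variable.
    """
    n_vars = len(X[0])
    p = [1.0 / sqrt(n_vars)] * n_vars  # arbitrary start direction
    for _ in range(n_iter):
        t = [sum(x * w for x, w in zip(row, p)) for row in X]       # t = X p
        p = [sum(row[j] * ti for row, ti in zip(X, t))              # p = X^T t
             for j in range(n_vars)]
        norm = sqrt(sum(w * w for w in p))
        p = [w / norm for w in p]                                   # unit length
    t = [sum(x * w for x, w in zip(row, p)) for row in X]
    return t, p

def sum_sq(M):
    """Sum of squares of all elements of a matrix."""
    return sum(v * v for row in M for v in row)

# Made-up, column-centered data: 4 samples (I) x 3 variables (J).
X = [[2.0, 1.0, -0.5],
     [1.0, 0.5, 0.5],
     [-1.0, -0.5, -0.5],
     [-2.0, -1.0, 0.5]]

t1, p1 = first_pc(X)
# Residuals after one component: E = X - t1 * p1^T
E = [[x - ti * w for x, w in zip(row, p1)] for row, ti in zip(X, t1)]
explained = 1 - sum_sq(E) / sum_sq(X)
print(t1, p1, explained)
```

The scores `t1` have one value per sample (length I), the loadings `p1` one value per variable (length J), and `E` has the same size as X, matching the dimensions listed above.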
What are the scores?
The new values of the samples on the PCs (for example PC1 and PC2) > give information about the position of the samples
Why are the loadings written as p^T in the equation?
The T means transpose: the vector p (the loadings of the PC across all variables) is turned into a row, so that the score column t times the row p^T gives a matrix with the same size as X (I x J)
What are loadings and why are they needed?
Weights are needed for each variable to determine the direction of the line (the line is the new PC through all dimensions)
> P are the loadings (for each variable)
> 2 PCs means 2 loadings needed per variable.
Each PC can be written as a distinct matrix. How can the scores and loadings be used to describe data? Let's say there are two samples with five variables: the values are -3, -3, 0, 3, 3 for the first sample and -2, -2, 3, 2, 2 for the second sample
PC1
Scores: -6 for sample1 and -4 for sample2
Loading: 0.5,0.5,0,-0.5,-0.5
This describes:
-3 -3 0 3 3
-2 -2 0 2 2
PC2
Scores: 0 for sample1 (already done) and 3 for sample2
Loading: 0,0,1,0,0
PC1 and PC2 describe
-3 -3 0 3 3
-2 -2 3 2 2
Done!
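The worked example above can be verified in a few lines of plain Python. The numbers are the ones from the card; the helper name `reconstruct` is hypothetical.

```python
# Data, scores and loadings taken from the worked example.
X = [[-3, -3, 0, 3, 3],
     [-2, -2, 3, 2, 2]]

T = [[-6, 0],   # sample 1: scores on PC1, PC2
     [-4, 3]]   # sample 2: scores on PC1, PC2

P = [[0.5, 0],  # one row per variable: loadings on PC1, PC2
     [0.5, 0],
     [0, 1],
     [-0.5, 0],
     [-0.5, 0]]

def reconstruct(T, P):
    """Compute T * P^T: each element is the sum over PCs of score * loading."""
    return [[sum(t * p for t, p in zip(t_row, p_row)) for p_row in P]
            for t_row in T]

X_hat = reconstruct(T, P)
print(X_hat == X)  # both PCs together describe X without residual
```

Dropping the second column of `T` and `P` gives the PC1-only description, in which the value 3 of sample 2 is not yet explained.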
Why is it important to make a residual E and not make more PCs?
Extra PCs are not insightful: PCs are added only as long as they describe systematic variation, until what remains looks like noise (we do not want to describe noise)
> we only want to describe systematic variation
What does the loading say?
How each variable contributes to the multivariate direction of the PC
What does a score plot show?
The samples/individuals as dots in a plot with PC1 and PC2 as axes for example.
> the first PC does not have to distinguish the classes the best, this could also be the second PC.
What does a loading plot show? If a dot is around 0, what does this mean.
The loading plot also uses PC1 and PC2 as axes, so the positions of its dots can be related to the positions of the sample dots in the score plot. The dots are the variables (metabolites etc.); loadings around zero mean that these variables have only a small effect on the model.
The residuals (E), the non-systematic part of the data, can be summarized as sums across objects or variables. What do high E values mean for either individuals or variables?
Individual: a high E means that the individual is poorly described by the PCA model
Variable: a high E means that the metabolite/variable is poorly described by the PCA model and might be an outlier
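A small sketch of these residual summaries, reusing the two-sample example from the earlier card but keeping only PC1 in the model. Summing the squared residuals (rather than the raw residuals) is an assumption made here so that positive and negative values do not cancel.

```python
# Two-sample example from the earlier card, modelled with PC1 only.
X = [[-3, -3, 0, 3, 3],
     [-2, -2, 3, 2, 2]]
t1 = [-6, -4]                    # PC1 scores per sample
p1 = [0.5, 0.5, 0, -0.5, -0.5]   # PC1 loadings per variable

# Residual matrix E = X - t1 * p1^T
E = [[x - t * p for x, p in zip(row, p1)] for row, t in zip(X, t1)]

# Sum of squared residuals per sample (across variables) ...
per_sample = [sum(e * e for e in row) for row in E]
# ... and per variable (across samples).
per_variable = [sum(row[j] ** 2 for row in E) for j in range(len(p1))]

print(per_sample)    # sample 2 is poorly described by PC1 alone
print(per_variable)  # the third variable is poorly described by PC1 alone
```

Here the high residual of sample 2 and of the third variable points to exactly the value that PC2 picks up in the worked example.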
Why is centering and scaling of the data necessary?
Because we are interested in the differences between objects, not in their absolute values
> scaling is needed because when variables are measured in different units, scaling gives each variable an equal chance of contributing to the model
Mean-centering
Subtract the mean of each column from all values of that column of X, so that the model describes the variation around the column mean and not the distance of the values from zero
> the new column means are zero
> the columns with the largest absolute values after mean-centering show the largest variation
Why is variable/feature scaling needed
To make variables comparable with respect to their biological effects (are large peaks more important than small peaks?) or to equalize their relative importance.
Variable / Feature Scaling
Divide each value of each column of X by the scaling factor of that column (for example the standard deviation of the column).
> with the standard deviation as scaling factor, all columns end up with a standard deviation of 1.0
Types of scaling with scaling factors
-Range scaling: scaling factor is the biological range (Max-Min) of all values in the column
-Standardization/autoscaling: scaling factor is the standard deviation of all values in the column
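Mean-centering and the two scaling types can be sketched per column as follows. The data are made up, the function names are illustrative, and using the sample standard deviation (n - 1) is an assumption; the cards do not say which variant is meant.

```python
from statistics import mean, stdev

# Made-up data: rows are samples, columns are two metabolites (illustrative only).
X = [[10.0, 200.0],
     [12.0, 260.0],
     [14.0, 230.0]]
columns = [list(col) for col in zip(*X)]  # work column by column

def center(col):
    """Mean-centering: subtract the column mean from every value."""
    m = mean(col)
    return [v - m for v in col]

def autoscale(col):
    """Center, then divide by the column standard deviation (n - 1, assumed)."""
    s = stdev(col)
    return [v / s for v in center(col)]

def range_scale(col):
    """Center, then divide by the column range (Max - Min)."""
    r = max(col) - min(col)
    return [v / r for v in center(col)]

centered = [center(c) for c in columns]
auto = [autoscale(c) for c in columns]
ranged = [range_scale(c) for c in columns]

print(centered)
print(auto)
print(ranged)
```

After autoscaling both columns have a standard deviation of 1, so the second metabolite no longer dominates just because its values are larger.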
Log transformation
Take the log of each individual value
> make data distribution somewhat normal (Gaussian)
Difference between centering/scaling methods and log transformation
Transformation is performed for each value instead of per column
Problems with log transformation
Small values are problematic: values close to 0 become very large negative numbers and log(0) is undefined
-Square root is less problematic
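The problem with small values can be seen in a short sketch. The peak values are made up, and returning `None` for a value without a defined logarithm is just one illustrative way to flag it; the cards do not prescribe how zeros should be handled.

```python
from math import log10, sqrt

# Made-up peak intensities, including a zero (illustrative only).
peaks = [0.0, 0.001, 1.0, 100.0, 10000.0]

def safe_log10(value):
    """Return log10(value), or None where the logarithm is undefined."""
    try:
        return log10(value)
    except ValueError:  # raised by math.log10 for 0 and negative numbers
        return None

log_values = [safe_log10(v) for v in peaks]
sqrt_values = [sqrt(v) for v in peaks]

print(log_values)   # the zero has no log; 0.001 becomes strongly negative
print(sqrt_values)  # the square root is defined for every value, including 0
```

Both transformations compress the large peaks, but only the square root handles the zero without extra steps.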
When is log transformation needed?
If the data look unevenly distributed (skewed)
When is log transformation performed?
Before centering and scaling
Why is Discriminant PLS needed when PCA exists?
PCA scores are good summarizers of the X data, but not necessarily of Y (classes/groups) > in many cases the first PC does not discriminate the groups.
-DPLS scores are used to predict the class Y for data X
How is Y calculated from the X variables in DPLS?
Per variable a different b-PLS coefficient is calculated: the regression coefficient for predicting Y.
> PLS scores are chosen so that they can predict the Y value
Formula X in DPLS
X = T-PLS * P^T + E
Regression of Class y in DPLS
y = T-PLS * b + f (so y-pred = T-PLS * b)
What does the T-PLS mean in DPLS?
The PLS scores
What do a high and a low b-value mean?
High: Y is well predictable
Low: Y is not very well predictable
Difference between b-value and b-PLS in DPLS
The b-value tells which PC is important for prediction in general, not per variable; b-PLS is the regression coefficient per variable
What is f in the DPLS?
The residual: the part of y that is not predicted from the PLS scores
Two goals of DPLS
-Model of data > X formula
> is the data well described
> clustering/outliers
> do new samples fit the model of the data?
-Calibration model between scores and class > y formula
> is there relationship between T-PLS and Y
> what is the prediction error?
How is y-pred interpreted?
Negative values > Y=0
Positive values > Y=1
> both binary values of Y represent the classes
> alternatively, use a cutoff of 0.5
Prediction model for new data of DPLS
for x-new
> y-new = x-new * b-PLS
What do the PLS scores do (function)?
Describe X and predict Y
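The prediction rule y-new = x-new * b-PLS and the class assignment from the cards above can be sketched as follows. The coefficients in `b_pls` are hypothetical numbers, not fitted to any data, and the new samples are assumed to be preprocessed (centered and scaled) in the same way as the training data. The default cutoff of 0 follows the sign rule; 0.5 can be passed instead.

```python
# Hypothetical per-variable regression coefficients (not fitted to real data).
b_pls = [0.8, -0.6, 0.1]

def predict(x_new, b):
    """y-new = x-new * b-PLS: weighted sum of the variables of one sample."""
    return sum(x * coef for x, coef in zip(x_new, b))

def classify(y_pred, cutoff=0.0):
    """Assign class 1 above the cutoff and class 0 otherwise."""
    return 1 if y_pred > cutoff else 0

# Two made-up new samples, assumed to be preprocessed like the training data.
sample_a = [1.0, -0.5, 0.2]
sample_b = [-1.0, 1.0, 0.0]

y_a = predict(sample_a, b_pls)
y_b = predict(sample_b, b_pls)
print(y_a, classify(y_a))
print(y_b, classify(y_b))
```

With these toy numbers the first sample gets a positive prediction (class 1) and the second a negative one (class 0).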