HC 3 - Metabolomics Data Analysis 1: Exploration and Discrimination Flashcards by Tobias H

Goal of predictive multivariate analysis

To predict a class from the data, e.g. is someone sick or healthy based on their metabolite profile?

What is the input data for explorative multivariate analysis?

Multivariate data matrix

Components multivariate data matrix

Rows: individuals, samples, countries
Columns: variables (metabolites, genes; qualitative or quantitative)

Which function is the goal of predictive multivariate analysis?

Y = f(X)
with Y being the class (1 or 0 in the data matrix, for sick or healthy; a property of the individual), expressed as a function of the X variables

Univariate analysis of metabolomics data: question to answer (goal)

Which variables show a significant difference between groups/classes when each variable is tested on its own?

For univariate analysis, a t-test/ANOVA or Wilcoxon test can be used. Describe the difference between these approaches

-T-test or ANOVA > parametric tests: they assume the data follow a distribution that is described by a mathematical formula (the normal distribution)
-Wilcoxon > nonparametric test: makes no assumption about the shape of the distribution (it works on ranks)
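
The practical difference can be illustrated with a small sketch. It assumes two independent groups, in which case the Wilcoxon test takes its rank-sum (Mann-Whitney U) form, and it only computes the test statistics, not p-values. The data, the Welch form of the t statistic and the function names are illustrative assumptions, not taken from the lecture.

```python
import math
import statistics

def welch_t(a, b):
    """t statistic (Welch form): difference in means divided by its standard error."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se

def rank_sum_u(a, b):
    """Mann-Whitney U for group b: number of (a, b) pairs where b is larger (ties count 0.5)."""
    return sum(1.0 if y > x else 0.5 if y == x else 0.0 for x in a for y in b)

# Made-up concentrations of one metabolite in two groups
healthy = [1.0, 1.2, 0.9, 1.1]
sick = [2.0, 2.2, 1.9, 2.1]
sick_outlier = [2.0, 2.2, 1.9, 21.0]  # same group with one extreme value

t_clean = welch_t(healthy, sick)
t_outlier = welch_t(healthy, sick_outlier)
u_clean = rank_sum_u(healthy, sick)
u_outlier = rank_sum_u(healthy, sick_outlier)

print(f"t: {t_clean:.2f} (clean) vs {t_outlier:.2f} (with outlier)")
print(f"U: {u_clean} (clean) vs {u_outlier} (with outlier)")
```

In this toy case the t statistic drops sharply once the extreme value is added, because the mean and variance change, while the rank-based U stays the same: every sick value is still larger than every healthy value.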

Problem with univariate analysis

It cannot detect multivariate discrimination

If neither of two variables separates the groups on its own, but a clear separation is visible when they are plotted against each other: what needs to be done (multivariate discrimination)?

A new variable should be made which combines them and maximizes the between-group difference. For example, the multivariate solution VARclass = 1*Var1 + -1*Var2, so that when
-Var1 > Var2: VARclass = positive
-Var1 < Var2: VARclass = negative
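
A minimal sketch of this idea, using made-up values for Var1 and Var2 (the data and the function name `var_class` are illustrative assumptions). Each variable on its own overlaps between the groups, but the combined variable separates them.

```python
# Made-up (Var1, Var2) pairs for two groups
group_a = [(1.0, 0.5), (2.0, 1.5), (3.0, 2.5), (4.0, 3.5)]
group_b = [(1.0, 1.5), (2.0, 2.5), (3.0, 3.5), (4.0, 4.5)]

def var_class(var1, var2):
    """New variable: VARclass = 1*Var1 + -1*Var2."""
    return 1 * var1 + -1 * var2

def value_range(group, index):
    """Minimum and maximum of one variable within a group."""
    values = [sample[index] for sample in group]
    return min(values), max(values)

# Univariate view: the ranges of Var1 and of Var2 overlap between the groups
print("Var1 ranges:", value_range(group_a, 0), value_range(group_b, 0))
print("Var2 ranges:", value_range(group_a, 1), value_range(group_b, 1))

# Multivariate view: the sign of VARclass separates the groups
class_a = [var_class(v1, v2) for v1, v2 in group_a]
class_b = [var_class(v1, v2) for v1, v2 in group_b]
print("VARclass group A:", class_a)
print("VARclass group B:", class_b)
```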

Why do correlated metabolites occur?

Because of pathways and feedback mechanisms
> metabolites are dependent on each other
> both variables are needed for good discrimination

Explorative multivariate analysis: Principal component analysis (PCA): which purposes?

-Data reduction: PCA reduces large data matrix into two smaller matrices which are easier to plot and interpret
-Data exploration: PCA extracts most important factors (Principal Components) from data to describe multivariate interactions between measured variables.
-Data understanding: Use PCs to classify samples, identify compound spectra, determine biomarkers etc.

What is PC1?

The component which describes the most variation in the data
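
As a rough sketch of what "most variation" means, PC1 can be found numerically as the direction along which the mean-centered data have the largest sum of squares. The example below uses power iteration on a made-up data set; both the data and the choice of power iteration are illustrative assumptions, not the method used in the lecture.

```python
import math

# Made-up data: 4 samples x 2 variables, already mean-centered
X = [[2, 1], [1, 2], [-1, -2], [-2, -1]]
n, m = len(X), len(X[0])

# Cross-product matrix X^T X (proportional to the covariance matrix)
C = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(m)] for a in range(m)]

# Power iteration: multiply by C and renormalize until the direction settles
p1 = [1.0, 0.0]
for _ in range(100):
    v = [sum(C[a][b] * p1[b] for b in range(m)) for a in range(m)]
    length = math.sqrt(sum(x * x for x in v))
    p1 = [x / length for x in v]

# Scores on PC1: projection of every sample onto the direction p1
t1 = [sum(X[i][j] * p1[j] for j in range(m)) for i in range(n)]

# Fraction of the total variation described by PC1
explained = sum(t * t for t in t1) / sum(x * x for row in X for x in row)
print("PC1 direction:", p1)
print("PC1 scores:", t1)
print("fraction of variation explained:", explained)
```

For this toy data the direction converges to equal weights for both variables, and PC1 describes about 0.9 of the total variation.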

Basic equation PCA

X = t_1 p_1^T + t_2 p_2^T + … + t_R p_R^T + E
= T P^T + E
X > (I x J) a data matrix
T > (I x R) are the scores (per sample)
P > (J x R) are the loadings (per variable)
E > (I x J) are the residuals
R is the number of Principal Components used to describe X.
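
Written out per element, the same model reads as follows. This is only a restatement of the equation above with the dimensions made explicit; the index names i, j and r are chosen here for illustration.

```latex
X = \sum_{r=1}^{R} t_r \, p_r^{T} + E = T P^{T} + E,
\qquad
x_{ij} = \sum_{r=1}^{R} t_{ir} \, p_{jr} + e_{ij},
\quad i = 1, \dots, I, \; j = 1, \dots, J
```

Each element of X is therefore a sum of score-times-loading products, one per component, plus a residual.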

What are the scores?

The new values of the samples for PC1 and PC2 for example > gives information about the position of the samples

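Why are the loadings written as p^T in the equation?

p^T means the transpose of the loading vector p (the direction of the PC across the variables) > the column vector is made into a row, so that the score vector t (a column, one value per sample) times p^T (a row, one value per variable) gives a matrix of the same size as X (I x J)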

What are loadings and why are they needed?

Weights are needed for each variable to determine the direction of the line (the line is the new PC, running through all dimensions)
> P are the loadings (for each variable)
> 2 PCs means 2 loadings needed per variable.

Each PC can be written as a distinct matrix. How can the scores and loadings be used to describe data? Say there are two samples with five variables: sample 1 has the values -3, -3, 0, 3, 3 and sample 2 has the values -2, -2, 3, 2, 2.

PC1
Scores: -6 for sample1 and -4 for sample2
Loading: 0.5,0.5,0,-0.5,-0.5
This describes:
-3 -3 0 3 3
-2 -2 0 2 2
PC2
Scores: 0 for sample1 (already fully described by PC1) and 3 for sample2
Loading: 0,0,1,0,0
PC1 and PC2 describe
-3 -3 0 3 3
-2 -2 3 2 2
Done!
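
The example above can be checked with a few lines of code that multiply the scores with the transposed loadings. The variable names are chosen here for illustration; the numbers are the ones from the card.

```python
# Scores T (I x R): one row per sample, one column per PC
T = [[-6, 0],
     [-4, 3]]

# Loadings P (J x R): one row per variable, one column per PC
P = [[0.5, 0],
     [0.5, 0],
     [0, 1],
     [-0.5, 0],
     [-0.5, 0]]

I, R, J = len(T), len(T[0]), len(P)

# X_hat = T P^T: sum over the PCs of score times loading
X_hat = [[sum(T[i][r] * P[j][r] for r in range(R)) for j in range(J)]
         for i in range(I)]

for row in X_hat:
    print(row)
```

With both PCs included the product reproduces the two samples exactly, so the residual E is zero in this example.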

Why is it important to make a residual E and not make more PCs?

Extra PCs are not insightful: components are extracted until only what seems to be noise remains, and that remainder is left in the residual E (we do not want to describe noise)
> we only want to describe systematic variation

What does the loading say?

How each variable contributes to the multivariate direction of the PC

What does a score plot show?

The samples/individuals as dots in a plot with PC1 and PC2 as axes for example.
> the first PC does not have to distinguish the classes the best, this could also be the second PC.

What does a loading plot show? If a dot is around 0, what does this mean?

The loading plot also uses PC1 and PC2 as axes, so directions in the loading plot correspond to directions in the score plot: variables lying in the same direction as a group of samples tend to have high values in those samples. The dots are the variables (metabolites etc.), and loadings around zero mean that these variables have only a small effect on the model.

The residuals (E), the non-systematic part of the data, can be summarized as sums across objects or across variables. What do high E values mean for either individuals or variables?

Individual: a high E means that the individual is poorly described by the PCA model
Variable: a high E means that the metabolite/variable is poorly described by the PCA model and might be an outlier
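
A small sketch of these sums, reusing the two-sample example from the PCA cards but keeping only PC1, so that something is left in E. Summing the squared residuals (rather than the raw residuals) is an assumption made here; the variable names are illustrative.

```python
# Data and one-component PCA model from the worked example (PC1 only)
X = [[-3, -3, 0, 3, 3],
     [-2, -2, 3, 2, 2]]
t1 = [-6, -4]                      # scores on PC1
p1 = [0.5, 0.5, 0, -0.5, -0.5]     # loadings of PC1

n_samples, n_vars = len(X), len(X[0])

# Residuals: E = X - t1 * p1^T
E = [[X[i][j] - t1[i] * p1[j] for j in range(n_vars)] for i in range(n_samples)]

# Sum of squared residuals per individual (row) and per variable (column)
per_individual = [sum(e * e for e in row) for row in E]
per_variable = [sum(E[i][j] ** 2 for i in range(n_samples)) for j in range(n_vars)]

print("per individual:", per_individual)
print("per variable:", per_variable)
```

Sample 2 and the third variable carry all of the residual here, which is exactly the part that PC2 describes in the worked example.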

Why is centering and scaling of the data necessary?

Because we are interested in the differences between objects, not in their absolute values
> scaling is needed because when different variables are measured in different units, scaling gives each variable an equal chance of contributing to the model

Mean-centering

Subtract the mean of each column from all values in that column of X, so that the model describes the variation around the mean and not the distance of the values from zero
> the new column means are zero
> the columns with the largest absolute values after mean-centering show the largest variation

Why is variable/feature scaling needed

To make variables comparable with respect to their biological effect (are large peaks more important than small peaks?) or to equalize their relative importance.

Variable / Feature Scaling

Divide each value of each column of X by the scaling factor of that column (for example the standard deviation of the column). > with the standard deviation as scaling factor, all columns end up with a standard deviation of 1.0

Types of scaling with scaling factors

-Range scaling: scaling factor is the biological range (Max-Min) of all values in the column
-Standardization/autoscaling: scaling factor is the standard deviation of all values in the column
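
A minimal sketch of mean-centering followed by both types of scaling, on made-up data with two variables on very different scales. Using the sample standard deviation (n - 1 in the denominator) and the helper name `scale_column` are assumptions made for this illustration.

```python
import statistics

# Made-up data: 3 samples x 2 variables on very different scales
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
columns = list(zip(*X))  # work per column (per variable)

def scale_column(col, factor):
    """Mean-center one column, then divide by the given scaling factor."""
    mean = statistics.mean(col)
    return [(v - mean) / factor for v in col]

# Autoscaling: scaling factor is the standard deviation of the column
autoscaled = [scale_column(c, statistics.stdev(c)) for c in columns]

# Range scaling: scaling factor is the range (max - min) of the column
range_scaled = [scale_column(c, max(c) - min(c)) for c in columns]

print("autoscaled columns:", autoscaled)
print("range-scaled columns:", range_scaled)
```

After either scaling the two columns become identical in this toy case, so the variable with the larger raw numbers no longer dominates.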

Log transformation

Take the log of each individual value > makes the data distribution more normal (Gaussian)

Difference between centering/scaling methods and log transformation

Log transformation is performed on each value separately, whereas centering and scaling use a statistic calculated per column

Problems with log transformation

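-Values of zero or close to zero are problematic: log(0) is undefined and very small values become large negative numbers
-Square root transformation is less problematic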
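
The problem with zeros and very small values can be seen directly. In this sketch the values and the helper `try_log10` are made up for illustration; `math.log10` raises a `ValueError` for zero, which the helper turns into `None`.

```python
import math

# Made-up intensities, including a zero and a very small value
values = [0.0, 0.001, 1.0, 100.0]

def try_log10(v):
    """Return log10(v), or None when the logarithm is undefined."""
    try:
        return math.log10(v)
    except ValueError:
        return None

logs = [try_log10(v) for v in values]
roots = [math.sqrt(v) for v in values]

print("log10:", logs)
print("sqrt: ", roots)
```

The square root is defined for zero and keeps small values close to zero, whereas the log has no value for zero and pushes small values far into the negative range.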

When is log transformation needed?

If the data look unevenly distributed (skewed)

When is log transformation performed?

Before centering and scaling
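
The order of the steps can be sketched for a single variable as follows. The data are made up, and autoscaling with the sample standard deviation is an assumption chosen for the illustration.

```python
import math
import statistics

# Made-up data: one skewed metabolite measured in 4 samples
raw = [1.0, 10.0, 100.0, 1000.0]

# Step 1: log transformation, applied to each value separately
logged = [math.log10(v) for v in raw]

# Step 2: mean-centering, using the mean of the (logged) column
mean = statistics.mean(logged)
centered = [v - mean for v in logged]

# Step 3: scaling, using the standard deviation of the (logged) column
sd = statistics.stdev(centered)
processed = [v / sd for v in centered]

print("logged:   ", logged)
print("processed:", processed)
```

Because the mean and standard deviation are calculated after the log step, they describe the transformed data.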

Why is Discriminant PLS needed when PCA exists?

PCA scores are good summarizers of the X data, but not necessarily of Y (classes/groups) > in many cases the first PC does not discriminate the groups.
-DPLS scores are used to predict the class Y from the data X

How is Y calculated for each X variable in DPLS?

Per variable a different b-PLS coefficient is calculated: the regression coefficient for the prediction of Y. > PLS scores are chosen so that they can predict the Y value

Formula X in DPLS

X = T-PLS * P^T + E

Regression of Class y in DPLS

y = T-PLS * b + f, so y-pred = T-PLS * b (with one b-value per PLS component)

What does the T-PLS mean in DPLS?

The PLS scores

What do a high and a low b-value mean?

High: Y is well predicted by that component
Low: Y is not predicted very well by that component

Difference between b-value and b-PLS in DPLS

The b-value shows which PC is important for the prediction in general, not per variable; b-PLS gives a regression coefficient per variable

What is f in the DPLS?

The residual of y (the part of the class variable that is not explained by the model)

Two goals of DPLS

-Model of the data > X formula > is the data well described? > clustering/outliers > do new samples fit the model of the data?
-Calibration model between scores and class > y formula > is there a relationship between T-PLS and Y? > what is the prediction error?

How is y-pred interpreted?

Y is binary and both values are classes, so y-pred is converted to a class:
Negative values > Y=0
Positive values > Y=1
or a cutoff of 0.5 is used

Prediction model for new data of DPLS

for x-new > y-new = x-new * b-PLS
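
The whole chain (scores, b-value, b-PLS, prediction for a new sample) can be sketched with a single PLS component. This is a simplified illustration under stated assumptions, not the lecture's implementation: the data are made up, X and y are only mean-centered, one component is extracted, and a cutoff of 0.5 is used. All variable and function names are hypothetical.

```python
import math

# Made-up data: rows = samples, columns = 2 metabolites
X = [[1, 2], [2, 1], [2, 3], [3, 2], [5, 6], [6, 5], [6, 7], [7, 6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # class per sample: 0 = healthy, 1 = sick
n, m = len(X), len(X[0])

# Mean-center X per column and y (assumption: no further scaling)
x_mean = [sum(row[j] for row in X) / n for j in range(m)]
y_mean = sum(y) / n
Xc = [[row[j] - x_mean[j] for j in range(m)] for row in X]
yc = [v - y_mean for v in y]

# Weights w: unit-length direction in X with maximal covariance with y
w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(m)]
w_len = math.sqrt(sum(v * v for v in w))
w = [v / w_len for v in w]

# PLS scores (T-PLS): one value per sample for this single component
t = [sum(Xc[i][j] * w[j] for j in range(m)) for i in range(n)]

# b-value: regression coefficient of y on the scores (one per component)
b = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

# b-PLS: regression coefficients per variable (with one component: w * b)
b_pls = [wj * b for wj in w]

def predict(x_new):
    """y-new = x-new * b-PLS, after the same centering as the training data."""
    return y_mean + sum((x_new[j] - x_mean[j]) * b_pls[j] for j in range(m))

def predict_class(x_new, cutoff=0.5):
    """Convert the continuous prediction into class 0 or 1."""
    return 1 if predict(x_new) > cutoff else 0

fitted = [predict_class(row) for row in X]
print("b-value:", b)
print("b-PLS:", b_pls)
print("fitted classes:", fitted)
print("new sample [6, 6] ->", predict_class([6, 6]))
```

Note that the new sample is centered with the means of the training data before b-PLS is applied, and the mean of y is added back so that the cutoff of 0.5 applies to the original 0/1 coding.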

What do the PLS scores do (function)?

Describe X and predict Y

Lecture (hoorcollege) 3 (43 cards)