HC 3 - Metabolomics Data Analysis 1: Exploration and Discrimination Flashcards

Lecture 3

1
Q

Goal of predictive multivariate analysis

A

To predict a class from the data, e.g. is someone sick or healthy based on their metabolite profile?

2
Q

What is the input data for explorative multivariate analysis?

A

Multivariate data matrix

3
Q

Components of a multivariate data matrix

A

Rows: individuals, samples, countries
Columns: variables (metabolites, genes; qualitative or quantitative)

4
Q

Which function is the goal of predictive multivariate analysis?

A

Y = f(X)
Y is the class of each individual (1 or 0 in the data matrix, for sick or healthy), expressed as a function of the X variables

5
Q

Univariate analysis of metabolomics data: question to answer (goal)

A

Which variables show a significant difference between groups/classes when tested one variable at a time?

6
Q

For univariate analysis, a t-test/ANOVA or a Wilcoxon test can be used. Describe the difference between these approaches

A

-T-test or ANOVA > parametric test: assumes that a mathematical formula describes the distribution of the data (e.g. a normal distribution)
-Wilcoxon > nonparametric test: no assumption about the distribution is needed
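
A small sketch of why the distinction matters, using made-up values. The t statistic depends on means and variances, so a single extreme value changes it strongly; a rank-based statistic only looks at the ordering. The rank statistic shown is the Mann-Whitney U form of the Wilcoxon rank-sum test (which Wilcoxon variant the lecture means is an assumption), and no p-values are computed here.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """t statistic (Welch form): built from group means and variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def rank_sum_u(a, b):
    """Mann-Whitney U for group a: number of pairs with a > b (ties count 0.5)."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

# Invented metabolite levels, for illustration only.
healthy = [1.0, 1.2, 1.4, 1.6, 1.8]
sick = [2.0, 2.2, 2.4, 2.6, 2.8]
sick_outlier = [2.0, 2.2, 2.4, 2.6, 28.0]  # same ordering, one extreme value

t_plain = welch_t(sick, healthy)
t_outlier = welch_t(sick_outlier, healthy)
u_plain = rank_sum_u(sick, healthy)
u_outlier = rank_sum_u(sick_outlier, healthy)

print(t_plain, t_outlier)  # the t statistic shrinks when the outlier inflates the variance
print(u_plain, u_outlier)  # the rank statistic is unchanged
```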

7
Q

Problem with univariate analysis

A

It cannot detect multivariate discrimination

8
Q

If neither of two variables can separate the groups on its own, but a clear separation is visible when they are plotted against each other: what needs to be done (multivariate discrimination)?

A

A new variable should be made that combines both variables and maximizes the between-group difference. Multivariate solution: VARclass = 1*Var1 + (-1)*Var2, so that when
-Var1 > Var2: VARclass = positive
-Var1 < Var2: VARclass = negative
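
A minimal sketch of this idea with invented values for two correlated variables (the data and function names are illustrative assumptions). Each variable on its own overlaps between the groups, but the combined variable separates them by sign.

```python
# Hypothetical data: (Var1, Var2) per sample; values are invented for illustration.
group_a = [(2.0, 1.0), (5.0, 4.0), (8.0, 7.0)]   # Var1 > Var2
group_b = [(1.0, 2.0), (4.0, 5.0), (7.0, 8.0)]   # Var1 < Var2

def var_class(var1, var2):
    """New variable VARclass = 1*Var1 + (-1)*Var2."""
    return 1 * var1 + -1 * var2

def ranges_overlap(values_a, values_b):
    """True when the value ranges of the two groups overlap."""
    return min(values_a) <= max(values_b) and min(values_b) <= max(values_a)

var1_overlap = ranges_overlap([s[0] for s in group_a], [s[0] for s in group_b])
var2_overlap = ranges_overlap([s[1] for s in group_a], [s[1] for s in group_b])
scores_a = [var_class(*s) for s in group_a]
scores_b = [var_class(*s) for s in group_b]

print(var1_overlap, var2_overlap)  # each variable alone overlaps between groups
print(scores_a, scores_b)          # VARclass is positive for one group, negative for the other
```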

9
Q

Why do correlated metabolites occur?

A

Because of pathways and feedback mechanisms
> metabolites are dependent on each other
> both variables are needed for good discrimination

10
Q

Explorative multivariate analysis: Principal component analysis (PCA): which purposes?

A

-Data reduction: PCA reduces large data matrix into two smaller matrices which are easier to plot and interpret
-Data exploration: PCA extracts most important factors (Principal Components) from data to describe multivariate interactions between measured variables.
-Data understanding: Use PCs to classify samples, identify compound spectra, determine biomarkers etc.

11
Q

What is PC1?

A

The component which describes the most variation in the data

12
Q

Basic equation PCA

A

X = t1*p1^T + t2*p2^T + … + tR*pR^T + E
= T*P^T + E
X > (I x J) is the data matrix
T > (I x R) are the scores (per sample)
P > (J x R) are the loadings (per variable)
E > (I x J) are the residuals
R is the number of Principal Components used to describe X.
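
The decomposition can be sketched with NumPy on random, made-up data. Computing the PCs through a singular value decomposition is an assumption made for this sketch; the lecture may use a different algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))   # I = 6 samples, J = 4 variables (invented data)
X = X - X.mean(axis=0)        # mean-center each column

R = 2                         # number of PCs kept
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U[:, :R] * s[:R]          # scores, shape (I, R)
P = Vt[:R].T                  # loadings, shape (J, R)
E = X - T @ P.T               # residuals, shape (I, J)

print(T.shape, P.shape, E.shape)
print(np.allclose(X, T @ P.T + E))  # X = T*P^T + E holds by construction
```

With this construction the columns of P are orthonormal and the scores on PC1 have at least as much variance as those on PC2.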

13
Q

What are the scores?

A

The new values (coordinates) of the samples on the PCs, for example PC1 and PC2 > they give the position of each sample

14
Q

Why are the loadings written as p^T in the equation?

A

It means that the loading vector p (the direction of the PC across the variables) is transposed > it is turned into a row

15
Q

What are loadings and why are they needed?

A

Weights are needed for each variable to determine the direction of the line (the line is the new PC through all dimensions)
> P are the loadings (for each variable)
> 2 PCs means 2 loadings needed per variable.

16
Q

Each PC can be written as a distinct matrix. How can the scores and loadings be used to describe data? Say there are two samples with five variables each: the values -3, -3, 0, 3, 3 for the first sample and -2, -2, 3, 2, 2 for the second.

A

PC1
Scores: -6 for sample1 and -4 for sample2
Loading: 0.5,0.5,0,-0.5,-0.5
This describes:
-3 -3 0 3 3
-2 -2 0 2 2
PC2
Scores: 0 for sample1 (already fully described by PC1) and 3 for sample2
Loading: 0,0,1,0,0
PC1 and PC2 describe
-3 -3 0 3 3
-2 -2 3 2 2
Done!
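
The arithmetic of this card can be checked with a short sketch: each PC contributes the outer product of its scores and its loading, and the contributions are added. The helper names are illustrative.

```python
scores_pc1 = [-6, -4]
loading_pc1 = [0.5, 0.5, 0, -0.5, -0.5]
scores_pc2 = [0, 3]
loading_pc2 = [0, 0, 1, 0, 0]

def outer(t, p):
    """Outer product t * p^T: one row per sample, one column per variable."""
    return [[ti * pj for pj in p] for ti in t]

def add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

pc1_part = outer(scores_pc1, loading_pc1)                   # what PC1 describes
X_rebuilt = add(pc1_part, outer(scores_pc2, loading_pc2))   # PC1 + PC2

print(pc1_part)
print(X_rebuilt)
```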

17
Q

Why is it important to make a residual E and not make more PCs?

A

Additional PCs are not insightful: components are extracted until only what seems to be noise remains, and that noise is left in E (we do not want to describe noise)
> we only want to describe systematic variation

18
Q

What does the loading say?

A

How each variable contributes to the multivariate direction of the PC

19
Q

What does a score plot show?

A

The samples/individuals as dots in a plot with PC1 and PC2 as axes for example.
> the first PC does not have to distinguish the classes the best, this could also be the second PC.

20
Q

What does a loading plot show? If a dot is around 0, what does this mean.

A

The loading plot also uses PC1 and PC2 as axes, so the positions of its dots can be compared with the positions of the sample dots in the score plot. The dots are the variables (metabolites etc.); loadings around zero mean that these variables have only a small effect on the model.

21
Q

The residuals (E), the non-systematic part of the data, can be summarized as sums across objects or variables. What do high E values mean for individuals and for variables?

A

Individual: a high E means that the individual is poorly described by the PCA model
Variable: a high E means that the metabolite/variable is poorly described by the PCA model and might be an outlier
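
A sketch of such sums, assuming they are sums of squared residuals (the card does not say which sum is used). The data are the two samples from the worked example earlier in this deck, described with PC1 only, so the part that PC2 would describe ends up in E.

```python
import numpy as np

X = np.array([[-3.0, -3.0, 0.0, 3.0, 3.0],
              [-2.0, -2.0, 3.0, 2.0, 2.0]])
T = np.array([[-6.0], [-4.0]])                    # PC1 scores
P = np.array([[0.5], [0.5], [0.0], [-0.5], [-0.5]])  # PC1 loading

E = X - T @ P.T                                   # residuals of the one-PC model
residual_per_sample = (E ** 2).sum(axis=1)        # sum across variables
residual_per_variable = (E ** 2).sum(axis=0)      # sum across samples

print(residual_per_sample)    # sample 2 is poorly described by PC1 alone
print(residual_per_variable)  # variable 3 is poorly described by PC1 alone
```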

22
Q

Why is centering and scaling of the data necessary?

A

Because we are interested in the differences between objects, not in their absolute values
> scaling is needed because, when variables are measured in different units, it gives each variable an equal chance of contributing to the model

23
Q

Mean-centering

A

Subtract the column mean from every value in that column of X, so that the model describes the variation around the mean and not the distance of the values from zero
> the new column means are zero
> the column with the largest values after mean-centering shows the largest variation
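
Mean-centering can be sketched in NumPy as follows; the matrix is invented for illustration.

```python
import numpy as np

# Rows are samples, columns are variables (made-up values).
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

column_means = X.mean(axis=0)
X_centered = X - column_means   # subtract each column's mean from that column

print(X_centered)
print(X_centered.mean(axis=0))  # column means are now zero
```

The second column still has the largest values after centering, which reflects its larger variation.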

24
Q

Why is variable/feature scaling needed?

A

To make variables comparable with respect to their biological effects (are large peaks more important than small peaks?) or to equalize their relative importance.

25
Q

Variable / Feature Scaling

A

Divide each value in each column of X by the scaling factor of that column (for example the standard deviation of the column).
> with the standard deviation as scaling factor, all columns end up with a standard deviation of 1.0

26
Q

Types of scaling with scaling factors

A

-Range scaling: scaling factor is the biological range (Max-Min) of the values of a variable
-Standardization/autoscaling: scaling factor is the standard deviation of the values of a variable
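
Both scaling types can be sketched on a small invented matrix. Two assumptions are made here: the sample standard deviation (`ddof=1`) is used, and the observed max-min of each column stands in for the biological range.

```python
import numpy as np

# Rows are samples, columns are variables on very different scales (made-up values).
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

X_centered = X - X.mean(axis=0)

# Autoscaling: divide each column by its standard deviation.
autoscaled = X_centered / X.std(axis=0, ddof=1)

# Range scaling: divide each column by its (observed) range.
range_scaled = X_centered / (X.max(axis=0) - X.min(axis=0))

print(autoscaled.std(axis=0, ddof=1))                      # all 1.0
print(range_scaled.max(axis=0) - range_scaled.min(axis=0))  # all 1.0
```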

27
Q

Log transformation

A

Take the log of each individual value
> make data distribution somewhat normal (Gaussian)

28
Q

Difference between centering/scaling methods and log transformation

A

Log transformation is performed on each individual value; centering and scaling are performed per column

29
Q

Problems with log transformation

A

Values close to 0 are problematic: the log of 0 is undefined and very small values become large negative numbers
-Square root transformation is less problematic
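
A short sketch of the problem with invented peak intensities. Returning `None` for a zero value is a choice made for this example only, not a recommended way of handling zeros.

```python
import math

values = [0.0, 0.001, 1.0, 100.0, 10000.0]  # made-up intensities

def safe_log10(v):
    """log10 is undefined at 0, so return None there instead of raising."""
    return math.log10(v) if v > 0 else None

logged = [safe_log10(v) for v in values]
rooted = [math.sqrt(v) for v in values]

print(logged)  # the zero cannot be transformed; 0.001 becomes a large negative number
print(rooted)  # the square root handles 0 and small values without trouble
```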

30
Q

When is log transformation needed?

A

If the data look unevenly distributed (skewed)

31
Q

When is log transformation performed?

A

Before centering and scaling

32
Q

Why is Discriminant PLS needed when PCA exists?

A

PCA scores are good summarizers of the X data, but not necessarily of Y (classes/groups) > in many cases the first PC doesn’t discriminate the groups.
-DPLS scores are used to predict the class Y for data X

33
Q

How is Y calculated for each X variable in DPLS?

A

A different b-PLS coefficient is calculated per variable: the regression coefficient for predicting Y.
> PLS scores are chosen so that they can predict the Y value

34
Q

Formula X in DPLS

A

X = T-PLS * P^T + E

35
Q

Regression of Class y in DPLS

A

y = T-PLS * b + f
> y-pred = T-PLS * b (b is the b-value: one regression coefficient per PLS component)
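
A sketch of one PLS component on invented data, to show how scores, the score regression and the residual f relate. In this sketch `b` is the regression coefficient of y on the scores and `b_pls` holds the per-variable coefficients. The single-component algorithm and all data values are assumptions; the lecture does not specify how the scores are computed.

```python
import numpy as np

# Made-up data: 6 samples, 3 variables; class y coded 0 (healthy) / 1 (sick).
X = np.array([[1.0, 2.0, 0.5],
              [1.2, 1.8, 0.4],
              [0.9, 2.2, 0.6],
              [3.0, 0.5, 0.5],
              [3.2, 0.4, 0.6],
              [2.8, 0.7, 0.4]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

Xc = X - X.mean(axis=0)       # mean-center X
yc = y - y.mean()             # mean-center y

w = Xc.T @ yc
w = w / np.linalg.norm(w)     # weights: direction in X that covaries most with y
t = Xc @ w                    # PLS scores (T-PLS), one value per sample
b = (yc @ t) / (t @ t)        # regression of y on the scores
f = yc - t * b                # residual of y
b_pls = w * b                 # per-variable coefficients (valid for one component)

y_pred = t * b + y.mean()     # prediction on the original 0/1 scale
print(np.round(y_pred, 2))
```

For a single component, predicting from the scores (`t * b`) and predicting from the variables (`Xc @ b_pls`) give the same result.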

36
Q

What does the T-PLS mean in DPLS?

A

The PLS scores

37
Q

What do a high and a low b-value mean?

A

High: Y is predictable
Low: Y is not predictable very well

38
Q

Difference between b-value and b-PLS in DPLS

A

The b-value shows which component (PC) is important for prediction in general; b-PLS gives a regression coefficient per variable

39
Q

What is f in the DPLS?

A

The residual of y: the part of the class that the model does not predict

40
Q

Two goals of DPLS

A

-Model of data > X formula
> is the data well described
> clustering/outliers
> do new samples fit the model of the data?
-Calibration model between scores and class > y formula
> is there relationship between T-PLS and Y
> what is the prediction error?

41
Q

How is y-pred interpreted?

A

Negative values > Y=0
Positive values > Y=1
both binary values of Y are classes
alternatively, a cutoff of 0.5 is used
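
A sketch of both rules with invented predictions. Reading the sign rule as applying to mean-centered predictions, and the 0.5 cutoff as applying to predictions on the 0/1 scale, is an assumption made for this example.

```python
def classify(y_pred, cutoff):
    """Assign class 1 when the prediction exceeds the cutoff, else class 0."""
    return [1 if value > cutoff else 0 for value in y_pred]

# Centered predictions (made-up): the sign decides, so the cutoff is 0.
classes_centered = classify([-0.4, 0.1, 0.6], cutoff=0.0)

# Predictions on the 0/1 scale (made-up): cutoff 0.5.
classes_uncentered = classify([0.1, 0.6, 1.1], cutoff=0.5)

print(classes_centered, classes_uncentered)
```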

42
Q

Prediction model for new data of DPLS

A

for x-new
> y-new = x-new * b-PLS
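
The prediction is a weighted sum over the variables. A minimal sketch with hypothetical coefficients, assuming x-new has been centered and scaled in the same way as the training data:

```python
def predict(x_new, b_pls):
    """y-new = x-new * b-PLS: sum of each variable times its coefficient."""
    return sum(x * b for x, b in zip(x_new, b_pls))

b_pls = [0.3, -0.2, 0.0]   # hypothetical per-variable coefficients
x_new = [1.0, -1.5, 0.4]   # hypothetical preprocessed new sample

y_new = predict(x_new, b_pls)
print(y_new)  # the third variable has coefficient 0 and does not contribute
```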

43
Q

What do the PLS scores do (function)?

A

Describe X and predict Y