HC 7 - Metabolomics Data Analysis 2: Biomarkers & Validation Flashcards

Question 1

Q

PCA scores and loadings

Answer

A

Scores: describe the samples in X (where each sample lies along the principal components)
Loadings: show which variables are important for the differences between samples

Question 2

Q

DPLS scores and regression coefficients

Answer

A

PLS scores: describe X and predict Y
PLS regression coefficients: indicate which variables are important for discrimination

Question 3

Q

After a biomarker pattern is found, what is done next?

Answer

A

> Statistical validation, repeating the steps from clean data to biomarker pattern
> Biological and/or external validation

Question 4

Q

Biological validation

Answer

A

New experiments
> PLS calculates scores and regression coefficients, but is the analysis correct?

Question 5

Q

PLS model: multivariate model

Answer

A

Use the coherence between variables to build a model that discriminates between classes
> you need a PLS method that describes how the variables cohere
> you want a situation with fewer variables than samples
> so the variables are combined into scores

Question 6

Q

Optimistic bias in statistical models

Answer

A

All statistical models are too optimistic, because their parameters are estimated from the same data that was used to build the model

Question 7

Q

Validation of model principle

Answer

A

Can the classification model predict the health status of new individuals?

Question 8

Q

Statistical multivariate model validation approaches

Answer

A

-Average prediction error
-Statistical significance (p-value) of multivariate model

Question 9

Q

Main points of average prediction error

Answer

A

-Confidence intervals for new samples
-Separate blinded test set
-For a small number of samples: cross-validation (leave one individual out and use the model to predict it).

Question 10

Q

Main points of statistical significance of a multivariate model

Answer

A

-H0 distribution: is the difference larger than expected for equal groups?
-Permutations: assign each individual to a class at random; does the prediction model perform differently from this situation?

Question 11

Q

Average prediction error: test set validation

Answer

A

Use the bPLS coefficients to predict the class of a test set for which the true class is known. Measure the prediction error on the test set.
(e.g. y-pred is 0.3 with a threshold of 0.5, but the sample is a healthy individual whose value should be 1, so it is misclassified)

Question 12

Q

Summarize prediction errors in a confusion table. How?

Answer

A

The columns give the true condition (positive and negative, e.g. patient and healthy) and the rows give the prediction (positive and negative).
> So the cell with a negative true condition and a positive prediction contains the false positives, and so on
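
A minimal sketch of how such a table could be tallied, assuming "patient" is the positive class. The labels are invented and `confusion_table` is a hypothetical helper, not part of any particular library.

```python
def confusion_table(true_labels, predicted_labels, positive="patient"):
    """Count TP, FP, FN and TN, treating `positive` as the positive class."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive and truth == positive:
            counts["TP"] += 1
        elif pred == positive:
            counts["FP"] += 1  # predicted positive, truly negative
        elif truth == positive:
            counts["FN"] += 1  # predicted negative, truly positive
        else:
            counts["TN"] += 1
    return counts


# Made-up example labels (illustration only).
truth = ["patient", "patient", "patient", "healthy", "healthy", "healthy"]
pred = ["patient", "patient", "healthy", "patient", "healthy", "healthy"]

table = confusion_table(truth, pred)

# Columns = true condition, rows = prediction, as described on the card.
print("                 true positive   true negative")
print(f"pred. positive   {table['TP']:>13}   {table['FP']:>13}")
print(f"pred. negative   {table['FN']:>13}   {table['TN']:>13}")
```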

Question 13

Q

Sensitivity

Answer

A

True Positives / (True Positives + False Negatives)

Question 14

Q

Specificity

Answer

A

True Negatives / (True Negatives + False Positives)
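
The two ratios can be written as small functions. This is only a sketch; the counts in the example are invented and the function names are assumptions.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of truly positive samples that the model predicts as positive."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of truly negative samples that the model predicts as negative."""
    return true_negatives / (true_negatives + false_positives)


# Made-up counts from a confusion table (illustration only).
tp, fn, tn, fp = 40, 10, 45, 5

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```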

Question 15

Q

Cross validation

Answer

A

For a small sample size, divide the samples into Xtest and Xtrain
> build an optimized model M with Xtrain (M = bPLS coefficients)
> use M to predict Y for Xtest
> repeat with different Xtrain and Xtest until each sample has been in a test set
> count the number of misclassifications
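
The loop above can be sketched as leave-one-out cross-validation. To keep the example self-contained, a simple nearest-centroid classifier stands in for the PLS model; that substitution is an assumption made here for illustration, not the method from the lecture. The data and all function names are made up.

```python
def fit_centroids(X_train, y_train):
    """Stand-in for model building: the mean of each variable per class."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, y in zip(X_train, y_train) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(model, x):
    """Predict the class whose centroid is closest to sample x."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda label: squared_distance(model[label], x))


def leave_one_out_misclassifications(X, y):
    """Leave each sample out once, rebuild the model, and count wrong predictions."""
    errors = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]          # all samples except sample i
        y_train = y[:i] + y[i + 1:]
        model = fit_centroids(X_train, y_train)  # a new model in every round
        if predict(model, X[i]) != y[i]:         # sample i is the test set
            errors += 1
    return errors


# Made-up data: two metabolite intensities per sample (illustration only).
X = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.2], [3.0, 0.5], [3.2, 0.4], [2.9, 0.7]]
y = ["healthy", "healthy", "healthy", "patient", "patient", "patient"]

n_errors = leave_one_out_misclassifications(X, y)
print(f"misclassifications in cross-validation: {n_errors} of {len(X)}")
```

The key point is that the left-out sample never contributes to the model that predicts it, which is what avoids the optimistic bias of testing on the training set.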

Question 16

Q

When using your training set as a test set, then …

Answer

A

there is a biased prediction (too optimistic for new data)

Question 17

Q

Statistical significance validation

Answer

A

-Permute the class labels (or y-values)
-Build new models between X and the permuted y; these represent situations where there is no link between X and y
-Compare the original prediction error with the prediction errors of many models built on permuted data
-Calculate the p-value

Question 18

Q

How to calculate p-value for statistical test for validation of model

Answer

A

p < (1 + number of permutation models that perform better than the original) / (total number of permutations)
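
A small sketch of this calculation, following the formula on the card and taking "better" to mean a lower prediction error (an assumption). The error counts are invented, and `permuted_labels` and `permutation_p_value` are hypothetical names.

```python
import random


def permuted_labels(labels, rng):
    """Return a shuffled copy of the class labels, breaking any link with X."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def permutation_p_value(original_error, permuted_errors):
    """(1 + number of permutation models with a lower error) / number of permutations."""
    n_better = sum(1 for e in permuted_errors if e < original_error)
    return (1 + n_better) / len(permuted_errors)


# Shuffling keeps the class sizes but removes the relation with the samples.
rng = random.Random(0)
labels = ["healthy"] * 3 + ["patient"] * 3
shuffled = permuted_labels(labels, rng)

# Made-up misclassification counts (illustration only): the original model
# makes 2 errors, and 20 models built on permuted labels make these errors.
original_error = 2
permuted_errors = [9, 11, 10, 8, 12, 10, 9, 1, 11, 10,
                   13, 9, 10, 8, 12, 11, 10, 9, 10, 11]

p_bound = permutation_p_value(original_error, permuted_errors)
print(f"p < {p_bound}")
```

Some texts divide by the number of permutations plus one instead; the version here follows the card.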

Question 19

Q

PLS-DA validation with cross model validation and permutation tests

Answer

A

> Measure the misclassifications with cross-validation
> Calculate the p-value with permutation models

Question 20

Q

The p-value for permutation validation is significant for metabolites which …

Answer

A

are the best indicators for the coefficients and the classes

Question 21

Q

What is the H0 for permutations?

Answer

A

During permutations, bPLS coefficients are estimated for the meaningless permutation models. The H0 distribution is built from the numbers of misclassifications of these models.

Question 22

Q

What are the expected bPLS values for the permutation models?

Answer

A

In the permutation models there is no relationship between X and y, so no variable has an effect and the coefficients are expected to vary around 0.

Question 23

Q

If the variable (metabolite/biomarker) is important, then the bPLS coefficients from the model should be … than the bPLS coefficients from the permutations

Answer

A

Larger or smaller, i.e. further from 0 than the permutation coefficients.

Question 24

Q

Visualize a plot that shows variables ranked by permuted coefficients on the x-axis and coefficients on the y-axis. Black lines form a sideways parabola around 0, with a spiky red curve in between and red circles where the red line crosses the black lines. What could this mean?

Answer

A

-The black lines indicate the 95% confidence interval of H0 from the permutation tests (H0: PLS regression coefficient = 0)
-The red line: PLS regression coefficient of the original model, per variable
-Red circles: points where the original model falls outside the 95% CI of the permutation-model coefficients, so the coefficient differs significantly from those of the permutation test. These variables are good biomarkers
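
The logic behind the red circles can be sketched as follows: for each variable, take the 2.5th and 97.5th percentiles of its coefficients across the permutation models and flag the variable when the original coefficient falls outside that interval. The coefficients and metabolite names are invented, and `percentile` and `significant_variables` are hypothetical helpers written for this example.

```python
def percentile(sorted_values, q):
    """Percentile (q between 0 and 100) of a sorted list, with linear interpolation."""
    pos = (len(sorted_values) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def significant_variables(original_coefs, permuted_coefs):
    """Return variables whose original coefficient lies outside the 95% H0 interval.

    original_coefs: {variable: coefficient from the original model}
    permuted_coefs: {variable: [coefficients from the permutation models]}
    """
    flagged = []
    for name, coef in original_coefs.items():
        null = sorted(permuted_coefs[name])
        low, high = percentile(null, 2.5), percentile(null, 97.5)
        if coef < low or coef > high:
            flagged.append(name)
    return flagged


# Made-up numbers (illustration only): permutation coefficients vary around 0.
null_coefs = [i / 100 for i in range(-50, 51)]
permuted = {"maltose": null_coefs, "alanine": null_coefs, "lactate": null_coefs}
original = {"maltose": 0.9, "alanine": 0.1, "lactate": -0.7}

flagged = significant_variables(original, permuted)
print("candidate biomarkers:", flagged)
```

Both a large positive and a large negative coefficient are flagged, which matches the "larger or smaller" answer on the earlier card.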

Question 25

Q

Explain the following concepts in one sentence: Cross validation, training set, test set, permutation test

Answer

A

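-Cross-validation: used with a small number of samples; each sample is left out in turn and predicted by a model built on the rest
-Training set: used to create models (estimate loadings/coefficients)
-Test set: new data used to test the PLS regression coefficients by predicting the class of new samples
-Permutation test: randomly reorder the class labels to see how a model performs when there is no link between X and y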

Question 26

Q

Do you make a new model every time when you take individuals from the data for prediction in cross validation?

Answer

A

Yes, with cross-validation you create a new model every time; the training set is also slightly different each time because you leave out different samples.
- You take out one or more samples to use as a test set and use the rest as the training set. Rather than cross-validation, a good test cohort is preferred, so that the model can be validated on individuals other than those in the training set.

Question 27

Q

PCA loadings (P) give information about

Answer

A

The most important variables to describe variation between samples in the data

Question 28

Q

DPLS: bPLS coefficients give most information on

Answer

A

Important variables for discrimination

Question 29

Q

Data interpretation, what does it mean if a certain group of metabolites is more abundant in healthy or disease condition?

Answer

A

Then they may be part of a network, or pathway
> pathway analysis
> network visualization
> Comprehensive metabolite databases

Question 30

Q

Let's say you have a cumulative predictive accuracy list, and the top metabolite is maltose. What happens if we remove maltose from the prediction model?

Answer

A

The predictive accuracy becomes a lot smaller

Question 31

Q

To identify groups of metabolites which are very different between the groups and therefore predictive, what can be done with the cumulative prediction accuracy list?

Answer

A

Give metabolites color codes for the metabolite groups they belong to, e.g. glucose metabolism for maltose

Question 32

Q

How do you recognize a specific over-represented group of metabolites in the cumulative accuracy list?

Answer

A

Many metabolites of the same group at the top of the list

Question 33

Q

For biological interpretation of the group of biomarkers, it is important to change interpretation from … to …

Answer

A

single metabolites to groups of metabolites, such as biochemical pathways

Question 34

Q

Over-representation analysis input and output

Answer

A

Use a list of significant metabolites (not ordered)
> metabolite set enrichment
> over-representation analysis
> result: biological processes which are different between the groups

Question 35

Q

Metabolite set enrichment

Answer

A

Give weights to metabolites based on relevance (fold change or importance)

Question 36

Q

Single Sample Profiling

Answer

A

Check whether metabolite concentrations fall within the normal range; if not, compare the deviation pattern to known causes

Question 37

Q

Over-representation analysis uses a …. distribution

Answer

A

hypergeometric

Question 38

Q

Binomial coefficient

Answer

A

n over k > in how many ways can I select k objects from a group of n?
(n k) = n! / (k! (n-k)!)
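
As a quick check of the formula, the sketch below compares the factorial expression with Python's built-in `math.comb`. The function name `binomial` is an assumption made for this example.

```python
from math import comb, factorial


def binomial(n, k):
    """Number of ways to select k objects from n: n! / (k! (n-k)!)."""
    return factorial(n) // (factorial(k) * factorial(n - k))


# The factorial formula and math.comb give the same count.
ways = binomial(13, 10)
print(ways, comb(13, 10))
```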

Question 39

Q

Over-representation analysis using the hypergeometric distribution. Take 52 metabolites divided into 4 pathway groups (A-D) of 13 metabolites each. If you select 12 metabolites, what is the probability that 10 are from pathway A (because this is what was actually measured)?

Answer

A

Probability using binomial coefficients = ( (13|10) * (39|2) ) / (52|12)
-(13|10): 10 metabolites from 13 possible in A
-(39|2): 2 metabolites from 39 of B-D.
-(52|12): all possibilities of selecting 12 from 52.
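
The expression above can be evaluated directly. The sketch below computes the probability of exactly 10 metabolites from pathway A and, as an addition not on the card, the probability of 10 or more, which is the tail often reported as the over-representation p-value. The function name `hypergeom_prob` is hypothetical.

```python
from math import comb


def hypergeom_prob(hits, pathway_size, total, selected):
    """Probability of exactly `hits` pathway members among `selected` of `total` items."""
    return (
        comb(pathway_size, hits)
        * comb(total - pathway_size, selected - hits)
        / comb(total, selected)
    )


TOTAL = 52      # all metabolites
PATHWAY = 13    # metabolites in pathway A
SELECTED = 12   # significant metabolites selected
HITS = 10       # selected metabolites that belong to pathway A

# Exactly 10 from A: (13|10) * (39|2) / (52|12)
p_exact = hypergeom_prob(HITS, PATHWAY, TOTAL, SELECTED)

# 10 or more from A: sum over 10, 11 and 12 hits (an extra step, not on the card)
p_tail = sum(
    hypergeom_prob(k, PATHWAY, TOTAL, SELECTED)
    for k in range(HITS, min(PATHWAY, SELECTED) + 1)
)

print(f"P(exactly 10 from A) = {p_exact:.3e}")
print(f"P(10 or more from A) = {p_tail:.3e}")
```

Both values are on the order of one in a million, which is the kind of very small probability the next card interprets.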

Question 40

Q

If the probability from the over-representation is very small, then …

Answer

A

there is a systematic effect between the two groups, which is important: the probability is too low for the result to be chance, so the disease (or class) is related to this metabolite group > pathway A differs between the two groups
