HC 7 - Metabolomics Data Analysis 2: Biomarkers & Validation Flashcards
Lecture 7
PCA scores and loadings
Scores: describe X
Loadings: variables important for the differences between samples
DPLS scores and regression coefficients
PLS scores: describe X + predict Y
PLS regression coefficients: variables important for discrimination
After a biomarker pattern is found, what is done next?
> Statistical validation and repeat steps from clean data to biomarker pattern
Biological and/or external validation
Biological validation
New experiments
> PLS calculates scores and regression coefficients, but is the analysis correct?
PLS model: multivariate model
Use coherence to make a model for discrimination of classes
> you need a PLS method that describes the coherence between the variables
> you want a situation with fewer variables than groups
> variables are combined into scores
Optimistic bias in statistical models
All statistical models are too optimistic because their parameters are estimated from the same data that was used to build the model
Validation of model principle
Can the classification model predict the health status of new individuals?
Statistical multivariate model validation approaches
-Average prediction error
-Statistical significance (p-value) of multivariate model
Main points of average prediction error
-Confidence intervals for new samples
-Separate blinded test set
-For a small number of samples: cross-validation: leave one individual out, build the model on the rest, and use it to predict that individual.
Main points of statistical significance of a multivariate model
-For H0 distribution > is the difference larger than expected for equal groups?
-Permutations: assign each individual to a class at random; does the prediction model differ from this random situation?
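The permutation idea above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the absolute difference in group means of a single variable stands in for the performance of the multivariate model, and the data, the function names and the number of permutations are made up for the example.

```python
import random

def mean_difference(values, labels):
    """Stand-in statistic (NOT a PLS model): absolute difference in group means."""
    group1 = [v for v, lab in zip(values, labels) if lab == 1]
    group0 = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(sum(group1) / len(group1) - sum(group0) / len(group0))

def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Fraction of random labelings that separate the groups at least as well."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    observed = mean_difference(values, labels)
    shuffled = list(labels)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)          # random class assignment = H0 situation
        if mean_difference(values, shuffled) >= observed:
            at_least_as_large += 1
    # +1 counts the observed labeling itself as one of the permutations
    return (at_least_as_large + 1) / (n_permutations + 1)

# Made-up measurements of one variable for 5 individuals per class
values = [5.1, 4.8, 5.5, 5.0, 5.3, 3.9, 4.1, 3.7, 4.0, 4.2]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

p_value = permutation_p_value(values, labels)
print("permutation p-value:", p_value)
```

A small p-value means that random class assignments rarely give a difference as large as the real one, so the observed difference is larger than expected under H0.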
Average prediction error: test set validation
Use the bPLS coefficients to predict the class of a test set for which the true class is known. Measure the prediction error on the test set.
(e.g. y-pred is 0.3 for a healthy individual whose true value is 1; with a threshold of 0.5 this individual is misclassified)
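A minimal Python sketch of this test-set check. The coefficient vector `b` and the test data are made-up illustration values, not results from the lecture; the coding healthy = 1 and the threshold of 0.5 follow the example above, and coding the other class as 0 is an assumption.

```python
def predict_y(x_row, b):
    """Linear prediction: y_pred = sum over variables of x_j * b_j."""
    return sum(x * coef for x, coef in zip(x_row, b))

def classify(y_pred, threshold=0.5):
    """Class 1 (healthy) if y_pred reaches the threshold, otherwise class 0."""
    return 1 if y_pred >= threshold else 0

# Hypothetical regression coefficients for two variables
b = [0.4, -0.2]

# Hypothetical test set with known classes (1 = healthy, 0 = other class)
X_test = [[1.0, 0.5], [2.0, 0.5], [0.5, 1.0], [2.5, 1.0]]
y_test = [1, 1, 0, 1]

y_pred = [predict_y(row, b) for row in X_test]
y_class = [classify(y) for y in y_pred]

# The first individual gets y_pred of about 0.3: healthy, but predicted as class 0
errors = sum(1 for pred, true in zip(y_class, y_test) if pred != true)
error_rate = errors / len(y_test)
print("misclassified:", errors, "of", len(y_test))
```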
Summarize prediction errors in a confusion table. How?
The columns give the true condition (positive and negative, e.g. patient and healthy) and the rows give the prediction (positive and negative).
> So the cell with a negative true condition and a positive prediction contains the false positives, and so on
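The four cells of such a table can be counted as in the sketch below. The labels are made up, and coding the positive class as 1 is an assumption made for the example.

```python
def confusion_table(y_true, y_pred, positive=1):
    """Count the four cells of a two-class confusion table."""
    tp = fp = fn = tn = 0
    for true, pred in zip(y_true, y_pred):
        if pred == positive and true == positive:
            tp += 1      # predicted positive, truly positive
        elif pred == positive:
            fp += 1      # predicted positive, truly negative
        elif true == positive:
            fn += 1      # predicted negative, truly positive
        else:
            tn += 1      # predicted negative, truly negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Made-up true classes and predictions for 8 individuals
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

table = confusion_table(y_true, y_pred)

# Columns: true condition, rows: prediction
print("               true +   true -")
print(f"predicted +    {table['TP']:6d}   {table['FP']:6d}")
print(f"predicted -    {table['FN']:6d}   {table['TN']:6d}")
```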
Sensitivity
True Positives / (True Positives + False Negatives)
Specificity
True Negatives / (True Negatives + False Positives)
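As a small worked example, the two formulas can be written in Python. The counts are made up for illustration, and the sketch assumes each true class contains at least one individual (otherwise the division fails).

```python
def sensitivity(tp, fn):
    """Fraction of truly positive individuals that are predicted positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of truly negative individuals that are predicted negative."""
    return tn / (tn + fp)

# Made-up counts from a confusion table
tp, fn = 40, 10
tn, fp = 45, 5

sens = sensitivity(tp, fn)   # 40 / 50
spec = specificity(tn, fp)   # 45 / 50
print("sensitivity:", sens, "specificity:", spec)
```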
Cross validation
For a small sample size, divide the samples into Xtest and Xtrain
> make an optimized model M with Xtrain (M = bPLS coefficients)
> Use M to predict Y for Xtest
> Repeat with different Xtrain and Xtest until each sample has been in a test set
> Count the number of misclassifications
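The loop above can be sketched in Python with a test set of one sample at a time (leave-one-out). Note the assumption: a nearest-class-mean classifier stands in for the PLS model here, so `fit` returns class means rather than bPLS coefficients, and all data are made up.

```python
def fit(X_train, y_train):
    """Stand-in model M (NOT a PLS model): the mean vector of each class."""
    means = {}
    for label in set(y_train):
        rows = [x for x, y in zip(X_train, y_train) if y == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return means

def predict(model, x):
    """Predict the class whose mean vector is closest to x."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda label: squared_distance(x, model[label]))

def leave_one_out(X, y):
    """Each sample is the test set once; the model never sees it during fitting."""
    misclassified = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)        # model M from Xtrain only
        if predict(model, X[i]) != y[i]:     # use M to predict Y for Xtest
            misclassified += 1
    return misclassified

# Made-up data: two variables, 7 individuals; the fourth resembles the other class
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [2.8, 0.6],
     [3.0, 0.5], [3.2, 0.7], [2.9, 0.4]]
y = [0, 0, 0, 0, 1, 1, 1]

n_errors = leave_one_out(X, y)
print("misclassifications:", n_errors, "of", len(X))
```

With larger data sets the same loop can hold out several samples per round instead of one, as long as every sample ends up in a test set once.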
When using your training set as a test set, then …
there is a biased prediction (too optimistic for new data)