Chapter 11 + 12 Flashcards
Corresponding to the HCs (lectures) of week 1 of the course's second part
Similarities between the data analysis pipelines of different omics approaches
-All technologies yield many measurements per sample
-Dimensionality is handled in the same way
-Each yields hundreds or thousands of variables per sample, such as different genes, proteins or metabolites
Samples are organised in a matrix
Rows: the samples
Columns: the variables (like genes)
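This layout can be sketched as a numpy array; the sample count, variable count and values below are made-up for illustration:

```python
import numpy as np

# Hypothetical omics data matrix: rows = samples, columns = variables
# (e.g. genes); the intensities are randomly generated placeholders.
rng = np.random.default_rng(0)
n_samples, n_variables = 6, 1000
X = rng.lognormal(mean=2.0, sigma=1.0, size=(n_samples, n_variables))

print(X.shape)   # one row per sample, one column per variable
print(X[0, :5])  # measurements of the first 5 variables in sample 0
```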
Four components of the generalized data analysis pipeline
- Experimental design and data collection
- Data preprocessing and quality control
- Data analysis
- Biological interpretation
The experimental design can have a large impact on the statistical power and therefore on the …
Conclusions that are reached
First step in experimental design
Frame a biological question
What is the aim of the biological question?
Determine the hypothesis that will be tested and the statistical test that will be executed.
What does the biological question determine that is needed for an interpretable and successful outcome?
The experimental preconditions
Three main types of objectives, each requiring a different type of experimental design
- Detection of responsive features under controlled experimental conditions (perturbation study)
- Detection of biomarkers
- Identification of regulatory or mechanistic relationships between variables
Experimental design steps after framing the biological question
Identify noise factors and design the experiment
Noise factors
Factors that can disturb a proper measurement (from the biological experiment up to and including the measurement)
Noise factors can lead to …
bias
Three basic principles to deal with noise factors.
- Replication
- Randomization
- Blocking
What is the aim of the experimental design?
Ensure reliable measurements free from bias
Replication
Duplicate, repeat or perform the same measurement more than once
> obtain an estimate of the experimental error
On what does the type of error estimated with replication depend?
On how the replication is done
> For estimating and controlling biological variability, different organisms or batches of cell samples should be processed in the same manner.
Types of replication errors
-Repeatability: error based on repeats of sample measurement (same sample)
-Reproducibility: error based on sample workup or sampling/the whole experiment (larger errors)
Types of replicates
-Biological replicates: error based on the whole experiment (including the organisms) > we are not interested in one individual
-Technical replicates: to gain statistical power
Randomization
Requiring the experimenter to use random choices for every factor that is not of interest but might influence the outcome of the experiment
> random selection of individuals for groups
> hybridization of mRNA samples from the treatment and control groups is sensitive to external factors: it is important not to measure all controls first and then all treated samples, because otherwise the time effect (not interesting) cannot be distinguished from the treatment effect (interesting)
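A minimal sketch of randomizing a measurement order, with a hypothetical plan of 4 control and 4 treated samples:

```python
import numpy as np

# Hypothetical measurement plan: 4 control and 4 treated samples.
samples = ["ctrl"] * 4 + ["trt"] * 4

# Bad order: all controls first, then all treated; a time (drift) effect
# would then be confounded with the treatment effect.
bad_order = list(samples)

# Randomized order: time effects average out over both groups.
rng = np.random.default_rng(42)
good_order = list(rng.permutation(samples))

print(good_order)
```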
Confounder
A confounder is a variable whose presence affects the variables being studied, so that the results do not reflect the actual relationship (e.g. time: randomize over time to eliminate the bias)
Blocking
Arranging experimental samples in groups (blocks) that are similar to one another
(e.g. gender, or different columns)
> but within the groups the variation of treated/control needs to be similar
> or: blocks because not all measurements can be done on one day
> eliminating confounding effect of gender or LC column
General rule for blocking
Block what you can, randomize what you cannot (treated/control is not blockable)
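The rule above can be sketched in code; the two-day design and sample names are hypothetical. Day is the block; treatment cannot be blocked, so it is balanced and randomized within each block:

```python
import numpy as np

# Hypothetical blocked design: 4 control and 4 treated samples, measured
# on 2 days (the blocks). Each day gets 2 controls and 2 treated samples;
# within each day the measurement order is randomized.
rng = np.random.default_rng(7)
controls = [f"ctrl{i}" for i in range(4)]
treated = [f"trt{i}" for i in range(4)]
rng.shuffle(controls)
rng.shuffle(treated)

blocks = {}
for day in (1, 2):
    block = controls[2 * (day - 1):2 * day] + treated[2 * (day - 1):2 * day]
    rng.shuffle(block)  # randomize order within the block
    blocks[day] = block

print(blocks)
```

Because every block contains the same mix of treated and control samples, a day effect can no longer be confounded with the treatment effect.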
Which instruments show drift in time?
GCMS and LCMS (for metabolomics and proteomics)
Where is the order in which the samples are measured defined, and why is it important?
In the measurement design. It is important because in LCMS or GCMS, when the number of samples is large and several batches are needed, instrumental drift causes samples measured at the beginning to be slightly different from those measured at the end of the series.
Why is randomization across batches crucial in LCMS or GCMS?
Because of instrumental drift: when no randomization is performed, an observed difference could be due solely to instrumental drift, and there is bias. The actual results are then not distinguishable from the bias.
In data preprocessing, disturbances that can enter during sampling, sample workup and measurement need to be removed from the data. Which two types of disturbances do we know?
-Disturbances of a whole sample
> different amount of sample measured
> different dilution of samples
> sample workup unequal
> effect of order of measuring (begin/end of day)
-Disturbances of a single variable within a sample (e.g. a single metabolite)
Which methods of preprocessing are used for correction of whole sample disturbances?
Normalization methods
Normalization methods
-Internal standard
-QC samples
Internal standard
A compound that does not occur naturally in the samples, added to each sample in an equal amount
> intensity of the standard has to be the same in all samples
> difference across samples: correction of all variables with same factor
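A minimal sketch of this correction with made-up intensities; the assumption is that column 0 holds the internal standard:

```python
import numpy as np

# Hypothetical intensities: rows = samples, columns = variables.
# Column 0 is the internal standard (IS), added in equal amount everywhere.
X = np.array([[100.0, 50.0, 200.0],
              [ 80.0, 44.0, 150.0],
              [120.0, 55.0, 260.0]])

is_intensity = X[:, 0]
# Per-sample correction factor: scale so the IS has the same intensity
# in every sample; apply that one factor to all variables of the sample.
factor = is_intensity.mean() / is_intensity
X_norm = X * factor[:, None]

print(X_norm[:, 0])  # internal standard now equal across samples
```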
Quality control samples
For correction of instrumental drift
> pooled samples are used
> after every 8 samples a QC sample is measured
> many QC sample measurements over whole day
> intensity of each metabolite should be the same but due to instrumental drift it may not be the same at different time points
> use the differences to correct the studied samples in between the QC samples
Normalization
Correct for different dilutions, e.g. urine is less diluted in the morning: correct using a certain ‘concentration measure’ > values are no longer true concentrations, but the samples are better comparable.
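One common concentration measure is the total intensity of a sample; a minimal sketch with made-up values, where sample 1 is the same as sample 0 but twice as diluted:

```python
import numpy as np

# Hypothetical samples with different dilutions: rows = samples.
X = np.array([[10.0, 20.0, 70.0],
              [ 5.0, 10.0, 35.0],   # same sample as row 0, twice as diluted
              [20.0, 30.0, 50.0]])

# Total-sum normalization: divide each sample by its row sum.
# Values are no longer true concentrations, but samples become comparable.
X_norm = X / X.sum(axis=1, keepdims=True)

print(X_norm)
```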
Which correction methods are used for single variable disturbances due to column aging in LCMS/GCMS?
Alignment methods, for aligning peaks at different retention times in different samples such that it is clear they belong to the same variable (metabolite/protein)
Which correction methods are used when the baseline (background signal) is unequal to zero?
Background correction methods
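A minimal sketch of one such method on a simulated trace (real background correction is usually more sophisticated): fit the baseline on the assumed peak-free edges of the signal and subtract it everywhere:

```python
import numpy as np

# Hypothetical chromatogram: a peak on top of a nonzero, rising baseline.
t = np.linspace(0, 10, 201)
baseline = 5.0 + 0.3 * t                        # background signal != 0
peak = 40.0 * np.exp(-((t - 5.0) ** 2) / 0.5)   # Gaussian peak at t = 5
signal = baseline + peak

# Simple correction: fit a line to the edges of the trace (assumed to be
# peak-free) and subtract the fitted background from the whole signal.
edges = np.r_[0:30, 171:201]
coef = np.polyfit(t[edges], signal[edges], deg=1)
corrected = signal - np.polyval(coef, t)

print(round(float(corrected[0]), 3))  # edge now close to zero
```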
How is clean data stored after data preprocessing?
In a data matrix
> for metabolomics data, after preprocessing and normalization: normalized data matrix: starting point for data analysis and biological interpretation
What can a zero mean in the data matrix?
Not present, or below detection limit.
Are different variables always measured in the same units?
No, although they often are
11.4 data analysis