HC 2 - Data Analysis and Experimental Design (+Preprocessing) Flashcards
Lecture 2
Data analysis pipeline
Biological question
> Experimental design
-Power analysis
-Treatment design
> Data acquisition
-QC strategy
-Measurement design
> Data pre-processing
-Normalisation
-Quantification
> Metabolite identification
> Statistical Data analysis
-Explorative
-Predictive
-Hypothetical biomarkers
> Biological interpretation
-MSEA
-Pathway analysis
Parts of the experimental design and data collection
-Frame a biological question
> testable with statistical analysis
-Design factor (controlled, drug vs placebo) / observed factor (not controlled, like health vs ill)
> different treatments (levels) / select from predefined groups
-Identify noise factors (confounding)
Confounding noise factors
Other sources of variation that could have an effect on the study like sex, medication, bmi
How are individuals assigned to a level of a design factor, e.g. placebo or drug?
At random
Design: random, blocks and replicate. Random: Blind vs Double Blind vs Triple Blind
-Blind: treatment (placebo or drug) is assigned at random and the individuals do not know what they got
-Double blind: neither the individuals nor the researchers know who got placebo or drug
-Triple blind: the data analysts also do not know who got placebo or drug
Type of questions
- Designed: Detection of responsive features (genes, proteins, metabolites) under controlled experimental conditions (perturbation study, causal relationships) > h0: gene unperturbed = gene perturbed > which genes are affected by treatment
- Biomarkers: Detection of biomarkers (observational, difference between patient and control) > h0: gene patient = gene control > we don't know if the difference is caused by the disease
- Regulation: Identification of regulatory or mechanistic relationships between features > no relationship vs relationship (linear, exponential) > associations, correlations, more explorative analysis > measure if the correlation between metabolites or genes has changed
Noise factors
Factors that disturb correct estimation of the effect of the experimental factor, like time, temperature, gender and age.
Controlling noise factors
Taking only one gender or constant temperature
Ways to take not controllable noise factors into account:
Randomization, blocking and replication
Randomization
-Random assignment of treatments to different individuals
-Randomize the order of experiments over time (so that time has no systematic effect)
-Randomize sample over batches/ slides: do not measure all controls in one batch and then the treated samples in a separate batch.
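The randomization steps above can be sketched in a few lines. This is a minimal illustration, not a prescribed protocol: the function name `randomize_design`, the sample ids and the fixed seed are all hypothetical choices made for the example.

```python
import random

def randomize_design(sample_ids, treatments, seed=0):
    """Randomly assign treatments (balanced) and draw a random measurement order."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    # Balanced assignment: cycle through the treatment labels, then shuffle them.
    labels = [treatments[i % len(treatments)] for i in range(len(sample_ids))]
    rng.shuffle(labels)
    assignment = dict(zip(sample_ids, labels))
    # Separate random run order, so controls and treated samples are mixed
    # over time and over batches instead of being measured group by group.
    run_order = list(sample_ids)
    rng.shuffle(run_order)
    return assignment, run_order

assignment, run_order = randomize_design(
    [f"S{i}" for i in range(1, 9)], ["placebo", "drug"]
)
print(assignment)
print(run_order)
```

Batches can then be filled by cutting `run_order` into consecutive chunks, which keeps treated and control samples spread over the batches.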
When are blocks of experiments made?
If:
-Not all experiments can be done in one day
-Measured levels could be different for specific groups > e.g. men and women because that is a confounding factor
Blocking over days
Fix the samples over days: same amount of cases and controls each day
> Randomize samples within days
> ignore the effect of day
Blocking over groups
Fix treated/controls equal over groups (men/women)
> Randomize treatment / control within group
> ignore the effect of groups
Which effect is stated irrelevant when blocking, and has to be removed?
The block effect (the difference between blocks)
> correction > remove average block effect from data
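A minimal sketch of removing the average block effect, assuming a balanced design (the same mix of cases and controls in every block) so that block means differ only because of the block itself. The function name and the example numbers are hypothetical.

```python
from statistics import mean

def remove_block_effect(values, blocks):
    """Subtract each block's mean and add back the overall mean."""
    overall = mean(values)
    block_means = {
        b: mean([v for v, vb in zip(values, blocks) if vb == b])
        for b in set(blocks)
    }
    return [v - block_means[b] + overall for v, b in zip(values, blocks)]

# Hypothetical data: day 2 measures about 10 units higher than day 1.
values = [10.0, 12.0, 20.0, 22.0]
blocks = ["day1", "day1", "day2", "day2"]
corrected = remove_block_effect(values, blocks)
print(corrected)
```

After the correction both days have the same mean, while the differences within each day are untouched.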
The rule for blocking: fix over the blocks, but … within the blocks
randomize
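The rule "fix over the blocks, randomize within the blocks" can be sketched as follows. The function name, group labels and ids are hypothetical, and the seed is fixed only for reproducibility.

```python
import random

def blocked_randomization(blocks, treatments, seed=0):
    """blocks: dict mapping block name -> list of individual ids.
    Returns a dict mapping individual id -> treatment."""
    rng = random.Random(seed)
    assignment = {}
    for members in blocks.values():
        # Fixed over the blocks: every block gets the same balanced set of labels.
        labels = [treatments[i % len(treatments)] for i in range(len(members))]
        # Randomized within the block: shuffle who receives which label.
        rng.shuffle(labels)
        for member, label in zip(members, labels):
            assignment[member] = label
    return assignment

groups = {"men": ["M1", "M2", "M3", "M4"], "women": ["W1", "W2", "W3", "W4"]}
assignment = blocked_randomization(groups, ["control", "treated"])
print(assignment)
```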
Replication
-Replicate measurements
-Repeat analysis to decrease the influence of biological variation and/or analytical variation on the estimates
Types of replication
-Measure more individuals per group
-Repeat treatment for an individual
-Measure sample multiple times
The mean is better estimated when more measurements are performed. Why?
Because the influence of outliers and coincidence becomes smaller
Repeatability
The degree of agreement between measurements conducted on the same sample in the same location by the same people
> which value to expect when repeating data collection from the analysis step
Reproducibility
The degree of agreement between measurements conducted on replicate samples in different locations by different people.
> which value to expect when repeating data collection from the sample collection from the same patients
Biological variability
Variation between individuals in the same group: offset and effect-size within individuals between biological experiments
> which value to expect when repeating the data collection from patient selection
Analytical variation includes:
-Bias: mean value is not equal to actual value
-Repeatability and reproducibility
On what is the number of individuals selected per group based?
On the statistical power cutoff: when there is a difference between groups: how often can we detect it?
How to increase power of a test?
-Increase effect size
-Decrease SDx (standard deviation of the measurements) > improve measurement
-Decrease SEMx (standard error of the mean) > increase n (more replicates)
-SEM = SD / sqrt(n)
Standard error of the mean formula
SEM = SDx / sqrt(n)
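A small worked example of the formula, using hypothetical measurements. The helper names `sem` and `sem_from_sd` are made up for this sketch.

```python
from math import sqrt
from statistics import stdev

def sem(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return stdev(values) / sqrt(len(values))

def sem_from_sd(sd, n):
    """Same formula when the SD is already known."""
    return sd / sqrt(n)

measurements = [4.0, 6.0, 5.0, 7.0, 3.0]  # hypothetical replicate values
print(sem(measurements))

# With the SD held constant, four times as many replicates halve the SEM.
print(sem_from_sd(2.0, 4), sem_from_sd(2.0, 16))
```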
Power cutoff value
> 80%
What is alpha, beta and the power?
Alpha: chance that H0 is rejected although it is true, i.e. a difference is called when there is none (type I error; with a 5% cutoff this chance is considered low enough to call a difference)
Beta: chance that H0 is not rejected although it is false, i.e. a real difference is missed (type II error)
Power = 1 - Beta
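To see how effect size, SD and n work together, here is a rough sketch that approximates the power of a two-group comparison of means with a normal (z) approximation. This is an assumption made for illustration: a real power analysis for a t-test would use the noncentral t distribution, and the function name `approx_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two group means.

    effect_size is standardized: (difference between group means) / SD.
    The small contribution of the opposite tail is ignored.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between two means, in SD units.
    se_diff = sqrt(2 / n_per_group)
    return z.cdf(effect_size / se_diff - z_crit)

for n in (4, 8, 16, 32):
    print(n, round(approx_power(1.0, n), 2))
```

Under this approximation, an effect of one SD needs about 16 individuals per group to pass the 80% power cutoff. A larger effect or a smaller SD (both increase `effect_size`) reaches it with fewer.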
Types of design
-Parallel design
-Repeated measures design
Parallel design
-Measure both groups on the same time point
-Individuals are tested at one treatment
-Used when small ‘between individual’ variation (variations between individuals of the same groups)
-Test between individuals of different groups measured at same time point with t-test or ANOVA (comparison of means)
-Within group - variation is much smaller than between group variation
The reliability of a parallel design is dependent on the …
variation within a group
Parallel designs are used for … studies
animal
Repeated measures design
-Every individual gets both treatments with a time interval between the two treatments
-Use the same individuals for multiple treatments
> determine before and after treatment values after both treatment 1 and 2
When is a repeated measures design used?
When the ‘between individual’ variation is large
> variation between individuals expected large compared to variation due to treatment (within individual)
> no bias between individuals because of the correction (due to between individuals variation) > more significant results
Which design corrects for the time effect of repeated measures?
Cross-over design: random treatment order assignment
Which values are taken for comparison in repeated measures?
Means of the difference values
> the correction leads to less noise and more reliability
For which kinds of studies is repeated measures used?
Human studies
Parallel vs repeated measures
-Equal effect
-Noise (standard deviation) is different: the between-individual variation is removed in a repeated measures design, but stays part of the noise in a parallel design
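A sketch with hypothetical numbers makes the comparison concrete. The same ten values are analysed once as if they came from two independent groups (parallel) and once as before/after pairs from the same five individuals (repeated measures).

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical data: large differences between individuals, small treatment effect.
before = [10.0, 20.0, 30.0, 40.0, 50.0]
after = [12.0, 22.0, 31.0, 43.0, 52.0]
n = len(before)

# Parallel view: compare group means; the noise includes between-individual variation.
effect_parallel = mean(after) - mean(before)
noise_parallel = sqrt(stdev(before) ** 2 / n + stdev(after) ** 2 / n)

# Repeated measures view: use the mean of the per-individual differences.
diffs = [a - b for a, b in zip(after, before)]
effect_paired = mean(diffs)
noise_paired = stdev(diffs) / sqrt(n)

print(effect_parallel, noise_parallel)
print(effect_paired, noise_paired)
```

The estimated effect is the same in both views, but the standard error is much smaller for the differences, which is why the repeated measures analysis is more sensitive here.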
In a multivariate data matrix: what are the rows and what are the columns?
Rows: individuals, samples or countries
Columns (variables/features): metabolites, genes, qualitative (male/female) / quantitative
What do these values mean in the multivariate data matrix: NaN, unexpected negative values, 0 values, outliers
-NaN; not a number
-Unexpected negative value: for example negative value for intensity
-0 values: below the detection limit perhaps
-Outliers: value differs a lot from the range of the other values of the column
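A minimal sketch of flagging these four kinds of values in one column of the data matrix. The outlier rule used here (more than three scaled median absolute deviations from the median of the positive values) is an assumption for illustration, not a rule from the lecture. The function name `flag_column` is hypothetical.

```python
from math import isnan
from statistics import median

def flag_column(values, n_mads=3.0):
    """Return the row indices of NaN, negative, zero and outlying values."""
    flags = {"nan": [], "negative": [], "zero": [], "outlier": []}
    # Outliers are judged against the regular (positive, non-missing) values only.
    positives = [v for v in values if not isnan(v) and v > 0]
    med = median(positives) if positives else 0.0
    mad = median([abs(v - med) for v in positives]) if positives else 0.0
    for i, v in enumerate(values):
        if isnan(v):
            flags["nan"].append(i)
        elif v < 0:
            flags["negative"].append(i)
        elif v == 0:
            flags["zero"].append(i)
        elif mad > 0 and abs(v - med) > n_mads * 1.4826 * mad:
            flags["outlier"].append(i)
    return flags

column = [5.0, 5.2, 4.9, float("nan"), 0.0, -1.0, 50.0, 5.1]  # hypothetical intensities
flags = flag_column(column)
print(flags)
```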
Disturbances of a whole sample
-Amount of sample is different
-Some samples are more diluted than others
-Order of measuring affects measurement
Dilution of samples could be different, e.g. urine: why is correction needed?
to remove systematic variation between experimental conditions unrelated to the biological differences (dilutions, mass)
> due to drugs, disease, day/night rhythm the urine amount can change (and therefore dilution)
Metabolite levels are considered … and gene expression values are …
Metabolite levels: quantitative
- Normal distribution assumed
- T-test / ANOVA
Gene expression: counts
- Poisson or negative binomial distribution
- not symmetric
- special tests that correct distribution
Sample normalization
Differences between individuals due to metabolic differences and dilution differences
> Corrected by a correction value
Sample normalization corrections
By
-A reference compound originally in sample (e.g. creatinine in urine)
-Total sum or total peak area
> peak area/height / total sum
-Dry mass, volume etc
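The first two corrections can be sketched as follows, with a hypothetical urine sample measured once concentrated and once diluted twofold. Function and compound names are illustrative only.

```python
def normalize_total_sum(peaks):
    """Divide every peak area by the total peak area of the sample."""
    total = sum(peaks.values())
    return {name: area / total for name, area in peaks.items()}

def normalize_to_reference(peaks, reference="creatinine"):
    """Divide every peak area by the area of a reference compound in the sample."""
    ref = peaks[reference]
    return {name: area / ref for name, area in peaks.items() if name != reference}

concentrated = {"creatinine": 10.0, "metabolite_A": 20.0, "metabolite_B": 70.0}
diluted = {name: area / 2 for name, area in concentrated.items()}

print(normalize_total_sum(concentrated), normalize_total_sum(diluted))
print(normalize_to_reference(concentrated), normalize_to_reference(diluted))
```

Both corrections give identical profiles for the concentrated and the diluted sample, so the dilution difference no longer shows up as a difference between the samples.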
Disadvantages with the correction values
-Sum of total peak area: problem with changing profiles of large peaks that differ over individuals (could be relevant)
-Creatinine: breakdown product from muscle metabolism, which highly depends on muscle mass (male/female, child/adult)
-Volume: total amount of compounds (=concentration x volume) is used (not concentration) > problem with women not emptying their bladder completely.
Disturbances of single feature of a sample
-Alignment problems due to aging of chromatographic column
-Wrong baseline measurement: unequal to 0
-Not the whole array has the same quality
What is an internal standard?
A compound added in fixed amounts to the sample before sample workup
Internal standards correction
-Peak height for the internal standard should be equal in all samples (otherwise something is wrong with the sample and correction is needed)
-To correct for variation in sample workup/measurement.
-The internal standard is expected to behave similarly to the other features in the sample
-Ratio feature / internal standard is expected to stay constant
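A minimal sketch of the ratio correction, using hypothetical peak heights for the same sample measured twice, where the second measurement lost 20% of the signal during workup or measurement. The names are illustrative.

```python
def correct_with_internal_standard(peaks, internal_standard="IS"):
    """Express every feature as a ratio to the internal standard peak."""
    is_peak = peaks[internal_standard]
    return {
        name: height / is_peak
        for name, height in peaks.items()
        if name != internal_standard
    }

run_1 = {"IS": 100.0, "metabolite_A": 50.0, "metabolite_B": 200.0}
run_2 = {"IS": 80.0, "metabolite_A": 40.0, "metabolite_B": 160.0}  # 20% signal loss

print(correct_with_internal_standard(run_1))
print(correct_with_internal_standard(run_2))
```

Because the loss affects the internal standard and the other features equally, the ratios are identical in both runs.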
Quality Control (QC) samples
-To check variation in instrument over time
-Because QC sample is always the same, the measured signal is expected equal for ALL compounds in the sample
-Something is wrong with the analysis machine if the peak value for QC samples isn’t constant
When are QC samples measured?
After every 8-10 study samples
How are study samples corrected with QCs?
Use trends in the QC signals to correct the study samples in between QCs
What are QC samples?
Pooled samples (combination of all study samples)
Which compound is always added to the QC sample?
The internal standard
What does the QC check?
If the ratio (compound peak / IS peak) is constant
QC correction corrects
Between batch difference and within batch differences due to drift > correction factors for QCs are applied to samples for each metabolite
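A rough sketch of such a correction for one metabolite, assuming linear interpolation between the two QC injections that bracket each study sample and the mean QC signal as reference level. Both choices are assumptions for illustration, and other trend models are possible. Function name, positions and signals are hypothetical.

```python
def qc_drift_correct(sample_positions, sample_signals, qc_positions, qc_signals):
    """Correct one metabolite for drift using bracketing QC injections.

    Positions are injection-order numbers; qc_positions must be sorted and
    every study sample must lie between two QC injections.
    """
    reference = sum(qc_signals) / len(qc_signals)
    qcs = list(zip(qc_positions, qc_signals))
    corrected = []
    for pos, signal in zip(sample_positions, sample_signals):
        for (p0, q0), (p1, q1) in zip(qcs, qcs[1:]):
            if p0 <= pos <= p1:
                # QC signal expected at this position if the drift is linear.
                expected_qc = q0 + (q1 - q0) * (pos - p0) / (p1 - p0)
                corrected.append(signal * reference / expected_qc)
                break
        else:
            raise ValueError(f"sample at position {pos} is not bracketed by QCs")
    return corrected

# The QC signal drops from 100 to 80 over the run; two study samples with the
# same true level are measured at positions 2 and 8.
corrected = qc_drift_correct([2, 8], [48.0, 42.0], [0, 10], [100.0, 80.0])
print(corrected)
```

The two study samples end up at the same corrected value. Running the function separately per metabolite gives the per-variable correction, since each metabolite may drift differently.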
Name three different characteristics between IS and QC
IS : Same effect on every other peak
QC: Study samples in between two QC samples are corrected
IS: Every sample has an internal standard added
QC: in QC sample all peaks are present
IS: Assumed that if the IS is 2 times higher, the other peaks (all) are 2 times higher
QC: Peak A in QC sample can go up and peak B can stay the same (correction per variable)
In large studies IS and QC samples are combined. Why?
For optimal correction of instrumental drift
Which corrections should be used where: when samples already differ beforehand, for sample workup and instrument variation of the same samples, and for instruments?
Samples differ beforehand: normalisation with creatinine, total amount or volume
Sample workup and instrument: IS
Instrument: QC
What is sample workup?
From the sample to the measurement in the machine