Metabolomics 5 - Basic statistics Flashcards
Omics Data Analysis
DATA PROCESSING & QC
Omics data
- NMR
- Mass spectrometry
STATISTICAL ANALYSIS AND VISUALIZATION
- Comparison, clustering, classification
FUNCTIONAL INTERPRETATION
Omics-specific
- enrichment analysis
- pathway analysis
UNIQUE FUNCTIONS
Field specific
- dose response
- biomarker analysis
Types of Metabolomics Data
Raw data (fingerprinting)
* No information on metabolites
* Use raw NMR spectra or MS data
* Long-time standard in NMR
* Goal: Derive classes and identify markers
* STOCSY for correlation tests
Metabolite concentrations (lists of compounds with concentration values)
* MS and NMR analyses now produce lists of metabolite concentrations
* Concentrations can be used for univariate tests
* Concentrations can also be used in very specific profiling
* Correlations and covariances are commonly used
Common Terms
Dimension
* The number of variables (metabolites, peaks)
Univariate:
* Analysing one variable per subject
Multivariate
- Analysing many variables per subject
- Omics data are usually high-dimensional data
Basic statistical terms
Mean
- synonyms: average
Median
- the value that one-half of the data lies above and below
- synonyms: 50th percentile
Variance
- the sum of squared deviations from the mean divided by n-1 where n is the number of data values
- synonyms: mean squared deviation
Order statistics
* Metrics based on the data values sorted from smallest to largest.
* Synonyms: ranks
Percentile
* The value such that P percent of the values take on this value or less and (100–P) percent take on this value or more.
* Synonyms: quantile
Interquartile range
- The difference between the 75th percentile and the 25th percentile.
- Synonyms: IQR
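A minimal sketch of these terms in Python (NumPy), using hypothetical concentration values:

```python
import numpy as np

# Hypothetical metabolite concentrations for one compound across 8 samples
x = np.array([4.1, 4.7, 5.0, 5.2, 5.5, 5.9, 6.1, 9.8])

mean = x.mean()                      # average
median = np.median(x)                # 50th percentile
variance = x.var(ddof=1)             # sum of squared deviations / (n - 1)
stddev = x.std(ddof=1)               # same units as the original data
q1, q3 = np.percentile(x, [25, 75])  # order statistics
iqr = q3 - q1                        # interquartile range (IQR)
```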
Variance
Why square?
- Eliminates negatives
- Parabolic behaviour: contribution increases the further a value lies from the mean
Standard Deviation:
- stddev = sqrt(variance)
- Shows variation about the mean, in the same units as the original data
The standard deviation is much easier to interpret than the variance since it is on the same scale as the original data. Still, with its more complicated and less intuitive formula, it might seem peculiar that the standard deviation is preferred in statistics over the mean absolute deviation. It owes its preeminence to statistical theory: mathematically, working with squared values is much more convenient than working with absolute values.
Why divide by (n-1), not n?
- If you knew the sample mean and n-1 of the data values, the last value would be fixed: only n-1 values are free to vary
=> n-1 degrees of freedom
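A short sketch of the difference, with a hypothetical sample:

```python
import numpy as np

x = np.array([4.1, 4.7, 5.0, 5.2, 5.5])  # hypothetical sample
n = len(x)

# Deviations are taken from the sample mean, which itself was estimated
# from the data, so only n - 1 values are free to vary.
biased = ((x - x.mean()) ** 2).sum() / n          # np.var(x, ddof=0)
unbiased = ((x - x.mean()) ** 2).sum() / (n - 1)  # np.var(x, ddof=1)
```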
Box-and-whisker plot
- The 1st quartile Q1 is the value for which 25% of the observations are smaller and 75% are larger
- Q2 is the same as median (50% are smaller and 50% larger)
- Q3 only 25% of the observations are larger
- Inter Quartile Range (IQR) is Q3-Q1. It covers 50% of the observations
Percentiles
In general the nth percentile is a value such that n% of the observations fall at or below it
Q1 = 25th percentile
Median (Q2) = 50th percentile
Q3 = 75th percentile
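A sketch computing the quartiles behind a box-and-whisker plot, on hypothetical data:

```python
import numpy as np

x = np.array([3.9, 4.4, 4.8, 5.1, 5.3, 5.8, 6.2, 11.0])  # hypothetical data

q1, q2, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1

# A common box-plot convention (assumed here, not stated on the slide):
# whiskers reach the last points within 1.5 * IQR of the box;
# values beyond the fences are drawn as outliers.
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower_fence) | (x > upper_fence)]
```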
Other common distributions
- unimodal
- bimodal
- skewed
Mean vs median - which is best?
- Mean is best for symmetric distributions without outliers
- Median is useful for skewed distributions of data with outliers
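A quick illustration with hypothetical values, showing how one outlier drags the mean but barely moves the median:

```python
import numpy as np

x = np.array([5.0, 5.2, 5.4, 5.6, 5.8])  # symmetric, no outliers
x_out = np.append(x, 50.0)               # add one extreme outlier

print(x.mean(), np.median(x))            # 5.4 5.4  -> mean and median agree
print(x_out.mean(), np.median(x_out))    # ~12.8 5.5 -> mean dragged by the outlier
```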
From samples to populations
So how do we know whether the effect observed in our sample was genuine?
- We don’t
Instead we use p values to indicate our level of uncertainty that our results represent a genuine effect present in the whole population
p-Values
- P-value: the probability that the observed result (or a more extreme one) was obtained by chance alone, i.e. assuming the null hypothesis H0 is true
- If that probability (p-value) is small, the observed result cannot easily be explained by chance
- Because the p-value assumes the null hypothesis is correct, a small value lets you reject the null hypothesis in favour of the alternative hypothesis
- A large p-value means the observed data align with the null hypothesis, making it the more likely explanation
Hypothesis Testing
The null hypothesis H0
- No statistical significance between an observed result and the data set to which it belongs
- There is no difference between the case and control groups
- H0: μ1-μ2 = 0
The alternative hypothesis
- Opposite of the null hypothesis
- Hypothesis with statistical significance
- Generally the hypothesis that is believed by the researcher
- HA: μ1-μ2 ≠ 0
p Values and level of significance
- P-value: probability of the observed result (or a more extreme one) occurring, assuming that the null hypothesis is true
- Level of significance, ɑ: specified threshold that defines the rejection region
- Rejection region: all values for which H0 will be rejected (outside the red lines in the figure)
- Between the lines: 95% probability that a value is > the left line and < the right line
How to calculate p-values:
- Add up the areas under the curve in the tails (outside the lines) and divide by the total area under the curve
-> In other words, there is a 95% probability that each time we measure a Brazilian woman, her height will be between 142 and 169 cm.
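A sketch of this tail-area calculation, assuming a normal model with hypothetical parameters chosen to match the slide's 142-169 cm interval:

```python
from scipy.stats import norm

# Assumed parameters: mean 155.5 cm and sd ~6.9 cm put ~95% of the
# distribution between 142 and 169 cm, as on the slide.
mu, sigma = 155.5, 6.9

# Two-sided p-value for an observed height of 170 cm:
# add up the areas under the curve in both tails.
x = 170.0
p = norm.sf(x, mu, sigma) + norm.cdf(mu - (x - mu), mu, sigma)
```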
Empirical p-values
- Parametric: p-values are based on well-defined models, e.g. Gaussian or Poisson distributions
- What if we don't know the distribution?
-> The only thing we know is that the data do not follow a normal distribution
- We can derive the null distribution from the data itself, then calculate the p-value
-> Also known as empirical p-values
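One common way to build a null distribution from the data itself is a permutation (label-shuffling) test; a minimal sketch with hypothetical non-normal groups:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.lognormal(1.0, 0.5, size=20)  # hypothetical case group (non-normal)
b = rng.lognormal(1.3, 0.5, size=20)  # hypothetical control group
observed = a.mean() - b.mean()

# Build the null distribution by shuffling the group labels many times.
pooled = np.concatenate([a, b])
null = np.empty(10_000)
for i in range(null.size):
    rng.shuffle(pooled)
    null[i] = pooled[:20].mean() - pooled[20:].mean()

# Empirical two-sided p-value: fraction of label-shuffled differences
# at least as extreme as the observed difference.
p = (np.abs(null) >= abs(observed)).mean()
```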
One sample t-test
- The one-sample t-test is used to compare the mean m of one sample to a known standard (or theoretical/hypothetical) mean (μ)
- t = (m − μ) / (s / √n), where m = sample mean, n = sample size, μ = theoretical mean, s = standard deviation
Research question:
- Is the mean (m) of the sample equal to the theoretical mean (μ)?
- Is the mean (m) of the sample less than or greater than the theoretical mean (μ)?
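A sketch of the test on a hypothetical sample, computing the t-statistic by hand and via SciPy:

```python
import numpy as np
from scipy import stats

x = np.array([4.8, 5.1, 5.3, 5.6, 5.9, 6.2])  # hypothetical sample
mu = 5.0                                       # theoretical mean

# Manual t-statistic, matching t = (m - mu) / (s / sqrt(n))
m, s, n = x.mean(), x.std(ddof=1), len(x)
t_manual = (m - mu) / (s / np.sqrt(n))

# SciPy returns the same t plus a two-sided p-value
t, p = stats.ttest_1samp(x, popmean=mu)
```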
Two samples t-test -> unpaired t-test
- The unpaired two-samples t-test is used to compare the means of two independent groups.
- Example: Measured weight of 100 individuals: 50 women (group A) and 50 men (group B). We want to know if the mean weight of women (mA) is significantly different from that of men (mB).
-> Two unrelated (i.e., independent or unpaired) groups of samples. Therefore, it's possible to use an independent t-test to evaluate whether the means are different (see the sketch below).
- Research question:
-> Is the mean of group A (mA) equal to the mean of group B (mB)?
-> Is the mean of group A (mA) less than or greater than the mean of group B (mB)?
- Classical t-test:
-> If the variances of the two groups are equivalent (homoscedasticity)
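A minimal sketch with hypothetical weights (small groups for brevity, not the slide's 50 + 50):

```python
import numpy as np
from scipy import stats

weights_a = np.array([62.0, 65.5, 58.3, 70.1, 64.8, 61.2])  # hypothetical women (group A)
weights_b = np.array([78.2, 82.5, 75.0, 88.1, 80.4, 84.3])  # hypothetical men (group B)

# Unpaired two-samples t-test; equal_var=True gives the classical
# Student version with a pooled variance.
t, p = stats.ttest_ind(weights_a, weights_b, equal_var=True)
```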
Two samples t-test -> Classical t-test
- If the variances of the two groups are equivalent (homoscedasticity):
t = (mA − mB) / √(S² (1/nA + 1/nB))
- mA and mB represent the mean values of groups A and B, respectively; nA and nB represent the sizes of groups A and B, respectively.
- S² is an estimator of the pooled variance of the two groups:
S² = (Σ(x − mA)² + Σ(x − mB)²) / (nA + nB − 2)
Two samples t-test -> Welch t-statistics
- If the variances of the two groups being compared are different (heteroscedasticity), it's possible to use the Welch t-test, an adaptation of the Student t-test (no pooled variance S):
t = (mA − mB) / √(sA²/nA + sB²/nB)
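A sketch contrasting the two variants on hypothetical groups with unequal spread; the pooled variance S² is computed exactly as defined above:

```python
import numpy as np
from scipy import stats

a = np.array([4.9, 5.2, 5.4, 5.7, 6.0])         # hypothetical group A
b = np.array([5.8, 6.9, 7.4, 8.8, 10.1, 11.5])  # hypothetical group B, larger spread

nA, nB, mA, mB = len(a), len(b), a.mean(), b.mean()

# Classical t-test: pooled variance S^2
s2 = (((a - mA) ** 2).sum() + ((b - mB) ** 2).sum()) / (nA + nB - 2)
t_classic = (mA - mB) / np.sqrt(s2 * (1 / nA + 1 / nB))

# Welch t-test: no pooled variance (heteroscedasticity)
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)
```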
ANOVA test
- The one-way analysis of variance (ANOVA), also known as one-factor ANOVA, is an extension of the independent two-sample t-test for comparing means in a situation where there are more than two groups.
- In one-way ANOVA, the data is organised into several groups based on one single grouping variable (also called factor variable).
- ANOVA test hypotheses:
-> Null hypothesis: the means of the different groups are the same
-> Alternative hypothesis: At least one sample mean is not equal to the others.
ANOVA test -> What is calculated?
Assume there are 3 groups (A, B, C) to compare:
- Compute the common variance, called the variance within samples (S²within) or residual variance
- Compute the variance between sample means as follows:
-> Compute the mean of each group
-> Compute the variance between sample means (S²between)
- Produce the F-statistic as the ratio S²between / S²within
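A sketch of these steps on three hypothetical groups, checked against SciPy's one-way ANOVA:

```python
import numpy as np
from scipy import stats

a = np.array([5.1, 5.4, 5.0, 5.6])  # hypothetical group A
b = np.array([6.2, 6.0, 6.5, 6.3])  # hypothetical group B
c = np.array([5.8, 5.5, 6.0, 5.7])  # hypothetical group C

groups = [a, b, c]
grand_mean = np.concatenate(groups).mean()
k, n = len(groups), sum(len(g) for g in groups)

# Variance between sample means and variance within samples
s2_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
s2_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)
f_manual = s2_between / s2_within

# SciPy's one-way ANOVA produces the same F plus its p-value
f, p = stats.f_oneway(a, b, c)
```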