Statistics Flashcards
Inferential statistics
Allows generalisations to be made about a population from a representative sample
What are common problems in biological data
- small sample size
- unequal sample size
- correlation within data (repeated measurements from the same subject over time, or simultaneous measurements from different brain regions, will be correlated)
- unequal variance (heterogeneity)
- non-normal (skewed) distribution
Discrete variables
variables whose values are finite, or countably infinite, within a range
- eg: ‘pain relief’ vs ‘no pain relief’ or subjective rating scales
these values do not carry the same mathematical meaning as continuous variables (thus means have ‘less’ meaning)
continuous variables
variables whose values exist on an infinite continuum/are uncountable
- e.g. frequency, temperature, amplitude, enzyme concentration, receptor density
Binary variables
Yes or No outcomes
Nominal variables
Represents groups with no ‘rank’ or ‘order’ within them
- eg: species, colour, brands
Ordinal variables
Groups that are ranked in a specific order
- eg: Likert scales
Parametric statistics
Assume that the data follows a normal distribution, and that there is equal variance within each group (homogeneity of variance)
Nonparametric statistics
used when the data does not follow a normal/known distribution
- tend to be less statistically powerful
Parametric test used for 2 unpaired groups
Unpaired t-test
Non-parametric test used for 2 unpaired groups
Mann-Whitney U test
Parametric test used for 2 paired groups
Paired t-test
Non-parametric test used for 2 paired groups
Wilcoxon test
Parametric test used for ≥3 unmatched groups
1 way ANOVA
Non-parametric test used for ≥3 unmatched groups
Kruskal-Wallis test
Parametric test used for ≥3 matched groups
Repeated measures ANOVA
Non-parametric test used for ≥3 matched groups
Friedman test
Parametric test used to determine association between two variables
Pearson correlation
Non-parametric test used to determine association between two variables
Spearman correlation
Parametric test used to predict a value for one variable from other(s)
Simple linear/non-linear regression
Multiple linear/non-linear regression
Non-parametric test used to predict a value for one variable from other(s)
Non-parametric regression
Central limit theorem
As the sample size increases, the distribution of sample means approaches a normal distribution, regardless of the shape of the underlying population distribution
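A minimal sketch of the theorem in pure Python (all numbers illustrative): means of samples drawn from a deliberately skewed (exponential) population tighten and normalise as the sample size grows.

```python
import random
import statistics

random.seed(0)

def sample_means(n, reps=2000):
    """Means of `reps` samples of size n drawn from a skewed
    (exponential, mean = 1) population."""
    return [statistics.fmean(random.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

# The spread of the sample means shrinks roughly as 1/sqrt(n), and their
# distribution becomes increasingly normal as n grows.
sd_small = statistics.stdev(sample_means(5))    # roughly 1/sqrt(5)
sd_large = statistics.stdev(sample_means(50))   # roughly 1/sqrt(50)
print(sd_small, sd_large)
```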
The null hypothesis
Assumes that there is no difference between groups
Power
(1 − β)
the probability of correctly rejecting a false null hypothesis
- increasing sample size results in decreased variability, and thus greater power
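The sample-size/power relationship can be sketched by simulation (a one-sample z-test with known SD, chosen so no t critical values are needed; effect size and n values are made up):

```python
import random
import statistics

random.seed(1)

def power_sim(n, effect=0.5, z_crit=1.96, reps=2000):
    """Empirical power of a one-sample z-test (population SD = 1) when the
    true mean is shifted by `effect`; z_crit = 1.96 is two-sided alpha = 0.05."""
    rejections = 0
    for _ in range(reps):
        m = statistics.fmean(random.gauss(effect, 1.0) for _ in range(n))
        z = m / (1.0 / n ** 0.5)          # observed mean / SE of the mean
        if abs(z) > z_crit:
            rejections += 1               # false null correctly rejected
    return rejections / reps

# Larger n -> smaller SE -> greater power at the same effect size
print(power_sim(10), power_sim(40))
```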
α (alpha)
Type 1 error rate; rate of incorrect rejection of null hypothesis
**this is equal to the significance level (which is typically 0.05, meaning 5% probability of falsely rejecting the null hypothesis)
β (beta)
type II error rate; rate of incorrectly failing to reject a false null hypothesis
What metrics cannot be altered to increase power?
- variability, as it is fixed depending on type of data
- type I error
T-test
Ratio of the difference between two groups in relation to a measure of variability (standard error)
Examples of the two types of t-test
- non-paired: comparing cannabis treatment to placebo treatment in different groups
- paired: comparing cannabis treatment to saline treatment in the same group
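The unpaired case can be sketched in pure Python (all group values hypothetical): the t statistic is the mean difference divided by the pooled standard error.

```python
import statistics

def unpaired_t(a, b):
    """t = (difference of means) / (standard error of that difference),
    using a pooled variance (assumes equal variance in both groups)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = (pooled * (1 / na + 1 / nb)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / se

# Hypothetical pain scores: cannabis-treated vs placebo group
t = unpaired_t([4.1, 3.8, 5.0, 4.4], [6.2, 5.9, 6.8, 6.1])
print(t)  # large |t| relative to its df suggests a real group difference
```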
ANOVA
Analysis of variance
Used to determine whether ≥3 means are significantly different
Takes into account variance both between (treatment variance) and within (error variance) groups
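A minimal sketch of the F ratio (treatment variance over error variance) with made-up data:

```python
import statistics

def one_way_f(groups):
    """F = between-group (treatment) variance / within-group (error) variance."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = statistics.fmean(x for g in groups for x in g)
    ss_between = sum(len(g) * (statistics.fmean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical responses under three treatments; the third group clearly differs
f = one_way_f([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(f)  # → 21.0
```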
1 factor ANOVA
Used when examining different treatments on different groups
Repeated measures ANOVA
Used when investigating different treatments on the same group
Multilevel ANOVA
Used when investigating ≥2 independent variables and the interactions between them
- provides a separate F value for each independent variable and interaction
Multiple comparisons
Using multiple t-tests is advised against, as it inflates the type I error rate
- a Bonferroni correction counteracts this
Hence why ANOVAs are preferred for ≥3 groups
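A quick numeric sketch of why this matters (the number of comparisons is illustrative): the family-wise error rate grows with the number of tests, and Bonferroni simply divides alpha by that number.

```python
# With m comparisons each at nominal alpha = 0.05, the chance of at least
# one false positive is 1 - (1 - alpha)^m; Bonferroni divides alpha by m.
alpha, m = 0.05, 6                       # e.g. all pairwise tests among 4 groups
family_error = 1 - (1 - alpha) ** m      # inflated type I error without correction
bonferroni_alpha = alpha / m             # stricter per-test threshold
print(round(family_error, 3), round(bonferroni_alpha, 4))  # → 0.265 0.0083
```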
Post-hoc tests
aka multiple comparisons tests
Used after completing an ANOVA to determine which groups are significantly different
- Dunnett test
- Tukey-Kramer test
Pseudo-replication
occurs when the number of measured values or data points exceeds the number of genuine replicates
- eg: confusing # slices with # animals
Leads to an inflation of sample size, thus artificial inflation of power
Linear mixed model analysis (when used + assumptions)
A statistical method used when data is not independent, and errors are correlated
Assumptions:
- does not assume independence of data
- does not assume balanced design
- does not assume homogenous variance
- assumes random sampling
- covariance structure must be specified
Covariance
Indicates the relationship between two variables
- direction of the relationship is given by the sign (valence) of the covariance
- magnitude is scale-dependent, so it does not quantify the strength (gradient) of the association
Correlation (R)
Measures the degree of association between two variables
- not sensitive to scale
- quantifies strength of correlation
Defined as a number r, where −1 ≤ r ≤ 1
- r > 0: positive correlation
- r = 0: no correlation
- r < 0: negative correlation
** closer to |1| = stronger correlation
Calculated from covariance of (x,y) with respect to individual variances of x,y
- pearsons (parametric)
- spearmans (non-parametric)
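The calculation above can be sketched in pure Python (example values made up): r is the covariance standardised by the two standard deviations, which is what makes it scale-insensitive.

```python
import statistics

def pearson_r(x, y):
    """r = cov(x, y) / (SD_x * SD_y): covariance standardised by the
    individual spreads of x and y, so -1 <= r <= 1 regardless of scale."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

# Perfect positive correlation; rescaling y does not change r
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))        # → 1.0
print(pearson_r([1, 2, 3, 4], [200, 400, 600, 800]))  # → 1.0
```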
R^2
The coefficient of determination:
- A metric of correlation that allows comparison of two correlations
= (variance around mean − variance around line) / variance around mean
= 1 − RSS/TSS
R^2 should be ≥ 0.80
eg: R^2 = 0.80
= the relationship between two variables accounts for 80% of the variation
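A minimal sketch of R² = 1 − RSS/TSS, fitting a least-squares line to made-up data:

```python
import statistics

def r_squared(x, y):
    """Fit y = a + b*x by least squares, then R^2 = 1 - RSS/TSS."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    tss = sum((yi - my) ** 2 for yi in y)                        # total
    return 1 - rss / tss

# Nearly linear data: the line accounts for almost all of the variation
r2 = r_squared([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(r2)
```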
Regression analyses
Statistical method that allows examination of the relationship between 2+ variables of interest through the generation of a line of best fit
- linear or non-linear
- t-tests/ANOVA can be used to determine significance of regression
Sum of squares
Total sum of squares (TSS) = variation of data about the mean
Residual sum of squares (RSS) = variation not explained by the regression line
sum of squared regression (SSR) = variance explained by regression
Simple regression
A statistical method that allows examination of the relationship between two variables of interest
- Calculate residual sum of squares
- Smaller RSS indicates a better fit
- used for all standard curves
T-test for regression
The regression co-efficient (slope) / standard error of slope co-efficient
= b/SE(b)
- can also be expressed as a confidence interval
= b ± t(α)·SE(b); typically set at 95%, meaning we are 95% confident that the interval contains the true slope
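The ratio b/SE(b) can be sketched in pure Python (same hypothetical data style as above; SE(b) = sqrt((RSS/(n−2))/Σ(x−x̄)²)):

```python
import statistics

def slope_t(x, y):
    """t = b / SE(b) for the least-squares slope, where
    SE(b) = sqrt( (RSS / (n - 2)) / sum((x - mean_x)^2) )."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = ((rss / (n - 2)) / sxx) ** 0.5
    return b / se_b

# Tight linear data gives a slope many SEs away from zero
t = slope_t([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(t)
```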
ANOVA for regression
Determines whether the amount of variation accounted for by the regression line (SSR) is greater than the variation NOT explained (RSS)
- signal > noise
Assumptions for t-test/ANOVA for regression
- residuals are normally distributed
- constant variance (SD) of residuals
- independent samples
If these are not fulfilled type I error increases
Non-linear regression
A statistical method that uses calculus and matrix algebra to determine the line of best fit for a non-linear relationship
- requires initial estimated parameters (mean, SD)
- can be used to interpolate values
- useful for obtaining Bmax, Ka, EC50 etc…
Linearising transform
Data can be transformed so that it fits the assumptions for linear regression
eg:
- scatchard plots for binding data
- Lineweaver-Burk plots for enzyme kinetics
- logarithmic plots for kinetic data
THE TRANSFORM DISTORTS THE ERROR
- violates the regression assumptions of normally distributed errors and approximately equal SE at each x value
Issues with Scatchard plots
X (bound drug) is often used to calculate Y (bound/free)
(i.e. the independent variable is part of the dependent variable)
- results in inaccurate Y values
- violates assumptions of linear regression (normal distribution and homoscedasticity; equal variance of errors)
Multivariate statistics
Statistical analyses used when there are multiple dependent and/or independent variables
- used commonly in clinical neuropharmacology
- becoming more common in genomics and proteomics
Multiple linear regression
An equation composed of multiple regression coefficients for different independent variables (x1,x2) but with a single dependent variable (y)
y = b1x1 + b2x2 +… + c
requires adjusted R^2 to take into account multiple variables as a function of sample size
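The adjustment can be sketched directly from the standard formula (the R², n, and p values below are made up): the same raw R² looks worse once more predictors have been 'spent'.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 penalises extra predictors:
    1 - (1 - R^2) * (n - 1) / (n - p - 1), with n samples and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Identical raw fit quality, increasingly penalised as predictors are added
print(adjusted_r2(0.90, n=20, p=2))
print(adjusted_r2(0.90, n=20, p=8))
```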
Multi-collinearity
Occurs when regression variables are highly correlated, resulting in inflated variance estimates through the sums of squares
- inaccurate coefficients
- can lead to a significant F value but no significant differences between any specific groups
The highly correlated variable should be removed as they are REDUNDANT
Principal component analysis
- identifies the most important features (principal components) that contribute to variation
- Plots these variables in order of importance according to ‘eigenvalue’
- the second PC is always perpendicular (orthogonal) to the first
Discriminant analysis
A statistical method that helps you to identify the most important variables that distinguish the different groups in data
- Principal component analysis
- Factor analysis
Factor analysis
Used to simplify complex data by identifying common factors that explain the relationships between dependent variables
Random forest classification
A machine learning method that utilises multiple ‘decision trees’ and finds the average to give a final result
- can be used to determine how good an independent variable is at predicting dependent
- error plateaus after ~100 trees
Eigenvalues
‘components’ or ‘factors’ (mathematically known as ‘roots’) that explain most of the variation in the data
- In an analogous way to ANOVA, these eigenvalues represent the major sources of variation in the covariance matrix
Cluster analysis
An exploratory technique often used on very large data sets to show variables that typically vary together, i.e. have a relationship
- Results often shown using a ‘dendrogram’
- often requires a ‘z’ transform
- Different algorithms can be used to determine clusters
Canonical correlation analysis
Used to identify and measure the associations between two sets of variables. Appropriate in the same situations as multiple regression, but where there are multiple intercorrelated outcome variables.
Non-parametric multivariate analysis
- few assumptions about data
eg: random forest classification/regression, PCA, and cluster analysis
Scree plot
Way of interpreting data from PCA
- plots each principal component in order based on amount of variation that it explains (eigenvalue)
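For two variables, the eigenvalues a scree plot ranks can be computed by hand from the 2×2 covariance matrix (closed-form sketch in pure Python; data made up): PC1's eigenvalue is the variance explained along the first principal component.

```python
import statistics

def cov2_eigenvalues(x, y):
    """Eigenvalues (largest first) of the 2x2 covariance matrix of (x, y),
    via the closed form: trace/2 +/- sqrt((trace/2)^2 - determinant)."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    vx = sum((a - mx) ** 2 for a in x) / (n - 1)
    vy = sum((b - my) ** 2 for b in y) / (n - 1)
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    tr, det = vx + vy, vx * vy - cxy ** 2
    root = ((tr / 2) ** 2 - det) ** 0.5
    return tr / 2 + root, tr / 2 - root

# Strongly correlated data: PC1 dominates the scree plot, PC2 is near zero
l1, l2 = cov2_eigenvalues([1, 2, 3, 4], [1.1, 2.0, 2.9, 4.2])
print(l1, l2)
```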