Statistics Flashcards

1
Q

Inferential statistics

A

Allows generalisations to be made about a population from a representative sample

2
Q

What are common problems in biological data?

A
  • small sample size
  • unequal sample size
  • correlation within data (repeated measurements from a subject over time, or simultaneous measurements from different brain regions, will be correlated)
  • unequal variance (heterogeneity)
  • non-normal (skewed) distribution
3
Q

Discrete variables

A

variables whose values are finite, or countably infinite, within a range
- eg: ‘pain relief’ vs ‘no pain relief’ or subjective rating scales

these numbers do not carry the same arithmetic meaning as continuous variables (thus means have ‘less’ meaning)

4
Q

Continuous variables

A

variables whose values exist on an infinite continuum/are uncountable
- e.g. frequency, temperature, amplitude, enzyme concentration, receptor density

5
Q

Binary variables

A

Yes or No outcomes

6
Q

Nominal variables

A

Represents groups with no ‘rank’ or ‘order’ within them
- eg: species, colour, brands

7
Q

Ordinal variables

A

Groups that are ranked in a specific order
- eg: likert scales,

8
Q

Parametric statistics

A

Statistical methods that assume the data follow a normal distribution, and that there is equal variance within each group (homogeneity of variance)

9
Q

Nonparametric statistics

A

used when the data does not follow a normal/known distribution
- tend to be less statistically powerful

10
Q

Parametric test used for 2 unpaired groups

A

Unpaired t-test

11
Q

Non-parametric test used for 2 unpaired groups

A

Mann-Whitney U test

12
Q

Parametric test used for 2 paired groups

A

Paired t-test

13
Q

Non-parametric test used for 2 paired groups

A

Wilcoxon test

14
Q

Parametric test used for ≥3 unmatched groups

A

One-way ANOVA

15
Q

Non-parametric test used for ≥3 unmatched groups

A

Kruskal-Wallis test

16
Q

Parametric test used for ≥3 matched groups

A

Repeated measures ANOVA

17
Q

Non-parametric test used for ≥3 matched groups

A

Friedman test

18
Q

Parametric test used to determine association between two variables

A

Pearson correlation

19
Q

Non-parametric test used to determine association between two variables

A

Spearman correlation

20
Q

Parametric test used to predict a value for one variable from other(s)

A

Simple linear/non-linear regression
Multiple linear/non-linear regression

21
Q

Non-parametric test used to predict a value for one variable from other(s)

A

Non-parametric regression

22
Q

Central limit theorem

A

As the sample size increases, the distribution of sample means approaches a normal distribution, regardless of the shape of the underlying distribution
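
A minimal Python sketch of this idea, assuming numpy and a made-up skewed population: the means of repeated samples become more normal (less skewed) as n grows.

```python
# Illustrative simulation of the central limit theorem (made-up data):
# the distribution of sample means from a skewed population becomes
# approximately normal as the sample size n increases.
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # heavily skewed

for n in (2, 5, 30):
    # 10,000 repeated samples of size n, reduced to their means
    means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    skew = np.mean((means - means.mean()) ** 3) / means.std() ** 3
    print(f"n={n:2d}  skewness of sample means = {skew:.2f}")  # shrinks toward 0
```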

23
Q

The null hypothesis

A

Assumes that there is no difference between groups

24
Q

Power

A

(1 - β)
the probability of correctly rejecting the null hypothesis when it is false
- increasing sample size results in decreased variability, and thus greater power
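
A hedged sketch of a power calculation for an unpaired t-test with statsmodels; the effect size and α values here are illustrative, not from the card.

```python
# Solve for the per-group sample size that gives 80% power (1 - beta = 0.8)
# at alpha = 0.05, for an assumed medium effect size of 0.5.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"~{n_per_group:.0f} subjects per group")  # larger n -> greater power
```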

25
Q

α (alpha)

A

Type I error rate; the rate of incorrect rejection of the null hypothesis
**this is equal to the significance level (typically 0.05, meaning a 5% probability of falsely rejecting the null hypothesis)

26
Q

β (beta)

A

Type II error rate; the rate of incorrect acceptance (failure to reject) of the null hypothesis

27
Q

What metrics cannot be altered to increase power?

A
  • variability, as it is fixed by the type of data
  • type I error
28
Q

T-test

A

The ratio of the difference between two group means to a measure of variability (the standard error)
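
A minimal scipy sketch with made-up data, showing the unpaired and paired forms of the test:

```python
# Unpaired vs paired t-tests on illustrative (randomly generated) groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(10.0, 2.0, size=12)  # e.g. placebo group
b = rng.normal(12.0, 2.0, size=12)  # e.g. treatment group

t, p = stats.ttest_ind(a, b)        # unpaired: different subjects
print(f"unpaired: t={t:.2f}, p={p:.4f}")
t, p = stats.ttest_rel(a, b)        # paired: same subjects measured twice
print(f"paired:   t={t:.2f}, p={p:.4f}")
```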

29
Q

Examples of the two types of t-test

A
  • non-paired: comparing cannabis treatment to placebo treatment in different groups
  • paired: comparing cannabis treatment to saline treatment in the same group
30
Q

ANOVA

A

Analysis of variance

Used to determine whether ≥3 means are significantly different

Takes into account variance both between (treatment variance) and within (error variance) groups
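
A minimal sketch with scipy and made-up groups; F is the ratio of between-group (treatment) variance to within-group (error) variance:

```python
# One-way ANOVA across three illustrative groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
g1 = rng.normal(10, 2, size=15)
g2 = rng.normal(11, 2, size=15)
g3 = rng.normal(14, 2, size=15)

f, p = stats.f_oneway(g1, g2, g3)
print(f"F={f:.2f}, p={p:.4f}")
```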

31
Q

1 factor ANOVA

A

Used when examining different treatments on different groups

32
Q

Repeated measures ANOVA

A

Used when investigating different treatments on the same group

33
Q

Multilevel ANOVA

A

Used when investigating ≥2 independent variables and the interactions between then
- provides a separate F value for each independent variable and interaction

34
Q

Multiple comparisons

A

Using multiple t-tests is advised against, as it increases the type I error rate
- a Bonferroni correction counteracts this (see the sketch below)

Hence why ANOVAs are preferred for ≥3 groups
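
A hedged sketch of the Bonferroni correction with statsmodels; the p-values are made up (e.g. from three pairwise t-tests):

```python
# Bonferroni-adjust a set of p-values from multiple comparisons.
from statsmodels.stats.multitest import multipletests

pvals = [0.01, 0.04, 0.03]  # illustrative raw p-values
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
print(p_adj)   # each p multiplied by the number of tests (capped at 1)
print(reject)  # which null hypotheses survive the correction
```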

35
Q

Post-hoc tests

A

aka multiple comparisons tests

Used after completing an ANOVA to determine which groups are significantly different
- Dunnett test
- Tukey-Kramer test

36
Q

Pseudo-replication

A

occurs when the number of measured values or data points exceeds the number of genuine replicates
- eg: confusing # slices with # animals

Leads to an inflation of sample size, thus artificial inflation of power

37
Q

Linear mixed model analysis (when used + assumptions)

A

A statistical method used when data is not independent, and errors are correlated

Assumptions:
- does not assume independence of data
- does not assume balanced design
- does not assume homogenous variance
- assumes random sampling
- covariance structure must be specified
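
A hedged sketch with statsmodels: a random intercept per subject handles the correlation between repeated measures (the column names and data are illustrative).

```python
# Linear mixed model: response ~ time, with subject as the grouping factor.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "subject": np.repeat(np.arange(10), 4),  # 10 subjects x 4 timepoints
    "time": np.tile(np.arange(4), 10),
})
df["response"] = 2.0 + 0.5 * df["time"] + rng.normal(0, 1, len(df))

fit = smf.mixedlm("response ~ time", df, groups=df["subject"]).fit()
print(fit.summary())  # fixed effect of time + random subject intercepts
```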

38
Q

Covariance

A

Describes the relationship between two variables, calculated from their deviations about the means and the sample size
- the direction of the relationship is defined by the sign (valence)
- being scale-dependent, it does not inform the gradient (strength) of the relationship

39
Q

Correlation (R)

A

Measures the degree of association between two variables
- not sensitive to scale
- quantifies strength of correlation

Defined as a number r, where -1 ≤ r ≤ 1
- 0 < r ≤ 1: positive correlation
- r = 0: no correlation
- -1 ≤ r < 0: negative correlation
** |r| closer to 1 = stronger correlation

Calculated from the covariance of (x, y) with respect to the individual variances of x and y
- Pearson’s (parametric)
- Spearman’s (non-parametric)
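
A minimal scipy sketch on made-up data, computing both coefficients:

```python
# Pearson (parametric) vs Spearman (non-parametric) correlation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(size=50)
y = 2 * x + rng.normal(scale=0.5, size=50)  # strong positive relationship

r, p = stats.pearsonr(x, y)
rho, p_s = stats.spearmanr(x, y)
print(f"Pearson r={r:.2f} (p={p:.3g}), Spearman rho={rho:.2f} (p={p_s:.3g})")
```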

40
Q

R^2

A

The coefficient of determination:
- A metric of correlation that allows comparison of two correlations

= (variance around mean - variance around line) / variance around mean
= 1 - RSS/TSS

R^2 should be ≥ 0.80

eg: R^2 = 0.80
= the relationship between the two variables accounts for 80% of the variation
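
A worked numpy sketch of the formula, on made-up data:

```python
# R^2 = 1 - RSS/TSS for a least-squares line.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept
rss = np.sum((y - predicted) ** 2)  # variation not explained by the line
tss = np.sum((y - y.mean()) ** 2)   # variation around the mean
print(f"R^2 = {1 - rss / tss:.3f}")
```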

41
Q

Regression analyses

A

Statistical method that allows examination of the relationship between 2+ variables of interest through the generation of a line of best fit
- linear or non-linear
- t-tests/ANOVA can be used to determine significance of regression

42
Q

Sum of squares

A

Total sum of squares (TSS) = variation of data about the mean

Residual sum of squares (RSS) = variation not explained by the regression line

Regression sum of squares (SSR) = variation explained by the regression (note: TSS = RSS + SSR)

43
Q

Simple regression

A

A statistical method that allows examination of the relationship between two variables of interest
- Calculate residual sum of squares
- Smaller RSS indicates a better fit
- used for all standard curves
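
A hedged sketch of a standard-curve fit with scipy (the concentration and absorbance values are made up):

```python
# Simple linear regression of an illustrative standard curve,
# then interpolation of an unknown from the fitted line.
import numpy as np
from scipy import stats

conc = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
absorbance = np.array([0.02, 0.11, 0.20, 0.41, 0.79])

fit = stats.linregress(conc, absorbance)
print(f"slope={fit.slope:.3f}, intercept={fit.intercept:.3f}, R^2={fit.rvalue**2:.3f}")
print((0.30 - fit.intercept) / fit.slope)  # concentration of an unknown sample
```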

44
Q

T-test for regression

A

The regression coefficient (slope) divided by the standard error of the slope coefficient
= b/SE(b)

  • can also be expressed as a confidence interval
    = b ± t(α) × SE(b)
  • typically set at 95%, indicating the range that contains the true slope with 95% confidence
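
A worked sketch using the standard error that scipy’s linregress returns (data as in the standard-curve sketch above):

```python
# t = b / SE(b) and the 95% confidence interval for the slope.
import numpy as np
from scipy import stats

x = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
y = np.array([0.02, 0.11, 0.20, 0.41, 0.79])
fit = stats.linregress(x, y)

t_stat = fit.slope / fit.stderr             # b / SE(b)
t_crit = stats.t.ppf(0.975, df=len(x) - 2)  # two-sided 95% critical value
lo, hi = fit.slope - t_crit * fit.stderr, fit.slope + t_crit * fit.stderr
print(f"t={t_stat:.2f}, 95% CI for slope = ({lo:.3f}, {hi:.3f})")
```
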
45
Q

ANOVA for regression

A

Determines whether the amount of variation accounted for by the regression line (SSR) is greater than the variation NOT explained (RSS)
- signal > noise

46
Q

Assumptions for t-test/ANOVA for regression

A
  • residuals are normally distributed
  • constant variance (SD) of residuals
  • independent samples

If these are not fulfilled, the type I error rate increases

47
Q

Non-linear regression

A

A statistical method that uses calculus and matrix algebra to determine the line of best fit for a non-linear relationship
- requires initial parameter estimates (e.g. mean, SD)
- can be used to interpolate values
- useful for obtaining Bmax, Ka, EC50 etc…
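
A hedged sketch with scipy’s curve_fit: a Hill-type model fitted to made-up concentration-response data to estimate EC50 (the model form and initial guesses are illustrative).

```python
# Non-linear regression of a sigmoidal concentration-response curve.
import numpy as np
from scipy.optimize import curve_fit

def hill(conc, bottom, top, ec50, slope):
    return bottom + (top - bottom) / (1 + (ec50 / conc) ** slope)

conc = np.array([0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0])
resp = np.array([2.0, 5.0, 14.0, 38.0, 72.0, 91.0, 98.0])

p0 = [0.0, 100.0, 0.3, 1.0]  # initial parameter estimates, as the card notes
params, _ = curve_fit(hill, conc, resp, p0=p0)
print(f"EC50 ~= {params[2]:.2f}")
```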

48
Q

Linearising transform

A

Data can be transformed so that it fits the assumptions for linear regression
eg:
- Scatchard plots for binding data
- Lineweaver-Burk plots for enzyme kinetics
- logarithmic plots for kinetic data

TRANSFORMS DISTORT THE ERROR
- violates the regression assumptions of normally distributed errors and ~equal SE for each x value

49
Q

Issues with Scatchard plots

A

X (bound drug) is often used to calculate Y (bound/free),
i.e. the independent variable is part of the dependent variable
- results in inaccurate Y values
- violates assumptions of linear regression (normal distribution and homoscedasticity; equal variance of errors)

50
Q

Multivariate statistics

A

Statistical analyses used when there are multiple dependent and/or independent variables
- used commonly in clinical neuropharmacology
- becoming more common in genomics and proteomics

51
Q

Multiple linear regression

A

An equation composed of multiple regression coefficients for different independent variables (x1,x2) but with a single dependent variable (y)
y = b1x1 + b2x2 +… + c

requires an adjusted R^2 to take the number of variables into account as a function of sample size
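
A hedged sketch with statsmodels (the data and column names are made up); the fit reports both R^2 and adjusted R^2:

```python
# Multiple linear regression: y = b1*x1 + b2*x2 + c.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
df = pd.DataFrame({"x1": rng.normal(size=40), "x2": rng.normal(size=40)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(0, 0.5, size=40)

fit = smf.ols("y ~ x1 + x2", data=df).fit()
print(fit.params)                      # c, b1, b2
print(fit.rsquared, fit.rsquared_adj)  # adjusted for the number of predictors
```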

52
Q

Multi-collinearity

A

Occurs when regression variables are highly correlated, resulting in an inflated estimate of variance through the sum of squares
- inaccurate coefficients
- can lead to a significant F value but no significant differences between any specific groups

The highly correlated variables should be removed, as they are REDUNDANT

53
Q

Principal component analysis

A
  • identifies the most important features (principal components) that contribute to variation
  • plots these components in order of importance according to their ‘eigenvalue’
  • the second PC is always perpendicular (orthogonal) to the first
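
A hedged sketch with scikit-learn on made-up data; explained_variance_ holds the eigenvalues, in decreasing order:

```python
# PCA: extract orthogonal components ranked by explained variation.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)  # correlated feature

pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_)        # eigenvalues, largest first
print(pca.explained_variance_ratio_)  # fraction of variation per component
```
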
54
Q

Discriminant analysis

A

A statistical method that helps you to identify the most important variables that distinguish the different groups in data
- Principal component analysis
- Factor analysis

55
Q

Factor analysis

A

Used to simplify complex data by identifying common factors that explain the relationships between dependent variables

56
Q

Random forest classification

A

A machine learning method that utilises multiple ‘decision trees’ and finds the average to give a final result
- can be used to determine how good an independent variable is at predicting the dependent variable
- error plateaus after ~100 trees
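
A hedged sketch with scikit-learn on synthetic data; feature_importances_ indicates how well each independent variable predicts the outcome:

```python
# Random forest classification with 100 decision trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)  # per-variable predictive contribution
```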

57
Q

Eigenvalues

A

‘components’ or ‘factors’ (mathematically known as ‘roots’) that explain most of the variation in the data
- In an analogous way to ANOVA, these eigenvalues represent the major sources of variation in the covariance matrix

58
Q

Cluster analysis

A

An exploratory technique often used on very large data sets to show variables that typically vary together, i.e. have a relationship
- Results often shown using a ‘dendrogram’
- often requires a ‘z’ transform (standardisation)
- Different algorithms can be used to determine clusters
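
A hedged sketch with scipy: z-score the variables, cluster hierarchically (Ward’s method is one of several possible algorithms), and draw the dendrogram (data is made up).

```python
# Hierarchical cluster analysis with a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.stats import zscore

rng = np.random.default_rng(7)
X = zscore(rng.normal(size=(20, 4)), axis=0)  # ‘z’ transform each variable

Z = linkage(X, method="ward")
dendrogram(Z)
plt.show()
```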

59
Q

Canonical correlation analysis

A

Used to identify and measure the associations between two sets of variables. Canonical correlation is appropriate in the same situations as multiple regression, but where there are multiple intercorrelated outcome variables.

60
Q

Non-parametric multivariate analysis

A
  • make few assumptions about the data
    eg: random forest classification/regression, PCA, and cluster analysis
61
Q

Scree plot

A

A way of interpreting data from a PCA
- plots each principal component in order based on amount of variation that it explains (eigenvalue)
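
A minimal matplotlib sketch (made-up data), plotting the eigenvalues from a scikit-learn PCA:

```python
# Scree plot: eigenvalue of each principal component, in order.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)
eigenvalues = PCA().fit(rng.normal(size=(100, 5))).explained_variance_

plt.bar(range(1, len(eigenvalues) + 1), eigenvalues)
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue (variation explained)")
plt.show()
```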
