Statistics Flashcards
Name two types of research
- Quantitative
- Qualitative
Define qualitative research
Qualitative involves meaning, opinion, attitudes and beliefs, seeking deep information, answering complex questions, social understanding
Define quantitative research
Quantitative involves numbers, proportions, statistics, testing hypotheses, looking at cause and effect
Name 4 types of data
- Categorical (discrete)
- Numerical (continuous)
- Calculated
- Censored
What type of data have you got?
Smokers / Non-smokers
Categorical
Two possible categories
Binary
What type of data have you got?
Married / Single / Divorced / Widowed
Categorical
More than two categories
No particular order
Nominal
What type of data have you got?
Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree
OR
BPE index (0, 1, 2, 3, 4)
Categorical
More than two categories
Order is important, but no numerical relationship between numbers
Ordinal
What type of data have you got?
Number of sick days taken
OR
Number of fillings
Numerical
On a scale
Integers only along the scale; numbers are related
Count
What type of data have you got?
Temperature in degrees Celsius
Numerical
No real zero
Can have data at any point along the scale
Interval – 20.4 degrees Celsius is not twice as hot as 10.2 degrees Celsius
What type of data have you got?
Height in cm
Numerical
You can have zero cm
Can have data at any point along the scale
Ratio – 20.4 cm is twice as high as 10.2 cm
What type of data have you got?
BMI (Body Mass Index)
OR
HAD score (Hospital Anxiety and Depression score)
Calculated
The data have been derived from a calculation based on other measurements
Summarize the types of data (4 types)
Categorical
- Binary
- Nominal
- Ordinal
Numerical
- Count
- Interval
- Ratio
Calculated - e.g. BMI
Censored - e.g. loss to follow-up
Why aren’t the mean, mode and median the same?
5, 0, 2, 5, 24, 0, 7, 1, 15, 0, 16, 2, 4, 6, 2, 3, 5, 10, 3, 5
Mean = 5.75
Median = 4.5
Mode = 5
The data may be “skewed”
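The three figures can be checked with a short sketch using Python's standard statistics module (the variable names are illustrative):

```python
import statistics

# Data from the card above
values = [5, 0, 2, 5, 24, 0, 7, 1, 15, 0, 16, 2, 4, 6, 2, 3, 5, 10, 3, 5]

mean = statistics.mean(values)      # arithmetic average
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value

print(mean, median, mode)
```

The mean sits above the median because a few large values (15, 16, 24) pull it upwards, which is what a right (positive) skew looks like.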
Describe normal distribution
Mean = median = mode
Describe skewed distribution
Mean ≠ median ≠ mode
Describe parametric vs non-parametric statistics
Parametric Statistics
- Normal distribution
- Based on mean and standard deviation
Non-parametric Statistics
- Skewed distribution or where you can’t prove it’s a normal distribution (small sample)
- Based on median and interquartile range
How can I tell if my data forms a normal distribution?
Mean = Median = Mode
Plot the data and judge by eye
Do a test for normality e.g. Shapiro-Wilk (small sample) or Kolmogorov-Smirnov (large sample)
Your sample should have more than 30 values to use parametric statistics
When do you use median?
Median is used when the data is skewed or there is not enough data to tell if there is skew (typically fewer than 30 values in the sample)
When is hypothesis testing used?
Used in quantitative research to clarify what it is we are testing statistically
Define null hypothesis
Null hypothesis (H0) assumes that there is no difference between the groups being tested
E.g. if we want to know if the IQ measurements of boys and girls are different at age 11, then we can say H0 = the IQ values of boys and girls at age 11 are not different
Define alternative hypothesis
Alternative hypothesis (H1) - this holds if the null hypothesis can be rejected
E.g. H1 = the IQ values of boys and girls at age 11 are different
When would you use one-tailed or two-tailed tests?
Two-tailed tests – if the alternative hypothesis doesn’t have a direction
E.g. H0 = the number of fillings in men and women is the same; H1 = the number of fillings in men and women is different (could be less or more)
One-tailed tests – if the alternative hypothesis does have a direction
E.g. H0 = the number of fillings in men and women is the same; H1 = men have more fillings than women
Give one example test for each type of data
Categorical
Numerical, parametric
Numerical, non-parametric
Correlation
Dependence of variables
Categorical – Chi squared test
Numerical, parametric – Student’s t test
Numerical, non-parametric – Mann Whitney U test
Correlation – Pearson correlation coefficient
Dependence of variables – Simple linear regression
Describe the importance of p-values and the significance level (α) with respect to the null hypothesis
The maximum probability of making a Type I error is α – we call this the significance level, and we usually focus on an error level of 5%
This means that, when the null hypothesis is true, there is a 5% chance that we will reject it from our results and therefore get a false positive result
We can reject the null hypothesis if the p value that we calculate is less than α (5%)
At the 5% level of significance (95% confidence), we say that we can reject the null hypothesis if p < 5%, i.e. p < 0.05
This means that, if the null hypothesis were true, there would be a less than 5% chance of seeing a result at least as extreme as ours, so we infer that there is an effect in whatever we have tested

Describe power in statistics
We use power calculations to decide whether our research sample is big enough to give us meaningful data
The power is the probability of correctly rejecting the null hypothesis when it is false i.e. we detect as statistically significant a real effect
Power = 1 – β (1 minus the probability of making a Type II error, i.e. of reporting a false negative)
Power should be at least 80%
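A rough sketch of a power calculation, assuming a two-tailed comparison of two group means and a normal approximation. The function name, the effect size of 0.5 standard deviations and the group size of 64 are illustrative assumptions, not values from these cards:

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed, two-sample comparison of means.

    effect_size is the difference in means divided by the common standard
    deviation. Uses a normal approximation and ignores the very small
    chance of rejecting in the wrong direction.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)                # about 1.96 for alpha = 0.05
    z_effect = effect_size * sqrt(n_per_group / 2)   # expected test statistic
    return z.cdf(z_effect - z_crit)

power = approx_power(effect_size=0.5, n_per_group=64)
print(f"power = {power:.2f}, beta = {1 - power:.2f}")
```

Increasing n_per_group raises the power, which is why the calculation is normally done before recruiting a sample.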
What is the ‘Gold standard’ in quantitative research
Double-blind randomized controlled trial
Describe crossover trial
Same methods as RCT
Group A regime
- Treatment
- “Washout”
- Placebo
Group B regime
- Placebo
- “Washout”
- Treatment
Describe cohort study
Follow a sample of people over time
Longitudinal, prospective
Useful when it is not appropriate to compare treatments
Describe case-control study
Compares people with and without disease
Try to identify causes
Describe cross-sectional study
Surveys a sample of population at a point in time (snapshot)
Describe case report
Look at rare conditions and/or novel treatments
Recall the hierarchy of evidence (6 points)

List the tests for categorical data
Categorical data
2 categories:
- 1 group – Z test
- 2 groups (paired) – McNemar
- 2 groups (independent) – Chi squared; Fisher’s exact
- > 2 groups – Chi squared
More than 2 categories:
- Chi squared
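As an illustration of the chi squared test for two independent groups and two categories, here is a minimal sketch in plain Python. The counts are hypothetical, the function name is an assumption, and no continuity correction is applied:

```python
from math import erfc, sqrt

def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p value for a 2 x 2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if group and outcome were unrelated
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    # With 1 degree of freedom the upper-tail probability has a closed form
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: rows = two independent groups, columns = outcome yes / no
chi2, p_value = chi_squared_2x2([[55, 45], [35, 65]])
print(f"chi-squared = {chi2:.2f}, p = {p_value:.4f}")
```

The closed-form p value only holds for a 2 x 2 table (1 degree of freedom); larger tables need the general chi squared distribution.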
List the tests for numerical (parametric) data
Numerical - parametric
- 1 group – 1 sample t test
- 2 groups (paired) – paired t test
- 2 groups (unpaired) – unpaired t test
- > 2 groups – ANOVA
- > 2 testing levels – MANOVA
List the tests for numerical (non-parametric) data
Numerical – non-parametric
- 1 group – Sign test
- 2 groups (paired) – Wilcoxon signed rank test
- 2 groups (unpaired) – Wilcoxon rank sum test (Mann Whitney U test)
- > 2 groups – Kruskal-Wallis test
Name a test of association - parametric data
Correlation – Pearson’s correlation
Name a test of association - non-parametric data
Correlation - Spearman rank correlation
Name 2 tests of prediction
Simple linear regression
Multiple linear regression
Define standard deviation vs standard error
Standard Deviation = the variability of a sample
Describes the data (quantifies the scatter)
Standard Error = the variability of all the sample means
Facilitates an estimate of the mean of a population based on a sample mean (quantifies how precisely we can know the population mean)
Describe confidence interval (CI)
We can be 95% confident that the population mean lies somewhere between the sample mean + or – 1.96 standard errors of the mean.
Wide confidence intervals are imprecise
Confidence intervals can be used to assess clinical importance of trial results
If the confidence interval for a difference crosses 0 (or 1 for a ratio such as relative risk or odds ratio) - then not statistically significant
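A minimal sketch tying standard deviation, standard error and the 95% confidence interval together. The heights are hypothetical values chosen for illustration. The 1.96 multiplier follows the card, although with a sample this small a t-based multiplier would normally be used instead:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of heights in cm (illustrative values only)
sample = [160, 165, 170, 175, 180, 162, 168, 172, 178, 170]

sample_mean = mean(sample)
sd = stdev(sample)             # sample standard deviation: scatter of the data
se = sd / sqrt(len(sample))    # standard error: precision of the sample mean

lower = sample_mean - 1.96 * se
upper = sample_mean + 1.96 * se
print(f"mean = {sample_mean:.1f}, SD = {sd:.2f}, SE = {se:.2f}")
print(f"95% CI = {lower:.1f} to {upper:.1f}")
```

Because the standard error divides by the square root of the sample size, a larger sample gives a narrower (more precise) interval for the same scatter.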
Recall an example comparing statistical significance and clinical significance
Example 1 – trial of a new toothpaste
A study shows that there is a statistically significant reduction in caries (p = 0.032)
The actual reduction, though statistically significant, is 1%.
Would you now recommend changing to that toothpaste?
Describe correlation
Correlation measures association between continuous variables.
For example – mortality from lung cancer and cigarette smoking
CAUTION: correlation measures association, not causation. It measures the extent of the association – how great is the effect?
Describe correlation coefficient (r)
The correlation coefficient, r, is a value between -1 and +1.
It measures the strength and direction of the association
Positive r = positive association (a high score in Biology correlates with a high score in Chemistry)
Negative r = negative association (a high score in Biology correlates with a low score in Chemistry)
Describe the coefficient of determination
Example
r = 0.662 (positive correlation); r² = 0.438 (coefficient of determination)
r² = percentage variability – e.g. how much variability in chemistry exam scores is explained by biology scores?
e.g. 0.438 = 43.8% (explained by biology scores). Therefore 56.2% of the variability is due to other factors.
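The arithmetic behind the example, as a short sketch (variable names are illustrative):

```python
r = 0.662                # correlation coefficient from the example
r_squared = r ** 2       # coefficient of determination

explained = round(r_squared * 100, 1)     # % of variability explained
unexplained = round(100 - explained, 1)   # % left to other factors
print(explained, unexplained)
```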
Recall which correlation co-efficient to use for parametric and non-parametric data
Parametric data – use Pearson correlation coefficient
Non-parametric data – use Spearman rank correlation coefficient
Describe purpose of linear regression
Purpose – to predict one variable from another
Follows on from correlation
Example – biology and chemistry scores; how well can we predict chemistry scores from the biology scores?
Involves plotting biology (independent variable) on the x axis against chemistry (dependent variable) on the y axis.
Based on y = mx + c, i.e. draws line of best fit.
The line of best fit minimises the sum of the squared vertical distances (residuals) between each point and the line.
y = Chemistry mark, x = Biology mark, m = gradient of line, c = intercept
NB. If data is not linear, do transformation first (log, exponential, square, root – whatever works best)
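A minimal least-squares sketch of y = mx + c in plain Python. The exam marks are hypothetical and the function name is an assumption made for illustration:

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares gradient m and intercept c for y = mx + c."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    m = s_xy / s_xx
    c = y_bar - m * x_bar
    return m, c

# Hypothetical exam marks (illustrative values only)
biology = [45, 52, 60, 68, 75, 83]
chemistry = [50, 49, 63, 66, 80, 82]

m, c = fit_line(biology, chemistry)
# Residual = observed Chemistry mark minus the mark predicted by the line
residuals = [yi - (m * xi + c) for xi, yi in zip(biology, chemistry)]
predicted = m * 70 + c   # predicted Chemistry mark for a Biology mark of 70
print(f"m = {m:.3f}, c = {c:.3f}, prediction = {predicted:.1f}")
```

The residuals of a least-squares line always sum to zero, which is a quick sanity check on the fit.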
When is multiple linear regression (MLR) used? Recall 2 examples
Used in public health and epidemiology
Example 1 – looking at scores for Biology and Chemistry in male and female students. What amount of variability between Biology and Chemistry scores is due to gender?
Example 2 – we know that height and weight are correlated. How does BP correlate with height?
Do MLR on height, weight and gender against BP
P value – given for each parameter when adjusted for the other parameters (confounding variables)
Will tell us if BP is related to height, once weight and gender factors are taken out of the reckoning
When is ANOVA (analysis of variance) used? Recall the anaesthetics example
Analysis of Variance (ANOVA) is used to compare the means of a continuous variable across more than two groups
Example - five Local Anaesthetics – looking at time to take effect (single factor)
H0: LA1 = LA2 = LA3 = LA4 = LA5
H1: at least one of the above LAs differs from the others
Test generates an F ratio and a p value
If the calculated F ratio exceeds the critical value for F, we can reject the null hypothesis – this is the case in our example, therefore at least one of the mean times to take effect is different from the other means
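How the F ratio is built can be sketched in plain Python. The onset times are hypothetical and only three groups are used to keep the example short. The function name is an assumption:

```python
from statistics import mean

def one_way_anova_f(groups):
    """F ratio for a one-way ANOVA: between-group / within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n_total = len(groups), len(all_values)

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    ms_between = ss_between / (k - 1)        # k - 1 degrees of freedom
    ms_within = ss_within / (n_total - k)    # N - k degrees of freedom
    return ms_between / ms_within

# Hypothetical times to take effect (minutes) for three anaesthetics
times = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio = one_way_anova_f(times)
print(f"F = {f_ratio:.1f}")
```

The resulting F ratio is then compared with the critical value of F for (k - 1, N - k) degrees of freedom, taken from tables or statistical software.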
What is ANOVA post-hoc analysis?
Post-hoc analysis – how do we find out which mean is different from the others (or more than one mean that is different)?
Option 1
- Carry out Tukey test (or similar) to see which group mean(s) is/are different from the others
- Tukey statistic becomes the value to compare means e.g. if the difference between the means of LA1 and LA2 is greater than the Tukey statistic, then the difference between those means is significant, and vice versa
Option 2
- Carry out two-sample, two-tailed t tests between pairs of means.
- Then apply the Bonferroni correction. This adjusts the significance level from p < 0.05 as follows:
- Divide 0.05 by the number of tests being carried out
- E.g. if you do 5 tests, the significance level becomes 0.05/5 = 0.01 – raises the “burden of proof” for any one test carried out (p must now be < 0.01 for each t test)
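The Bonferroni arithmetic as a short sketch (the function name is an assumption). It also shows how many t tests are needed if every pair among five anaesthetics is compared:

```python
from math import comb

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests

# The 5-test example from the card
print(bonferroni_alpha(0.05, 5))

# Comparing every pair among 5 anaesthetics needs C(5, 2) = 10 t tests
n_pairs = comb(5, 2)
print(n_pairs, bonferroni_alpha(0.05, n_pairs))
```

The more tests are carried out, the smaller the p value each one must reach, so the correction becomes conservative when many pairs are compared.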
Recall which ANOVA method to use when considering one or more factors
If one factor is being considered, use 1-way ANOVA
E.g. time taken for different LAs to take effect
If two factors are being considered, use 2-way ANOVA
E.g. time taken for different LAs to take effect in men and women
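If more than two factors are being considered, use a multi-way (factorial) ANOVA; MANOVA (multivariate ANOVA) applies when there is more than one outcome variable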
E.g. time taken for different LAs to take effect in men and women of different ages, weights, liver function etc
What is survival analysis and give 3 examples of different types
Measures the time taken to reach an event, e.g. death, a significant event such as MI, or failure of a resin-retained bridge
Examples:
- Kaplan-Meier survival analysis
- Log-rank test – a non-parametric test where the null hypothesis states that the pattern of survival for two groups is the same
- Cox proportional hazard model – a regression technique that allows modelling to be carried out in order to adjust for potentially confounding variables
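A minimal sketch of the Kaplan-Meier estimate in plain Python, showing how censored observations (e.g. loss to follow-up) stay in the at-risk count until they drop out. The follow-up data and function name are hypothetical:

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time an event occurs.

    events[i] is True if the event was observed at times[i] and False if
    the observation was censored (e.g. lost to follow-up).
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Everyone still being followed just before time t
        at_risk = sum(1 for ti in times if ti >= t)
        # Events (not censorings) that happen exactly at time t
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e)
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
    return curve

# Hypothetical follow-up in months; False marks a censored observation
times = [2, 3, 4, 5, 6]
events = [True, False, True, True, False]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

The survival probability only steps down at event times. Censored observations lower the number at risk for later steps without causing a step themselves.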
Recall survival analysis outputs
Survival analysis output:
- Median survival time – the time at which the probability of survival is 0.5
- Survival rate – the proportion of individuals surviving longer than time t
- Hazard function – risk of having an event at time t
- Survivor plot – survival rate plotted against time
Describe relative risk (RR)
Shows the extent of difference between exposed and unexposed groups.
E.g. study results for different toothpastes and effect on gingivitis
- Exposed (used Sparkledent) = cure 55%; disease 45%
- Unexposed (established toothpaste) = cure 35%; disease 65%
In this case we can calculate RR for those who are cured and those who still have disease
RR cure = 55% (exposed and cured) / 35% (not exposed and cured) = 1.57
RR disease = 45% (exposed and diseased) / 65% (not exposed and diseased) = 0.69
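The two relative risks can be reproduced with a short sketch (the function name is an assumption):

```python
def relative_risk(risk_exposed, risk_unexposed):
    """Risk in the exposed group divided by risk in the unexposed group."""
    return risk_exposed / risk_unexposed

rr_cure = relative_risk(0.55, 0.35)      # cured: exposed vs unexposed
rr_disease = relative_risk(0.45, 0.65)   # diseased: exposed vs unexposed
print(round(rr_cure, 2), round(rr_disease, 2))
```

An RR above 1 means the outcome is more likely in the exposed group, and an RR below 1 means it is less likely.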
Describe risk difference
The difference in risk (of cure or of disease) between the exposed and unexposed groups
E.g. study results for different toothpastes and effect on gingivitis
- Exposed (used Sparkledent) = cure 55%; disease 45%
- Unexposed (established toothpaste) = cure 35%; disease 65%
Risk difference for cure = 55% - 35% = 20%
Risk difference for disease = 45% - 65% = -20%
From the risk difference (as a percentage) we can calculate NNT = (1 / risk difference) x 100 = 100 / 20 = 5
Define number needed to treat (NNT)
NNT = (1 / risk difference) x 100, with the risk difference expressed as a percentage
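A short sketch of risk difference and NNT using the toothpaste figures, with risks written as proportions rather than percentages so that NNT is simply 1 / risk difference (function names are assumptions):

```python
def risk_difference(risk_exposed, risk_unexposed):
    """Absolute difference in risk between exposed and unexposed groups."""
    return risk_exposed - risk_unexposed

def number_needed_to_treat(rd):
    """Number of people who must be treated for one extra good outcome."""
    return 1 / abs(rd)

rd_cure = risk_difference(0.55, 0.35)   # 20 percentage points
nnt = number_needed_to_treat(rd_cure)
print(round(rd_cure, 2), round(nnt, 2))
```

Here about five people would need to switch toothpaste for one additional cure.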
Describe odds ratio (OR)
Requires the number of cases rather than percentages.
Calculates the odds of being cured against still having the disease.
E.g. study results for different toothpastes and effect on gingivitis
- Exposed (used Sparkledent) = cure 55%; disease 45%
- Unexposed (established toothpaste) = cure 35%; disease 65%
The disease odds ratio would tell us what is the chance of still having gingivitis depending on whether a person was exposed or not exposed to the new toothpaste.
Odds ratio compares odds (people with the outcome against people without it), whereas relative risk compares risks (people with the outcome against everyone in the group)
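A sketch of the odds ratio for the toothpaste example. It assumes 100 participants in each group so that the percentages can be read as counts. The group sizes and the function name are assumptions made for illustration:

```python
def odds_ratio(events_exposed, non_events_exposed,
               events_unexposed, non_events_unexposed):
    """Odds of the event in the exposed group divided by the odds in the unexposed group."""
    odds_exposed = events_exposed / non_events_exposed
    odds_unexposed = events_unexposed / non_events_unexposed
    return odds_exposed / odds_unexposed

# Assumed 100 people per group: exposed 55 cured / 45 diseased,
# unexposed 35 cured / 65 diseased
or_disease = odds_ratio(45, 55, 65, 35)   # odds of still having gingivitis
or_cure = odds_ratio(55, 45, 35, 65)      # odds of being cured
print(round(or_disease, 2), round(or_cure, 2))
```

The two odds ratios are reciprocals of each other, unlike the two relative risks, because odds treat the outcome and its absence symmetrically.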