Statistics Flashcards

Question 1

Q

Name two types of research

Answer

A

Quantitative
Qualitative

Question 2

Q

Define qualitative research

Answer

A

Qualitative involves meaning, opinion, attitudes and beliefs, seeking deep information, answering complex questions, social understanding

Question 3

Q

Define quantitative research

Answer

A

Quantitative involves numbers, proportions, statistics, testing hypotheses, looking at cause and effect

Question 4

Q

Name 4 types of data

Answer

A

Categorical (discrete)
Numerical (continuous)
Calculated
Censored

Question 5

Q

What type of data have you got?

Smokers / Non-smokers

Answer

A

Categorical

Two possible categories

Binary

Question 6

Q

What type of data have you got?

Married / Single / Divorced / Widowed

Answer

A

Categorical

More than two categories

No particular order

Nominal

Question 7

Q

What type of data have you got?

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

OR

BPE index (0, 1, 2, 3, 4)

Answer

A

Categorical

More than two categories

Order is important, but no numerical relationship between numbers

Ordinal

Question 8

Q

What type of data have you got?

Number of sick days taken

OR

Number of fillings

Answer

A

Numerical

On a scale

Integers only along the scale; numbers are related

Count

Question 9

Q

What type of data have you got?

Temperature in degrees Celsius

Answer

A

Numerical

No real zero

Can have data at any point along the scale

Interval – 20.4 degrees Celsius is not twice as hot as 10.2 degrees Celsius

Question 10

Q

What type of data have you got?

Height in cm

Answer

A

Numerical

You can have zero cm

Can have data at any point along the scale

Ratio – 20.4 cm is twice as high as 10.2 cm

Question 11

Q

What type of data have you got?

BMI (Body Mass Index)

OR

HAD score (Hospital Anxiety and Depression score)

Answer

A

Calculated

The data have been derived from a calculation based on other measurements

Question 12

Q

Summarize the types of data (4 types)

Answer

A

Categorical

Binary
Nominal
Ordinal

Numerical

Count
Interval
Ratio

Calculated - e.g. BMI

Censored - e.g. loss to follow-up

Question 13

Q

Why aren’t the mean, mode and median the same?

5, 0, 2, 5, 24, 0, 7, 1, 15, 0, 16, 2, 4, 6, 2, 3, 5, 10, 3, 5

Mean = 5.75
Median = 4.5
Mode = 5

Answer

A

The data may be “skewed”
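The three summary values on this card can be checked with a short sketch using Python's standard `statistics` module. The variable names are illustrative only.

```python
from statistics import mean, median, mode

# Data from the flashcard
data = [5, 0, 2, 5, 24, 0, 7, 1, 15, 0, 16, 2, 4, 6, 2, 3, 5, 10, 3, 5]

data_mean = mean(data)      # arithmetic average
data_median = median(data)  # middle value of the sorted data
data_mode = mode(data)      # most frequent value

print(data_mean, data_median, data_mode)
```

The mean sits above the median because a few large values (15, 16, 24) pull it upwards, which is what a right (positive) skew looks like.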

Question 14

Q

Describe normal distribution

Answer

A

Mean = median = mode

Question 15

Q

Describe skewed distribution

Answer

A

Mean ≠ median ≠ mode

Question 16

Q

Describe parametric vs non-parametric statistics

Answer

A

Parametric Statistics

Normal distribution
Based on mean and standard deviation

Non-parametric Statistics

Skewed distribution or where you can’t prove it’s a normal distribution (small sample)
Based on median and interquartile range

Question 17

Q

How can I tell if my data forms a normal distribution?

Answer

A

Mean = Median = Mode

Plot the data and judge by eye

Do a test for normality e.g. Shapiro-Wilk (small sample) or Kolmogorov-Smirnov (large sample)

Your sample should have more than 30 values to use parametric statistics

Question 18

Q

When do you use median?

Answer

A

Median is used when the data are skewed or there are not enough data to tell if there is skew (typically fewer than 30 in the sample)

Question 19

Q

When is hypothesis testing used?

Answer

A

Used in quantitative research to clarify what it is we are testing statistically

Question 20

Q

Define null hypothesis

Answer

A

Null hypothesis (H₀) assumes that there is no difference between the groups being tested

E.g. if we want to know if the IQ measurements of boys and girls are different at age 11, then we can say H₀ = the IQ values of boys and girls at age 11 are not different

Question 21

Q

Define alternative hypothesis

Answer

A

Alternative hypothesis (H₁) - this holds if the null hypothesis can be rejected

E.g. H₁ = the IQ values of boys and girls at age 11 are different

Question 22

Q

When would you use one-tailed or two-tailed tests?

Answer

A

Two-tailed tests – if the alternative hypothesis doesn’t have a direction
E.g. H₀ = the number of fillings in men and women is the same; H₁ = the number of fillings in men and women is different (could be less or more)

One-tailed tests – if the alternative hypothesis does have a direction
E.g. H₀ = the number of fillings in men and women is the same; H₁ = men have more fillings than women
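As a rough illustration of how the choice of tails changes the p value, the sketch below converts a test statistic into a one-tailed or two-tailed p value. It assumes the statistic follows a standard normal distribution (a z test); the function name and the value z = 1.8 are hypothetical.

```python
from statistics import NormalDist

def p_value_from_z(z, tails=2):
    """p value for a standard normal test statistic z.

    tails=2: H1 has no direction (a difference either way counts).
    tails=1: H1 predicts a positive z (e.g. men have more fillings).
    """
    if tails == 2:
        return 2 * (1 - NormalDist().cdf(abs(z)))
    return 1 - NormalDist().cdf(z)

z = 1.8  # hypothetical test statistic
p_two = p_value_from_z(z, tails=2)
p_one = p_value_from_z(z, tails=1)
print(round(p_two, 3), round(p_one, 3))
```

With z = 1.8 the one-tailed p value falls below 0.05 while the two-tailed one does not, which is why the direction of H₁ should be fixed before the data are analysed.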

Question 23

Q

Give one example test for each type of data

Categorical

Numerical, parametric

Numerical, non-parametric

Correlation

Dependence of variables

Answer

A

Categorical – Chi squared test

Numerical, parametric – Student’s t test

Numerical, non-parametric – Mann Whitney U test

Correlation – Pearson correlation coefficient

Dependence of variables – Simple linear regression
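Two of the tests named above can be sketched by hand to show what they compute. This is a minimal illustration on made-up data, not a full implementation: it returns only the test statistics, and turning them into p values needs statistical tables or a statistics library.

```python
from math import sqrt
from statistics import mean, variance

def unpaired_t(group_a, group_b):
    """Student's unpaired t statistic (equal variances assumed) and its df."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / se, df

def mann_whitney_u(group_a, group_b):
    """U statistics for two independent samples; ties count as half."""
    u_a = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in group_a for b in group_b
    )
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b

# Hypothetical measurements for two independent groups
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_stat, df = unpaired_t(group_a, group_b)
u_a, u_b = mann_whitney_u(group_a, group_b)
print(round(t_stat, 4), df, u_a, u_b)
```

The t statistic is built from means and variances (parametric), whereas U only counts how often a value in one group exceeds a value in the other (non-parametric).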

Question 24

Q

Describe the importance of p-values and the significance level (α) with respect to the null hypothesis

Answer

A

The maximum probability of making a Type I error that we are willing to accept is α – we call this the significance level, and we usually focus on an error level of 5%

This means that, if the null hypothesis is true, there is a 5% chance that from our results we will reject it and therefore get a false positive result

The p value is the probability of getting results at least as extreme as those observed if the null hypothesis is true

We can reject the null hypothesis if the p value that we calculate is less than α (5%)

At the 5% significance level (95% confidence level), we say that we can reject the null hypothesis if p < 5%, i.e. p < 0.05

If we only reject when p < 0.05, then when the null hypothesis is true we will incorrectly reject it (and infer that there is an effect in whatever we have tested when there isn’t) no more than 5% of the time
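The meaning of the 5% error level can be checked with a small simulation. The sketch below is an illustration under stated assumptions: it repeatedly draws samples from a population where the null hypothesis is true (mean 0, known SD 1), runs a two-tailed z test each time, and counts how often the null hypothesis is rejected. The function name, sample size and seed are hypothetical choices.

```python
import random
from statistics import NormalDist

def type_i_error_rate(n_trials=2000, n=30, alpha=0.05, seed=0):
    """Proportion of z tests that reject H0 when H0 is actually true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    rejections = 0
    for _ in range(n_trials):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = sample_mean / (1 / n ** 0.5)  # known SD of 1, so SE = 1/sqrt(n)
        if abs(z) > z_crit:
            rejections += 1  # a false positive, because H0 is true
    return rejections / n_trials

rate = type_i_error_rate()
print(rate)
```

The printed proportion should land close to 0.05: about one test in twenty rejects a true null hypothesis purely by chance.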

Question 25

Q

Describe power in statistics

Answer

A

We use power calculations to decide whether our research sample is big enough to give us meaningful data

The power is the probability of correctly rejecting the null hypothesis when it is false i.e. we detect as statistically significant a real effect

Power = 1 – β (1 minus the probability of making a Type II error, i.e. of reporting a false negative)

Should be more than 80%
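A power calculation can be sketched as follows. This uses a common normal-approximation formula for comparing two means, n per group = 2(z₁₋α/₂ + z₁₋β)² / d², where d is the expected difference divided by the standard deviation. The formula choice, function name and the effect size of 0.5 are assumptions for illustration, not part of the card.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-tailed comparison of means.

    effect_size = expected difference / SD (supplied by the researcher).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed significance
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

n_80 = n_per_group(0.5)              # 80% power
n_90 = n_per_group(0.5, power=0.90)  # 90% power needs a larger sample
print(n_80, n_90)
```

Asking for more power, or trying to detect a smaller effect, increases the sample size needed.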

Question 26

Q

What is the ‘Gold standard’ in quantitative research?

Answer

A

Double-blind randomized controlled trial

Question 27

Q

Describe crossover trial

Answer

A

Same methods as RCT

Group A regime

Treatment
“Washout”
Placebo

Group B regime

Placebo
“Washout”
Treatment

Question 28

Q

Describe cohort study

Answer

A

Follow a sample of people over time

Longitudinal, prospective

Useful when it is not appropriate to compare treatments

Question 29

Q

Describe case-control study

Answer

A

Compares people with and without disease

Try to identify causes

Question 30

Q

Describe cross-sectional study

Answer

A

Surveys a sample of population at a point in time (snapshot)

Question 31

Q

Describe case report

Answer

A

Look at rare conditions and/or novel treatments

Question 32

Q

Recall the hierarchy of evidence (6 points)

Question 33

Q

List the tests for categorical data

Answer

A

Categorical data

2 categories:

1 group – Z test
2 groups (paired) – McNemar
2 groups (independent) – Chi squared; Fisher’s exact
> 2 groups – Chi squared

More than 2 categories:

Chi squared
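For the case of two independent groups and two categories, the Chi squared test can be sketched by hand. The counts below are hypothetical, and the sketch applies no continuity correction. With 1 degree of freedom the p value can be obtained from the standard library's `math.erfc`.

```python
from math import erfc, sqrt

def chi_squared_2x2(a, b, c, d):
    """Chi squared statistic and p value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    observed = [a, b, c, d]
    # Expected count = row total * column total / grand total
    expected = [
        (a + b) * (a + c) / n, (a + b) * (b + d) / n,
        (c + d) * (a + c) / n, (c + d) * (b + d) / n,
    ]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p = erfc(sqrt(chi2 / 2))  # upper tail of chi squared with 1 df
    return chi2, p

# Hypothetical counts: rows = two groups, columns = outcome yes / no
chi2, p = chi_squared_2x2(55, 45, 35, 65)
print(round(chi2, 2), round(p, 4))
```

When any expected count is small, Fisher’s exact test is the usual alternative, as listed on the card.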

Question 34

Q

List the tests for numerical (parametric) data

Answer

A

Numerical - parametric

1 group – 1 sample t test
2 groups (paired) – paired t test
2 groups (unpaired) – unpaired t test
> 2 groups – ANOVA
> 2 testing levels – MANOVA

Question 35

Q

List the tests for numerical (non-parametric) data

Answer

A

Numerical – non-parametric

1 group – Sign test
2 groups (paired) – Wilcoxon signed rank test
2 groups (unpaired) – Wilcoxon rank sum test (Mann Whitney U test)
> 2 groups – Kruskal-Wallis test

Question 36

Q

Name a test of association - parametric data

Answer

A

Correlation – Pearson’s correlation

Question 37

Q

Name a test of association - non-parametric data

Answer

A

Correlation - Spearman rank correlation

Question 38

Q

Name 2 tests of prediction

Answer

A

Simple linear regression

Multiple linear regression

Question 39

Q

Define standard deviation vs standard error

Answer

A

Standard Deviation = the variability of a sample
Describes the data (quantifies the scatter)

Standard Error = the variability of all the sample means
Facilitates an estimate of the mean of a population based on a sample mean (quantifies how precisely we can know the population mean)
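The relationship between the two can be shown in a few lines: the standard error of the mean is the standard deviation divided by the square root of the sample size. The sketch reuses the data from Question 13 purely as an example.

```python
from math import sqrt
from statistics import stdev

# Example data (the values from Question 13)
data = [5, 0, 2, 5, 24, 0, 7, 1, 15, 0, 16, 2, 4, 6, 2, 3, 5, 10, 3, 5]

sd = stdev(data)           # sample standard deviation: scatter of the values
se = sd / sqrt(len(data))  # standard error: precision of the sample mean

print(round(sd, 2), round(se, 2))
```

Collecting more data leaves the SD roughly unchanged but shrinks the SE, because the sample mean becomes a more precise estimate of the population mean.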

Question 40

Q

Describe confidence interval (CI)

Answer

A

We can be 95% confident that the population mean lies somewhere between the sample mean + or – 1.96 standard errors of the mean.

Wide confidence intervals are imprecise

Confidence intervals can be used to assess clinical importance of trial results

If the confidence interval for a difference crosses 0 (or crosses 1 for a ratio such as relative risk or odds ratio) - then not statistically significant
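A 95% confidence interval for a mean can be sketched as the sample mean plus or minus 1.96 standard errors. The function name and the sample values (mean 100, SD 15, n = 36) are hypothetical, and the sketch assumes the normal approximation is reasonable.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(sample_mean, sd, n, level=0.95):
    """Normal-approximation CI: sample mean +/- z * standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    margin = z * sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: mean 100, SD 15, n = 36, so SE = 2.5
low, high = confidence_interval(100, 15, 36)
print(round(low, 1), round(high, 1))
```

A larger sample gives a smaller standard error and therefore a narrower, more precise interval.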

Question 41

Q

Recall an example comparing statistical significance and clinical significance

Answer

A

Example 1 – trial of a new toothpaste

A study shows that there is a statistically significant reduction in caries (p = 0.032)

The actual reduction, though statistically significant, is 1%.

Would you now recommend changing to that toothpaste?

Question 42

Q

Describe correlation

Answer

A

Correlation measures association between continuous variables.

For example – mortality from lung cancer and cigarette smoking

CAUTION: correlation measures association, not causation. It measures the extent of the association – how great is the effect?

Question 43

Q

Describe correlation coefficient (r)

Answer

A

The correlation coefficient, r, is a value between -1 and +1.

It measures the strength and direction of the association

Positive r = positive association (a high score in Biology correlates with a high score in Chemistry)

Negative r = negative association (a high score in Biology correlates with a low score in Chemistry)

Question 44

Q

Describe the coefficient of determination

Answer

A

Example

r = 0.662 (positive correlation)
r² = 0.438 (coefficient of determination)

r² = percentage variability – e.g. how much variability in chemistry exam scores is explained by biology scores?

e.g. 0.438 = 43.8% (explained by biology scores). Therefore 56.2% of the variability is due to other factors.

Question 45

Q

Recall which correlation co-efficient to use for parametric and non-parametric data

Answer

A

Parametric data – use Pearson correlation coefficient

Non-parametric data – use Spearman rank correlation coefficient
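The difference between the two coefficients can be seen in a small hand-written sketch: Spearman's coefficient is Pearson's coefficient applied to the ranks of the data. The function names and example values are illustrative assumptions.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 upwards, giving tied values their average rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def spearman_r(x, y):
    """Spearman rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: y always rises with x, but not in a straight line
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r_pearson = pearson_r(x, y)
r_spearman = spearman_r(x, y)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Because y always increases with x, the rank correlation is perfect, while Pearson's r is slightly below 1 because the relationship is curved rather than linear.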

Question 46

Q

Describe purpose of linear regression

Answer

A

Purpose – to predict one variable from another

Follows on from correlation

Example – biology and chemistry scores; how well can we predict chemistry scores from the biology scores?

Involves plotting biology (independent variable) on the x axis against chemistry (dependent variable) on the y axis.

Based on y = mx + c, i.e. draws line of best fit.
The line of best fit minimises the sum of the squared vertical distances between each point and the line (residuals).

y = Chemistry mark, x = Biology mark, m = gradient of line, c = intercept
NB. If data is not linear, do transformation first (log, exponential, square, root – whatever works best)
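A least-squares fit of y = mx + c can be sketched in a few lines. The marks below are made up for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(x, y):
    """Least-squares gradient m and intercept c for y = m*x + c."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    m = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
    c = my - m * mx  # the fitted line passes through (mean x, mean y)
    return m, c

# Hypothetical marks: biology (x, independent) and chemistry (y, dependent)
biology = [40, 50, 60, 70, 80]
chemistry = [45, 52, 61, 68, 79]

m, c = fit_line(biology, chemistry)
predicted = m * 65 + c  # predicted chemistry mark for a biology mark of 65
print(round(m, 2), round(c, 2), round(predicted, 1))
```

The fitted m and c can then be used to predict the dependent variable for any new value of the independent variable within the observed range.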

Question 47

Q

When is multiple linear regression (MLR) used? Recall 2 examples

Answer

A

Used in public health and epidemiology

Example 1 – looking at scores for Biology and Chemistry in male and female students. What amount of variability between Biology and Chemistry scores is due to gender?

Example 2 – we know that height and weight are correlated. How does BP correlate with height?
Do MLR on height, weight and gender against BP
Gives a p value for each parameter when adjusted for the other parameters (confounding variables)
Will tell us if BP is related to height, once weight and gender factors are taken out of the reckoning

Question 48

Q

When is ANOVA (analysis of variance) used? Recall the anaesthetics example

Answer

A

Analysis of Variance (ANOVA) is used to compare the means of a continuous variable across more than two groups

Example - five Local Anaesthetics – looking at time to take effect (single factor)

H₀: LA1 = LA2 = LA3 = LA4 = LA5

H₁: at least one of the above LAs differs from the others

Test generates an F ratio and a p value

If the calculated F ratio exceeds the critical value for F, we can reject the null hypothesis – this is the case in our example, therefore at least one of the mean times to take effect is different from the other means
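The F ratio is the variance between the group means divided by the variance within the groups. The sketch below computes it by hand for three small made-up groups; the onset times are hypothetical, not the data behind the card's example.

```python
def one_way_anova_f(groups):
    """F ratio for a one-way ANOVA on a list of groups of measurements."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    k, n = len(groups), len(all_values)
    # Between-group sum of squares: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: scatter of values around their own group mean
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical onset times (minutes) for three anaesthetics
f_ratio = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_ratio, 2))
```

The F ratio is then compared with the critical value of F for (k − 1, n − k) degrees of freedom, here (2, 6).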

Question 49

Q

What is ANOVA post-hoc analysis?

Answer

A

Post-hoc analysis – how do we find out which mean is different from the others (or more than one mean that is different)?

Option 1

Carry out Tukey test (or similar) to see which group means differ from one another
Tukey statistic becomes the value to compare means e.g. if the difference between the means of LA1 and LA2 is greater than the Tukey statistic, then the difference between those means is significant, and vice versa

Option 2

Carry out two-sample, two-tailed t tests between pairs of means.
Then apply the Bonferroni correction. This adjusts the significance level from p < 0.05 as follows:
- Divide 0.05 by the number of tests being carried out
- E.g. if you do 5 tests, the significance level becomes 0.05/5 = 0.01 – raises the “burden of proof” for any one test carried out (p must now be < 0.01 for each t test)
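The Bonferroni arithmetic is simple enough to sketch directly. The second calculation is an added illustration: if every pair among five groups were compared, there would be 10 tests rather than 5.

```python
from math import comb

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after the Bonferroni correction."""
    return alpha / n_tests

per_test = bonferroni_alpha(0.05, 5)          # the card's example: 5 tests
all_pairs = comb(5, 2)                        # every pairwise comparison of 5 groups
per_pair = bonferroni_alpha(0.05, all_pairs)  # stricter threshold for 10 tests
print(per_test, all_pairs, per_pair)
```

The more tests that are carried out, the smaller each p value must be before the result counts as significant.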

Question 50

Q

Recall which ANOVA method to use when considering one or more factors

Answer

A

If one factor is being considered, use 1-way ANOVA
E.g. time taken for different LAs to take effect

If two factors are being considered, use 2-way ANOVA
E.g. time taken for different LAs to take effect in men and women

If more than two factors are being considered, use multi-way (factorial) ANOVA; MANOVA is used when there is more than one outcome (dependent) variable
E.g. time taken for different LAs to take effect in men and women of different ages, weights, liver function etc

Question 51

Q

What is survival analysis? Give 3 examples of different types

Answer

A

Measures time taken to an event, e.g. death, a significant event such as MI, or failure of a resin-retained bridge

Examples:

Kaplan-Meier survival analysis
Log-rank test – a non-parametric test where the null hypothesis states that the pattern of survival for two groups is the same
Cox proportional hazard model – a regression technique that allows modelling to be carried out in order to adjust for potentially confounding variables

Question 52

Q

Recall survival analysis outputs

Answer

A

Survival analysis output:

Median survival time – at which the probability of survival is 0.5
Survival rate – the proportion of individuals surviving longer than time t
Hazard function – risk of having an event at time t
Survivor plot – survival rate plotted against time
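The survival rate and median survival time can be illustrated with a minimal Kaplan-Meier sketch. The follow-up times are made up, and the helper names are hypothetical. At each event time the running survival probability is multiplied by the proportion of those still at risk who survive it; censored cases leave the risk set without counting as events.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the event happened at times[i], 0 if censored.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored cases both leave the risk set
    return curve

def median_survival(curve):
    """First time at which the survival probability is 0.5 or less."""
    return next((t for t, s in curve if s <= 0.5), None)

# Hypothetical follow-up in months; 0 marks a censored case (lost to follow-up)
curve = kaplan_meier([2, 3, 3, 5, 8, 9], [1, 1, 0, 1, 0, 1])
median = median_survival(curve)
print(curve, median)
```

Plotting the (time, survival) pairs as a step function gives the survivor plot.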

Question 53

Q

Describe relative risk (RR)

Answer

A

Shows the extent of difference between exposed and unexposed groups.

E.g. study results for different toothpastes and effect on gingivitis

Exposed (used Sparkledent) = cure 55%; disease 45%
Unexposed (established toothpaste) = cure 35%; disease 65%

In this case we can calculate RR for those who are cured and those who still have disease

RR cure = 55% (exposed and cured) / 35% (not exposed and cured) = 1.57

RR disease = 45% (exposed and diseased) / 65% (not exposed and diseased) = 0.69
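The two relative risks above can be reproduced with a short sketch; `relative_risk` is an illustrative helper name, and the percentages are written as proportions.

```python
def relative_risk(risk_exposed, risk_unexposed):
    """Risk in the exposed group divided by risk in the unexposed group."""
    return risk_exposed / risk_unexposed

rr_cure = relative_risk(0.55, 0.35)     # cured: Sparkledent vs established
rr_disease = relative_risk(0.45, 0.65)  # still diseased: Sparkledent vs established
print(round(rr_cure, 2), round(rr_disease, 2))
```

An RR above 1 means the outcome is more likely in the exposed group; below 1, less likely.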

Question 54

Q

Describe risk difference

Answer

A

The difference in risk between the exposed and unexposed groups

E.g. study results for different toothpastes and effect on gingivitis

Exposed (used Sparkledent) = cure 55%; disease 45%
Unexposed (established toothpaste) = cure 35%; disease 65%

Risk difference for cure = 55% - 35% = 20%

Risk difference for disease = 45% - 65% = -20%

From the risk difference we can calculate NNT = (1 / risk difference) x 100, with the risk difference as a percentage = 100/20 = 5
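The same calculation can be sketched with the risks written as proportions, in which case NNT is simply 1 divided by the risk difference. The helper names are illustrative.

```python
def risk_difference(risk_exposed, risk_unexposed):
    """Absolute difference in risk between exposed and unexposed groups."""
    return risk_exposed - risk_unexposed

def number_needed_to_treat(risk_diff):
    """NNT from a risk difference given as a proportion (0.20, not 20)."""
    return 1 / abs(risk_diff)

rd_cure = risk_difference(0.55, 0.35)     # cure: 55% vs 35%
rd_disease = risk_difference(0.45, 0.65)  # disease: 45% vs 65%
nnt = number_needed_to_treat(rd_cure)
print(round(rd_cure, 2), round(rd_disease, 2), round(nnt))
```

Using proportions (0.20) or percentages (20% with the x 100 factor) gives the same NNT of 5.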

Question 55

Q

Define number needed to treat (NNT)

Answer

A

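NNT = the number of patients who need to be treated for one additional patient to benefit

NNT = (1 / risk difference) x 100, with the risk difference expressed as a percentage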

Question 56

Q

Describe odds ratio (OR)

Answer

A

Requires the number of cases rather than percentages.

Calculates the odds of being cured against still having the disease.

E.g. study results for different toothpastes and effect on gingivitis

Exposed (used Sparkledent) = cure 55%; disease 45%
Unexposed (established toothpaste) = cure 35%; disease 65%

The disease odds ratio would tell us what is the chance of still having gingivitis depending on whether a person was exposed or not exposed to the new toothpaste.

Odds ratio summarises differences in numbers, whereas relative risk summarises differences in incidence rates
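An odds ratio can be sketched from a 2x2 table of counts. The card gives only percentages, so the counts below assume 100 people in each group; that group size is an assumption made purely for illustration.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: exposed (a, b) vs unexposed (c, d).

    a and c count people with the outcome, b and d those without it.
    """
    return (a / b) / (c / d)

# Assumed counts: 100 people per group, matching the card's percentages
or_cure = odds_ratio(55, 45, 35, 65)     # odds of cure, exposed vs unexposed
or_disease = odds_ratio(45, 55, 65, 35)  # odds of disease, exposed vs unexposed
print(round(or_cure, 2), round(or_disease, 2))
```

The disease odds ratio is the reciprocal of the cure odds ratio, and both sit further from 1 than the corresponding relative risks (1.57 and 0.69).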