Statistics Flashcards
What is quantitative data?
Numerical data
- Discrete (whole numbers), e.g. number of children
- Continuous (usually a measurement)
Give an example of nominal data.
Blood group, gender
Categories have no logical order
A type of categorical data
Name the types of qualitative data.
Categorical data
- Nominal: categories have no logical order
- Ordinal: categories have a natural order
If data is negatively skewed (left skewed), what is the order of the mean, median and mode (from low to high, left to right)?
Mean, Median, Mode
Peak of the graph is further to the right, with the long tail to the left
If data is positively skewed (right skewed), what is the order of the mean, median and mode (from low to high, left to right)?
Mode, median, mean
What is the range?
Maximum - minimum.
Poor measure of spread as affected by outliers and dependent on sample size
What is the inter-quartile range?
upper quartile - lower quartile
Better than range as not influenced by outliers
3 measures- lower quartile, median and upper quartile
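To see why the interquartile range is less influenced by outliers than the range, both can be computed with and without an extreme value. This is a minimal sketch with made-up data; the helper name `spread_summary` is hypothetical, and statistical packages use slightly different quartile conventions, so exact quartile values can differ from those of `statistics.quantiles`.

```python
import statistics

def spread_summary(values):
    """Return (range, interquartile range) for a list of numbers."""
    # quantiles(n=4) returns the lower quartile, median and upper quartile
    q1, median, q3 = statistics.quantiles(values, n=4)
    return max(values) - min(values), q3 - q1

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]           # made-up data
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]   # same data, one extreme value

range_clean, iqr_clean = spread_summary(clean)
range_outlier, iqr_outlier = spread_summary(with_outlier)
print("range:", range_clean, "->", range_outlier)
print("IQR:  ", iqr_clean, "->", iqr_outlier)
```

In this example the range jumps from 10 to 99 while the interquartile range does not change.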
What is variance?
Calculate deviations = difference between each observation and the mean of the data.
Square these deviations so negatives become positive
Average the squared deviations by dividing by n-1 (one degree of freedom is lost because the sample mean has already been estimated from the same data)
Square root of the variance = standard deviation
Influenced by outliers.
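The steps above can be sketched in Python. The observations are made up for illustration and the function name `sample_variance` is hypothetical; the standard library's `statistics.variance` uses the same n-1 divisor and is included as a cross-check.

```python
import math
import statistics

def sample_variance(values):
    """Sample variance using the n-1 divisor."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]   # difference from the mean
    squared = [d ** 2 for d in deviations]    # negatives become positive
    return sum(squared) / (n - 1)             # average over n-1

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations
variance = sample_variance(data)
sd = math.sqrt(variance)         # standard deviation = square root of variance

print(round(variance, 3), round(sd, 3))
print(round(statistics.variance(data), 3))  # cross-check with the standard library
```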
How do you calculate standard deviation from variance?
Square root of variance
What would you use to summarise symmetrical data?
Mean
Standard deviation
What would you use to summarise skewed data?
Median
Interquartile range
How would you summarise categorical data?
Use number (%)
What information does a box and whisker plot give you?
Median
Interquartile range
Range
How can you summarise categorical data in a chart?
Pie chart
Bar chart
What are the mean and SD of the standard normal distribution?
Mean = 0, SD = 1
What is the reference range and when can it be used and what does it measure?
Used in NORMAL DISTRIBUTION
Mean +/- 1.96 SD (often rounded to +/- 2 SD) contains 95% of the data
Measure of spread of the data
In normal distribution data how much of data is included in mean +/- 1SD, +/- 2SD and +/- 3SD?
Mean +/- 1 SD = 68% of data included
Mean +/- 2 SD = 95% of data included = reference range
Mean +/- 3 SD = 99.7% of data included
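These proportions can be checked numerically. For a normal distribution, the proportion of values within k SDs of the mean is erf(k / √2); the sketch below uses `math.erf` from the standard library, and the function name `coverage_within` is hypothetical.

```python
import math

def coverage_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 1.96, 2, 3):
    print(f"mean +/- {k} SD covers {coverage_within(k):.1%} of the data")
```

This gives roughly 68%, 95% and 99.7% for 1, 2 and 3 SDs, and shows that 1.96 SDs is the multiplier that gives almost exactly 95%.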
What is the difference between the 95% reference range and 95% confidence interval?
95% reference range (or normal range)
- Mean +/- 2SD
- Measures SPREAD of data
95% confidence interval
- Mean +/- 1.96 standard errors (often rounded to 2)
- Measures the PRECISION of a sample estimate (95% of intervals calculated this way from repeated samples would contain the true population value)
How can you make positively skewed data more symmetric?
Calculate
- Log (x)
- 1/x
- square root x
More difficult with negatively skewed data
How can you check if a data set is normally distributed?
- By eye - draw a histogram
- Test for normality, e.g. Kolmogorov-Smirnov test or Shapiro-Wilk test
If p < 0.05 conclude not normal
If p > 0.05 no evidence against normality
But small samples have insufficient power to detect deviations from normality, and for large samples normality is usually less important
What is bias and how can you avoid it?
Bias: when the sample is selected in such a way that even with a very large sample you will not get the true answer
Avoid with a random sample
What is precision?
A sample estimate is precise if different samples of the same size, selected in the same way would give answers which are close together
What is a distribution defined by?
- Centre (mean)
- Spread (SD)
- Shape (e.g. normally distributed)
When will sample means be normally distributed?
- When the underlying data are normally distributed, or
- When the samples are large (in which case it does not matter whether the data are normal or not: Central Limit Theorem)
What is the standard error of a distribution of sample means?
The SE of a distribution of sample means is a measure of the spread of those means.
It is the standard deviation of the sampling distribution
MEASURES PRECISION OF THE SAMPLE MEAN
If there is a narrow spread of data, will the standard error be small or big?
Small - all means close to the true mean - precise estimate
As sample size increases what happens to the standard error of means?
Gets smaller
How do we calculate standard error of a distribution of sample means?
SE = σ / √N
σ = SD of the population observations
N = sample size
However, we don't have data from the whole population, so we have to make do with the SD (s) of a single sample to estimate σ. As long as the sample is large this should be a good estimate.
Estimated SE = s / √N
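The estimated standard error can be sketched in Python as follows. The sample values are made up, and the function names `se_from_sd` and `estimated_se` are hypothetical. The second print shows the standard error halving when the sample size is quadrupled for the same SD.

```python
import math
import statistics

def se_from_sd(s, n):
    """Standard error of the mean from a standard deviation s and sample size n."""
    return s / math.sqrt(n)

def estimated_se(sample):
    """Estimate the SE of the mean using the sample SD (s) in place of sigma."""
    return se_from_sd(statistics.stdev(sample), len(sample))

sample = [4.1, 5.0, 5.6, 4.8, 5.3, 4.7, 5.1, 5.4]  # made-up measurements
print(round(estimated_se(sample), 3))

# Same SD, larger sample: the standard error gets smaller
print(se_from_sd(10, 25), se_from_sd(10, 100))  # 2.0 1.0
```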
Other than standard error of a distribution of sample means, what other types of sample estimate can you use?
- Sample proportion
- Difference between 2 means
- Difference between 2 proportions
- Relative risk/odds ratio
- Regression coefficients
They all have different standard error formulae.
Calculate the standard error of a proportion (categorical data) if 20% of 100 people have asthma.
SE(p) = √(p × (1 - p) / n)
SE(0.2) = √(0.2 × 0.8 / 100) = 0.04
68% CI for the proportion with asthma: 0.2 +/- 0.04 (i.e. +/- 1 SE)
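The asthma calculation can be reproduced with a few lines of Python. The function name `se_proportion` is hypothetical; the 95% interval uses the 1.96 multiplier from the normal distribution.

```python
import math

def se_proportion(p, n):
    """Standard error of a sample proportion p from a sample of size n."""
    return math.sqrt(p * (1 - p) / n)

p, n = 0.2, 100                 # 20% of 100 people have asthma
se = se_proportion(p, n)
ci68 = (p - se, p + se)                  # +/- 1 SE
ci95 = (p - 1.96 * se, p + 1.96 * se)    # +/- 1.96 SE

print(round(se, 4))
print([round(x, 4) for x in ci68])
print([round(x, 4) for x in ci95])
```

The 95% interval runs from roughly 0.12 to 0.28, noticeably wider than the 68% interval.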
What is a confidence interval and how do you calculate it?
An interval around a sample estimate that is likely to contain the true population value: for a 95% CI, 95% of intervals calculated this way from repeated samples would contain the true value
95% CI = sample mean +/- 1.96 SEs
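As a minimal sketch, a 95% confidence interval for a mean can be computed from summary statistics. The figures (mean 120, SD 15, n = 100) are made up, the function name `ci95_mean` is hypothetical, and the calculation assumes the sample is large enough for the normal multiplier 1.96 to apply.

```python
import math

def ci95_mean(sample_mean, s, n):
    """95% CI for a mean: sample mean +/- 1.96 standard errors."""
    se = s / math.sqrt(n)       # estimated standard error
    margin = 1.96 * se
    return sample_mean - margin, sample_mean + margin

lower, upper = ci95_mean(120, 15, 100)  # made-up summary statistics
print(round(lower, 2), round(upper, 2))
```

Here the standard error is 1.5, so the interval is 120 +/- 2.94.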
When would you check whether 0 lies within the CI, and when would you check whether 1 lies within the CI?
If looking at a difference between means or proportions: does 0 lie in the CI?
If looking at a relative risk or odds ratio: does 1 lie in the CI?
If it does, the result is not statistically significant
Name some types of intervention studies?
RCT
Non-randomised clinical intervention studies
Experimental lab studies
Name some types of observational studies?
- Cohort studies
- Case-control studies
- Cross-sectional study
- Ecological study
- Case study
Describe a cohort study.
Usually disease free cohort followed over time and subsequent disease status recorded.
Usually prospective
Accurate
What are the advantages and disadvantages of a cohort study?
Accurate
Selection bias avoided
BUT: long and expensive, loss to follow-up, and inappropriate for rare diseases
Describe a case control study.
Cases who already have the disease are compared to disease free controls
Retrospective
What are the advantages and disadvantages of a case control study?
Quick and cheap
Suitable for rare diseases
BUT…
- subject to recall bias, selection bias, assessment bias
- relative timings can be difficult to ascertain
- Not suitable for rare exposures
- relative risks cannot be directly calculated
How is the association between a risk factor and disease outcome commonly summarised?
Relative risk
Odds ratio
If a relative risk is >1 what does that mean?
RR>1 = increased risk
RR <1 = decreased risk
When can’t you use relative risk and what can you use instead?
Case control study: RR cannot be used because you have chosen how many people with and without the disease to include, so the risk of disease cannot be estimated from the sample.
Use odds ratio instead.
How do you calculate relative risk?
                      Outcome
Risk factor    Present    Absent    Total
Present        a          b         a+b
Absent         c          d         c+d
Total          a+c        b+d
RR = (a / (a+b)) / (c / (c+d))
(Number with the risk factor and the disease / total number with the risk factor) divided by (number without the risk factor but with the disease / total number without the risk factor)
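A short sketch of the relative risk calculation, using the cell labels a, b, c, d from the table above. The counts are made up and the function name `relative_risk` is hypothetical.

```python
def relative_risk(a, b, c, d):
    """Relative risk from a 2x2 table.

    a: risk factor present, outcome present
    b: risk factor present, outcome absent
    c: risk factor absent,  outcome present
    d: risk factor absent,  outcome absent
    """
    risk_exposed = a / (a + b)      # risk of outcome with the risk factor
    risk_unexposed = c / (c + d)    # risk of outcome without the risk factor
    return risk_exposed / risk_unexposed

# Made-up cohort: 20 of 100 exposed and 10 of 100 unexposed develop the disease
rr = relative_risk(20, 80, 10, 90)
print(rr)  # 2.0, i.e. the risk is doubled in those with the risk factor
```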
What is odds ratio and how do you calculate it?
                      Outcome
Risk factor    Present    Absent    Total
Present        a          b         a+b
Absent         c          d         c+d
Total          a+c        b+d
Odds of having the risk factor among the cases divided by the odds of having the risk factor among the controls
Odds ratio = (a/c) / (b/d) = ad / bc
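The odds ratio can be sketched the same way, with a, b, c, d labelled as in the table above. The counts are made up and the function name `odds_ratio` is hypothetical; the cross-product form ad / bc is algebraically the same as (a/c) / (b/d).

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table.

    a: risk factor present, outcome present (exposed cases)
    b: risk factor present, outcome absent  (exposed controls)
    c: risk factor absent,  outcome present (unexposed cases)
    d: risk factor absent,  outcome absent  (unexposed controls)
    """
    return (a * d) / (b * c)   # equivalent to (a / c) / (b / d)

# Made-up case control data
a, b, c, d = 20, 80, 10, 90
print(odds_ratio(a, b, c, d))  # 2.25
```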
What is the null hypothesis?
Statement that there is no difference between groups in the population from which the sample has come.
ALWAYS about the population: it would not make sense to hypothesise about the sample as we already know about that
What is the p value?
Probability of obtaining sample data showing a difference as large as, or larger than, that observed if there is really no difference in the population from which the samples came, i.e. if the null hypothesis is true
What does a p value <0.05 mean?
If the null hypothesis were true, a difference this large would occur in fewer than 5% of samples, so it is unlikely that the sample came from a population where the null hypothesis is true.
What does a p value >0.05 mean?
It is possible that the sample could have come from a population where the null hypothesis is true -> insufficient evidence to reject the null hypothesis (NEVER say we accept the null hypothesis)
Choosing the right statistical test.
Are you comparing means or percentages when looking at numerical, categorical and ordinal data?
Variable is numerical - you will be comparing means
Variable is categorical you will be comparing percentages
Variable is ordinal- you may use a specific test for ordinal data or you may treat the variable as categorical
If you are comparing numerical data and want to compare paired groups then what is the right statistical test?
Paired T test
- Paired differences are normally distributed or large sample size (>100 pairs)
Wilcoxon’s signed ranks test
- does not need normal distribution
- NOT appropriate for ordinal data (it ranks the sizes of the paired differences, which are not meaningful for ordered categories)
What is a paired group?
Two types:
- when the same person provides 2 values (eg crossover trial)
- when each person in one group has a matched control in another group (eg case control studies)
When choosing the right statistical significance test, what questions might you ask?
- Are you comparing means or percentages?
- How many groups are you comparing?
- Are the groups paired or independent?
- Are the test assumptions met (sample size, distributions, equal variances)?
When can you use the independent samples t-test and what assumptions does it make?
Comparing means of 2 independent groups
Data normally distributed in each group (or >50 in each group)
Equal variances in the two groups
What can you look out for which might show data is skewed?
Skewed data often summarised using medians instead of means
If mean - 2SDs takes you below minimum possible value (often zero), or mean +2SDs takes you above the max possible value then the data cannot be normally distributed.
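The mean +/- 2 SD check is easy to automate when reading a paper's summary table. This is a minimal sketch; the function name `normality_implausible` and the example figures are hypothetical.

```python
def normality_implausible(mean, sd, minimum=None, maximum=None):
    """True if mean +/- 2 SD falls outside the possible range of the variable."""
    lower = mean - 2 * sd
    upper = mean + 2 * sd
    below_min = minimum is not None and lower < minimum
    above_max = maximum is not None and upper > maximum
    return below_min or above_max

# Made-up example: length of hospital stay, mean 3 days, SD 4 days, cannot be negative
print(normality_implausible(3, 4, minimum=0))     # True: 3 - 8 = -5 is below zero
# Made-up example: height in cm, mean 170, SD 10
print(normality_implausible(170, 10, minimum=0))  # False
```

A result of False does not show the data are normal; it only means this particular warning sign is absent.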
What does equal variances mean?
The groups being compared have the same spread around their means.
Groups can be normally distributed but have different variances: the bell is flatter or thinner but still symmetrical.
How can you test for equal variances?
Do a statistical test, e.g. Levene's test
- if p < 0.05 conclude variances are not equal; if p > 0.05 no evidence against equal variances.
- BUT if the sample size is small the test is unlikely to have sufficient power, and if it is large the test is likely to pick up unimportant differences.
Could compare the standard deviations instead (differing by less than a factor of 1.5 is ok)
If variances are not equal then some packages perform a separate-variances version of the t-test
Or could try transforming the data (if positively skewed, taking logs)
When would you use the Mann-Whitney test?
If the assumptions for the independent samples t-test are not met,
i.e. when a non-parametric test is needed
Can be used for numerical or ordinal data
Less powerful than the t-test
What does the paired T test assume?
Paired differences are normally distributed (raw data can be skewed but the paired differences should be normally distributed)
If there are >100 pairs this assumption can be dropped.
When would you use the Wilcoxon’s signed ranks test?
Non parametric paired data
Generally less powerful than the paired t test
NOT ordinal data
When would you use the ANOVA/analysis of variance test?
Normally distributed with equal variances
Used for >2 groups
P >0.05 no evidence of real difference between any pair of groups
p<0.05 there is evidence of a real difference between either some or all of the groups
Does NOT tell you which group
Needs follow-up with post hoc tests, which tell you which groups differ.
- compare each pair of groups
- automatically make an adjustment for multiple testing
Many tests available, including Scheffé and Bonferroni
When would you use the Kruskal-Wallis test?
Non- parametric test
For > 2 groups
less powerful than ANOVA
Can be used for ordinal data
When can you use the chi squared test?
Comparing percentages- categorical data
Between 2 independent groups
Calculate observed (O) and expected (E) frequencies for each cell; the test statistic is the sum over all cells of (O - E)^2 / E
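The chi-squared calculation for a 2x2 table can be sketched as below. The counts are made up and the function name `chi_squared_2x2` is hypothetical. Expected frequencies are taken as row total × column total / grand total, no continuity correction is applied, and the p value formula used here is only valid for 1 degree of freedom (a 2x2 table).

```python
import math

def chi_squared_2x2(table):
    """Chi-squared statistic and p value for a 2x2 table of observed counts."""
    (a, b), (c, d) = table
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    total = a + b + c + d

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    # With 1 degree of freedom the statistic is a squared standard normal value
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Made-up data: 20/100 with the outcome in group 1 vs 10/100 in group 2
statistic, p_value = chi_squared_2x2([[20, 80], [10, 90]])
print(round(statistic, 3), round(p_value, 3))
```

For these counts the statistic is about 3.92 and the p value is just under 0.05.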
When can’t you use the chi squared test? What would you use instead?
- any cell has an expected frequency < 1
- more than 20% of cells have an expected frequency < 5
Then use Fisher's exact test (no minimum sample size)
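The two rules above can be checked directly from a table of observed counts. This is a minimal sketch; the function names `expected_frequencies` and `chi_squared_is_appropriate` are hypothetical and the example tables are made up.

```python
def expected_frequencies(table):
    """Expected counts: row total x column total / grand total for each cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_squared_is_appropriate(table):
    """False if any expected count is < 1 or more than 20% of cells are < 5."""
    cells = [e for row in expected_frequencies(table) for e in row]
    if any(e < 1 for e in cells):
        return False
    small = sum(1 for e in cells if e < 5)
    return small / len(cells) <= 0.20

print(chi_squared_is_appropriate([[20, 80], [10, 90]]))  # True: all expected counts >= 5
print(chi_squared_is_appropriate([[3, 7], [2, 8]]))      # False: half the cells expect 2.5
```

When the function returns False, Fisher's exact test is the alternative named above.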
When would you use McNemar’s test?
Paired groups comparing the percentages
Only valid if the number of discordant pairs is at least 10
When would you use Chi-squared test for trend?
Ordinal variable- ordered groups
Large sample >30
Percentages increase/decrease linearly across groups.
What is the difference in the null and alternative hypotheses for a one sided and a two sided test?
2 sided test: the difference can be in either direction
Null hypothesis: no difference between groups
Alternative hypothesis: there is a difference between groups, which could be in either direction
1 sided test
Null hypothesis: no difference between groups, or a difference in one direction
Alternative hypothesis: a difference in the other direction
More likely to get a statistically significant result with a 1 sided test, as the whole 5% is in one tail.
When would you use a one sided test?
Non-inferiority trial
Should not be used simply because a true difference in one direction is thought to be very unlikely
What is a significance level?
α = significance level of test
Usually set at 0.05
A result with p < α (usually p < 0.05) is called statistically significant