stats Flashcards
median
50th percentile
quantifies average
50% of data above median, 50% below
when is data symmetrical with respect to median
when median is equidistant from upper and lower quartile boundaries
when is negative skew seen with respect to median
when median is closer to upper quartile
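The quartile rule in the two cards above can be sketched in a few lines. This is a minimal illustration, not a formal test of skewness; the function name `quartile_skew` and the sample values are assumptions made up for the example.

```python
import statistics

def quartile_skew(values):
    """Compare the median's distance from the lower and upper quartile boundaries."""
    # With n=4, statistics.quantiles returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    lower_gap = median - q1  # distance from lower quartile to median
    upper_gap = q3 - median  # distance from median to upper quartile
    if abs(lower_gap - upper_gap) < 1e-9:
        return "symmetric"
    # Median closer to the upper quartile means a long lower tail: negative skew.
    return "negative skew" if upper_gap < lower_gap else "positive skew"
```

For example, `quartile_skew([1, 5, 8, 9, 10])` has quartiles 5, 8 and 9, so the median sits closer to the upper quartile.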
how do you check symmetry of variables
box and whisker plot
histogram
difference of 99% CI compared to 95% CI
99% CI is wider than the 95% CI and extends beyond it at both extremes
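A rough sketch of why the 99% CI is wider: the interval is the mean plus or minus a multiplier times the standard error, and the multiplier grows with the confidence level. The sketch assumes a normal approximation (a t-distribution is often used for small samples), and `mean_ci` is a hypothetical helper name.

```python
import math
import statistics
from statistics import NormalDist

def mean_ci(values, level=0.95):
    """Normal-approximation confidence interval for a sample mean."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))  # standard error
    # Multiplier is about 1.96 for a 95% CI and about 2.58 for a 99% CI.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * se, mean + z * se
```

Calling `mean_ci(sample, 0.99)` on the same sample gives a lower limit below, and an upper limit above, those from `mean_ci(sample, 0.95)`.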
if p>0.05
no evidence
there may truly be no difference in the mean of the variables
the sample may be too small to detect a difference
smaller standard error means
the estimate of the mean is more precise
2-tailed test
difference in sample means in either direction provides evidence against null hypothesis
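As a small illustration of "either direction", a two-tailed p-value counts both tails of the sampling distribution. The sketch assumes the test statistic is a z-score with a standard normal distribution under the null hypothesis; `two_tailed_p` is a made-up name.

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Two-tailed p-value for a z statistic: area in both tails beyond |z|."""
    # abs() makes a difference in either direction count equally.
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

A z of 1.96 or of -1.96 gives a p-value of roughly 0.05.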
when is mann whitney test used
if variables are discrete/categorical/ordinal
if data do not meet the assumptions of a parametric test
a parametric test makes strong assumptions on..
distribution of data
what does wilcoxon signed-rank test compare
distribution between first and second measurement
assesses whether population mean ranks differ
when is wilcoxon signed-rank test used
matched/paired data
when assumptions of paired t-test do not fit
what does standard error indicate
indicates how far the study estimate would be from the true value in the population if you were to repeat the study multiple times with different samples
p-value if CI excludes the null hypothesis value
p<0.05
there is some evidence
define odds
how likely a binary characteristic is to occur in a single group (proportion with it / proportion without it)
odds ratio
measure of association between exposure and outcome
odds of one group compared to another
reference category
odds of ref category = 1
used to compare odds
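A minimal sketch of odds and the odds ratio, assuming the data arrive as a count of people with the characteristic and a group total. The function names and the counts used in the note are hypothetical.

```python
def odds(events, total):
    """Odds = proportion with the characteristic / proportion without it."""
    p = events / total
    return p / (1 - p)

def odds_ratio(events_group, total_group, events_ref, total_ref):
    """Odds in a group divided by the odds in the reference category."""
    return odds(events_group, total_group) / odds(events_ref, total_ref)
```

With 20 of 100 people affected, the odds are 0.2 / 0.8 = 0.25. Comparing the reference category with itself gives an odds ratio of 1.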
pearsons correlation coefficient
r
quantifies the strength of linear association between two variables
assumptions for pearsons correlation
linear relationship between variables
what does r squared (pearsons) refer to
the proportion of variation in one variable explained by the other variable
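Pearson's r and r squared can be computed directly from the deviations around each mean. This is a plain-Python sketch; `pearson_r` is an illustrative name, and it assumes both lists have the same length and neither is constant.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: strength of linear association, from -1 to +1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r_squared(x, y):
    """Proportion of variation in one variable explained by the other."""
    return pearson_r(x, y) ** 2
```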
what does linear regression describe
the relationship between two quantitative variables
one variable is independent and affects the other, dependent variable
equation for linear regression
outcome = a + b(predictor)
how do you calculate diagnostic accuracy
PPV
NPV
how do you calculate sensitivity
no. who correctly tested +ve for the disease / total no. who have the disease
how do you calculate specificity
no. of people who correctly test -ve / total no. of healthy people
how do you calculate PPV
no. of people who correctly test +ve / total no. of people who test +ve
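The three formulas above, plus NPV by the same logic, can be sketched from the four cells of a 2x2 table. The argument names (`tp`, `fp`, `fn`, `tn` for true/false positives/negatives) and the function name are assumptions for the example.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Accuracy measures from counts of true/false positives and negatives."""
    return {
        "sensitivity": tp / (tp + fn),  # correct +ve / all who have the disease
        "specificity": tn / (tn + fp),  # correct -ve / all healthy people
        "ppv": tp / (tp + fp),          # correct +ve / all who test +ve
        "npv": tn / (tn + fn),          # correct -ve / all who test -ve
    }
```

With hypothetical counts `tp=90, fp=20, fn=10, tn=80`, sensitivity is 90/100 and PPV is 90/110.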
use of normal distribution
determines choice of statistical methods
mean and sd define
normal distribution
define population
full set of units (people) to which the study results will be generalised
usually infinite in size
why might there be uncertainty in the answer provided by the sample data
variability between people
sample is only a subset of the population - not fully representative
what are statistics for
summarising sample data
quantifying uncertainty in results
2 types of statistics
inferential
descriptive
descriptive statistics
describe basic features/characteristics in the sample
inferential statistics
make inferences about relationships in the population using the sample
however can never be 100% certain
e.g. standard error, CI, p-values
sampling distribution
all the different estimates from different samples and their frequencies
effect of sample size on CI
the larger the sample size the narrower the CI
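One way to see the sample-size effect: the CI half-width is a multiplier times sd / sqrt(n), so it shrinks as n grows. The sketch assumes a normal-approximation 95% CI for a mean with a fixed sd; `ci_half_width` is a hypothetical name.

```python
import math

def ci_half_width(sd, n, z=1.96):
    """Half-width of an approximate 95% CI for a mean: z * sd / sqrt(n)."""
    return z * sd / math.sqrt(n)
```

With an sd of 10, sample sizes of 25, 100 and 400 give half-widths of about 3.92, 1.96 and 0.98, so quadrupling the sample size halves the width.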
effect of CI on certainty/uncertainty
the wider the CI, the greater the uncertainty
what do p-values quantify
the extent to which the sample estimate contradicts the null hypothesis
what does PICO stand for
population/patient
intervention
comparison
outcome
what does the t in PICO(T) stand for
type of study design that would work best
why is PICO used
to frame or answer a health related question
when is data paired
if participants are matched on criteria e.g. age/gender before being compared across trial arms
if measurements are taken before and after an intervention
what does paired data analyse
within-pair differences
parametric methods
e.g. t-test, analysis of variance (ANOVA)
make distributional assumptions e.g. Normal
summarise data using means and sd
parametric method for 3 or more independent groups
ANOVA
what does ANOVA stand for
analysis of variance
parametric methods for 3 or more dependent groups
paired test
repeated measures ANOVA
when do you use a non-parametric test
if variables are skewed
small sample size
if sd is different across groups
if the variables are more ordinal than quantitative
when using non-parametric tests you should…
analyse the rank ordering in the data (not actual scores)
only provide p-values (not CIs)
compare entire distribution rather than just means
how do you summarise non-parametric data
IQR
median
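A short sketch of the median and IQR summary, assuming the "inclusive" quartile method from Python's statistics module (other quartile conventions give slightly different values). `median_iqr` is an illustrative name.

```python
import statistics

def median_iqr(values):
    """Summarise skewed or ordinal data by median and interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return median, q3 - q1  # IQR = upper quartile - lower quartile
```

For the made-up values `[1, 2, 3, 4, 100]` the median is 3 and the IQR is 2, while the mean is 22, which shows why the mean is a poor summary when there is an extreme value.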
non-parametric test for 2 independent groups
Mann Whitney
non-parametric test for 2 paired groups
Wilcoxon signed-rank
non-parametric for 3 or more independent groups
Kruskal Wallis
non-parametric for 3 or more paired groups
Friedman
advantages of non-parametric tests
they are always valid for quantitative data
parametric only valid if assumptions are satisfied
disadvantages of non-parametric tests
no CIs
based only on analysis of ranks
no direct inferences about a parameter
what defines a large sample size
sample greater than 50
how do you calculate variance
SD squared
how do you calculate whether the variances are ‘equal’
variance in one group should be no more than 4x the variance of the other group
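The 4x rule of thumb above can be sketched as a simple ratio check. The function name and the `max_ratio` parameter are assumptions for the example; `statistics.variance` returns the sample variance, which equals the sample SD squared.

```python
import statistics

def variances_roughly_equal(group_a, group_b, max_ratio=4):
    """Rule of thumb: larger variance no more than max_ratio times the smaller."""
    var_a = statistics.variance(group_a)  # same as statistics.stdev(group_a) ** 2
    var_b = statistics.variance(group_b)
    larger, smaller = max(var_a, var_b), min(var_a, var_b)
    return larger <= max_ratio * smaller
```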
how can you compare CIs between groups
calculate a single CI for the difference between groups
effect of proportion on odds
the higher the proportion the higher the odds
how do you calculate proportion
no. of participants in category of interest / total no. of participants
relationship between exposure variable and outcome variable
the exposure variable is the potential cause of the outcome variable
tests for binary hypothesis testing
chi-squared (large samples)
fisher’s exact (small samples)
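For a 2x2 table the chi-squared statistic can be computed by hand from observed and expected counts. This is a sketch without a continuity correction, and it assumes every expected count is non-zero; the function name and the table layout `[[a, b], [c, d]]` are assumptions. The p-value line uses the fact that, with 1 degree of freedom, the upper-tail probability equals `erfc(sqrt(chi2 / 2))`.

```python
import math

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    observed = [a, b, c, d]
    row_totals = [a + b, a + b, c + d, c + d]
    col_totals = [a + c, b + d, a + c, b + d]
    chi2 = 0.0
    for obs, row, col in zip(observed, row_totals, col_totals):
        expected = row * col / n  # count expected if there were no association
        chi2 += (obs - expected) ** 2 / expected
    # 1 degree of freedom for a 2x2 table.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

With hypothetical counts of 30 events out of 100 in one group and 10 out of 100 in the other, `chi_squared_2x2(30, 70, 10, 90)` gives a statistic of 12.5 and a p-value well below 0.05.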
risk difference of 0
no risk difference
groups equally likely to have the disease
how do you calculate risk difference
proportion in group A - proportion in group B
how do you calculate risk ratio
proportion in group A / proportion in group B
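Both measures follow directly from the two group proportions. A minimal sketch, assuming each group is given as an event count and a group size; the function names are made up for the example.

```python
def risk_difference(events_a, n_a, events_b, n_b):
    """Proportion in group A minus proportion in group B (0 means no difference)."""
    return events_a / n_a - events_b / n_b

def risk_ratio(events_a, n_a, events_b, n_b):
    """Proportion in group A divided by proportion in group B (1 means no difference)."""
    return (events_a / n_a) / (events_b / n_b)
```

With hypothetical counts of 30/100 in group A and 10/100 in group B, the risk difference is about 0.2 and the risk ratio is about 3.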
what do risk ratio and odds ratio quantify
the strength of association between the intervention and binary variable
risk ratio = 1
no difference in risk between two groups
NNT stands for
number needed to treat
how do you calculate NNT
1 / risk difference
what is NNT
the number of people that need to receive intervention before 1 person benefits from it
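A small sketch of the NNT formula, assuming the two risks are supplied as proportions. The function and argument names are hypothetical, and the result is left unrounded (in practice NNT is usually reported rounded up to a whole person).

```python
def number_needed_to_treat(risk_control, risk_treated):
    """NNT = 1 / risk difference between control and treated groups."""
    rd = risk_control - risk_treated
    if rd == 0:
        # No risk difference: no finite number of people would show a benefit.
        raise ValueError("risk difference is 0, NNT is undefined")
    return 1 / abs(rd)
```

For example, risks of 0.30 without and 0.10 with the intervention give a risk difference of 0.20 and an NNT of about 5.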
what is NNT better for
quantifying the impact of an intervention in a given population
what does NNT do
measures the effectiveness of an intervention
based on risk difference
what is correlation
the association between two variables
graphical description of correlation
scatter plot
outcome = y-axis
predictor = x-axis
numerical description of correlation
correlation coefficient
pearson’s = linear
spearmans = non-linear
assumptions for spearmans correlation coefficient
non-linear correlation
e.g. curved line
must be ‘monotonic’ - the slope is either never -ve or never +ve
e.g. graph cannot be U-shaped
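Spearman's coefficient can be sketched as Pearson's coefficient applied to the ranks of the data, which is why it handles curved but monotonic relationships. This sketch assumes there are no tied values (ties need average ranks, which are not handled here); all function names are illustrative.

```python
import math

def ranks(values):
    """Rank each value from 1 (smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def pearson_r(x, y):
    """Pearson correlation computed from deviations around the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))
```

For a curved but always-increasing relationship such as y = x cubed, Spearman's coefficient is 1 while Pearson's is below 1. For a U-shaped relationship Spearman's coefficient is weak, because the relationship is not monotonic.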
if r squared = 1
then all the variation in one variable is explained by the other variable
what is the predictor
the independent variable
the explanatory variable - potential cause of the outcome variable
what is the least squares regression line
line that makes the vertical distance from the data points to the regression line as small as possible
what is a residual (e)
the vertical distance between the observed data point and the regression line (predicted value)
equation for calculating errors in prediction
outcome = a + b(predictor) + e
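The least squares line and its residuals can be sketched in plain Python. Here `a` is the intercept and `b` the slope, matching outcome = a + b(predictor) + e; the function names are assumptions for the example.

```python
def least_squares(predictor, outcome):
    """Intercept a and slope b that minimise the sum of squared vertical distances."""
    n = len(predictor)
    mean_x = sum(predictor) / n
    mean_y = sum(outcome) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, outcome))
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def residuals(predictor, outcome, a, b):
    """e = observed outcome - predicted outcome, for each data point."""
    return [y - (a + b * x) for x, y in zip(predictor, outcome)]
```

For points lying exactly on a line, every residual is 0. For scattered points, the residuals from a least squares fit sum to 0.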
are most biological variables continuous?
yes
e.g. blood pressure
why is it impossible to choose a cut-off line to correctly classify all subjects to a disease status
most distributions of diagnostic test scores will overlap
what are the probability-based estimates of accuracy
specificity
sensitivity
PPV
NPV
what factors affect sensitivity of a test
the severity of the disease
assumption for sensitivity of a test
populations have similar disease severity
what factors affect specificity of tests
if symptoms show in non-diseased patients, specificity is reduced
what does PPV quantify
the likelihood that somebody who tests +ve actually has the disease
how does the prevalence of a disease affect the PPV
if a disease has a greater prevalence (is more common) then the PPV will increase
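The prevalence effect can be sketched with Bayes' rule: PPV is the share of all positive results that are true positives. The function name and the example figures are assumptions, and sensitivity and specificity are treated as fixed across populations.

```python
def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """PPV = true positives / (true positives + false positives), as proportions."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)
```

With sensitivity and specificity both 0.9, a prevalence of 1% gives a PPV of about 0.08, while a prevalence of 20% gives about 0.69.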