Statistics Basics Flashcards
Simple random sampling
Randomly selecting from everybody in the population, so that each individual has an equal chance of being chosen.
Stratified sampling
Dividing the population into groups (strata) and sampling from each in proportion to its share of the overall population. The strata are usually large.
Systematic sampling
Selecting every nth individual from a list. The attribute being studied should be randomly distributed through the list, so that the fixed interval does not introduce bias.
Convenience sampling
Selection based on ease of access. E.g. people physically closer to you are more likely to be picked than someone in the back row whom you can barely see.
Cluster random sampling
Divides the population into coherent areas (clusters), then randomly selects whole areas to assess.
Snowball sampling
Finding people who are suitable for the study and then asking them to refer others they know who would also be suitable for the study.
What is probability sampling
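Sampling in which every individual has a known, non-zero probability of being selected (the same probability in simple random sampling), so the sample tends to be representative of the population.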
For symmetric data we use…
mean and SD
For asymmetric data we use…
median and IQR
When is a z score used
When the values in question do not fall exactly on the reference points of the 68-95-99.7 rule (1, 2 or 3 SDs from the mean).
Steps of a basic z score
- Calculate the z scores.
- Look each one up in the z table to find the corresponding area above that value.
- Use the overlap of the areas (add or subtract them) to isolate only the desired area.
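The steps above can be sketched in Python, using the standard library's `statistics.NormalDist` in place of a printed z table. The mean, SD and cut-off values are made-up illustration numbers, not taken from any data set.

```python
from statistics import NormalDist

# Hypothetical example: values are normally distributed with mean 100 and SD 15.
mean, sd = 100.0, 15.0
lower, upper = 90.0, 120.0  # we want the proportion of values between these

# Step 1: calculate the z scores.
z_lower = (lower - mean) / sd
z_upper = (upper - mean) / sd

# Step 2: find the area above each z score (what the table would give).
standard_normal = NormalDist()  # mean 0, SD 1
area_above_lower = 1 - standard_normal.cdf(z_lower)
area_above_upper = 1 - standard_normal.cdf(z_upper)

# Step 3: subtract the overlapping area to keep only the area between the two values.
area_between = area_above_lower - area_above_upper
print(round(z_lower, 2), round(z_upper, 2), round(area_between, 3))
```

Neither cut-off sits on a whole number of SDs here, which is exactly the situation where the 68 rule alone is not enough.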
What is a t distribution
Like the normal distribution, but its shape depends on the degrees of freedom.
Flatter peak and longer (heavier) tails than the normal distribution.
As the degrees of freedom (i.e. the sample size) increase, the t distribution becomes more like the normal distribution.
What is degrees of freedom
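The number of data values that are free to vary once the statistic being estimated (e.g. the mean) is fixed. For a single sample this is n - 1.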
What is the central limit theorem
As n, the size of each sample, increases, the distribution of the sample means becomes less skewed and closer to a normal distribution.
The more sample means we plot on the sampling distribution graph, the more clearly it looks normally distributed, even if the original data are skewed.
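A small simulation can illustrate the idea. This is only a sketch under assumed settings: the skewed population is taken to be exponential, and the helper names `skewness` and `sample_means` are made up for this example.

```python
import random
from statistics import mean

def skewness(values):
    """Sample skewness: near 0 for symmetric data, positive for a long right tail."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def sample_means(sample_size, n_samples, rng):
    """Draw n_samples samples from a skewed (exponential) population; return their means."""
    return [
        mean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

rng = random.Random(42)  # fixed seed so the run is repeatable

# Means of samples of size 1 are just the raw, skewed data.
skew_n1 = skewness(sample_means(1, 2000, rng))
# Means of samples of size 50 should be much closer to symmetric.
skew_n50 = skewness(sample_means(50, 2000, rng))
print(round(skew_n1, 2), round(skew_n50, 2))
```

The second skewness value should come out much closer to 0 than the first, showing the sample means behaving more normally than the raw data.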
What is standard error
The standard deviation of the sampling distribution of a statistic (e.g. the sample mean).
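For the sample mean, the standard error is usually estimated as the sample SD divided by the square root of n. A minimal sketch, using a made-up sample:

```python
from math import sqrt
from statistics import stdev

# Hypothetical sample of 8 measurements.
sample = [4.1, 5.0, 4.7, 5.3, 4.9, 5.6, 4.4, 5.2]

sd = stdev(sample)           # sample standard deviation (n - 1 denominator)
se = sd / sqrt(len(sample))  # standard error of the mean
print(round(sd, 3), round(se, 3))
```

The SE is smaller than the SD because sample means vary less than individual values do.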
Why is hypothesis testing used
To assess whether a result seen in a sample could be due to chance alone, or whether it reflects the population the sample came from.
WE ONLY TALK ABOUT THE NULL, NOT THE ALTERNATE (we reject or fail to reject H0).
What is a type 1 error
Rejecting the null hypothesis even though it is true (a false positive).
The apparent effect was actually produced by chance.
The significance level (alpha, usually 0.05) is the risk of this error we accept; we compare the p value against it.
What is a type 2 error
Failing to reject (keeping) the null hypothesis even though it is false (a false negative).
What does a p value represent
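The probability of getting a result at least as extreme as the one observed if the null hypothesis is true, i.e. how easily chance alone could produce it.
It is NOT the probability of a type 1 error; that risk is set in advance by the significance level (alpha).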
p>0.05 =
the result could easily be due to chance,
rejecting H0 would carry too high a risk of a type 1 error,
insufficient evidence to reject H0.
p<0.05 =
the result would be unusual if only chance were involved,
risk of a type 1 error is within the accepted 5% level,
statistically significant, so we reject H0.
What does a 95% confidence interval represent
The interval of values that we are 95% confident contains the true population parameter (not just the statistic from the sample used).
If the confidence interval contains the value that represents the null hypothesis (e.g. 0 for a difference, 1 for a ratio), we cannot reject the null hypothesis.
How do you find a t multiplier
From the t table, using the degrees of freedom and the confidence level.
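As an illustration of where the t multiplier is used, here is a sketch of a 95% confidence interval for a mean, calculated as mean ± t multiplier × standard error. The sample is made up, and the multiplier 2.365 is the usual t table entry for 95% confidence with 7 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of 8 measurements.
sample = [4.1, 5.0, 4.7, 5.3, 4.9, 5.6, 4.4, 5.2]

n = len(sample)
degrees_of_freedom = n - 1

# t multiplier read from a t table for 95% confidence and 7 degrees of freedom.
t_multiplier = 2.365

standard_error = stdev(sample) / sqrt(n)
margin = t_multiplier * standard_error
ci_lower = mean(sample) - margin
ci_upper = mean(sample) + margin
print(round(ci_lower, 2), round(ci_upper, 2))
```

With a different sample size, the multiplier has to be looked up again for the new degrees of freedom.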
Ways to test equal variance
Graph and compare the dispersion: similar spread = equal variance
Variance ratio: (larger SD)^2 / (smaller SD)^2; ratio >= 2 suggests unequal variance
Levene's test (H0: variances are equal, H1: variances are unequal): p<0.05 = unequal variance
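The variance ratio rule of thumb can be sketched as follows. The two groups are made-up numbers, and this covers only the ratio check, not Levene's test.

```python
from statistics import stdev

# Hypothetical measurements from two groups.
group_a = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0]
group_b = [10.0, 20.0, 8.0, 22.0, 14.0, 17.0]

sd_a, sd_b = stdev(group_a), stdev(group_b)
larger, smaller = max(sd_a, sd_b), min(sd_a, sd_b)

# Rule of thumb from the card: (larger SD)^2 / (smaller SD)^2 >= 2 suggests unequal variances.
variance_ratio = larger ** 2 / smaller ** 2
variances_look_unequal = variance_ratio >= 2
print(round(variance_ratio, 2), variances_look_unequal)
```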
Relative Risk (RR)
THE RISK OF GETTING THE DISEASE
Compares the risk of getting the disease in the exposed group with the risk in the unexposed group
= Cumulative incidence (exposed) ÷ Cumulative incidence (unexposed)
= Incidence rate (exposed) ÷ Incidence rate (unexposed)
Convert RR to %
Increase (RR > 1) = (RR - 1) x 100
Decrease (RR < 1) = (1 - RR) x 100
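A short sketch of the relative risk calculation and the percentage conversion. The cohort counts are invented for illustration and `rr_to_percent` is a hypothetical helper name.

```python
# Hypothetical cohort counts (not real data).
exposed_cases, exposed_total = 30, 200
unexposed_cases, unexposed_total = 15, 300

# Cumulative incidence = new cases / people at risk in that group.
ci_exposed = exposed_cases / exposed_total
ci_unexposed = unexposed_cases / unexposed_total

relative_risk = ci_exposed / ci_unexposed

def rr_to_percent(rr):
    """Express an RR as a percentage increase (RR >= 1) or decrease (RR < 1) in risk."""
    if rr >= 1:
        return f"{(rr - 1) * 100:.0f}% increase"
    return f"{(1 - rr) * 100:.0f}% decrease"

print(round(relative_risk, 2), rr_to_percent(relative_risk), rr_to_percent(0.75))
```

Note the brackets: the subtraction has to happen before multiplying by 100.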
Attributable Risk (AR)
Shows the amount of disease in the exposed group that is due to the exposure itself
= Cumulative incidence (exposed) – Cumulative incidence (unexposed)
% = (risk in exposed group – risk in unexposed group) / risk in exposed group x 100
Population Attributable Risk (PAR)
= Incidence in general population – Incidence in unexposed group
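The attributable risk formulas on the two cards above can be sketched together. All three incidence values are invented for illustration.

```python
# Hypothetical cumulative incidences (proportions over the study period).
ci_exposed = 0.15     # risk in the exposed group
ci_unexposed = 0.05   # risk in the unexposed group
ci_population = 0.09  # risk in the general population (exposed and unexposed combined)

# AR: extra risk in the exposed group beyond the unexposed (background) risk.
attributable_risk = ci_exposed - ci_unexposed

# AR%: share of the exposed group's risk that is due to the exposure.
attributable_risk_percent = (ci_exposed - ci_unexposed) / ci_exposed * 100

# PAR: extra risk in the whole population beyond the unexposed risk.
population_attributable_risk = ci_population - ci_unexposed

print(round(attributable_risk, 3),
      round(attributable_risk_percent, 1),
      round(population_attributable_risk, 3))
```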
Odds Ratio (OR)
THE ODDS OF BEING EXPOSED
We speak about OR in terms of exposure vs non-exposure
= (exposed cases/non-exposed cases) / (exposed controls/non-exposed controls)
Explain talking about OR
The odds of developing DISEASE among the EXPOSED are OR times higher/lower than among the NON-EXPOSED.
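A minimal sketch of the odds ratio calculation and how the result might be reported. The case-control counts are made up.

```python
# Hypothetical case-control counts (not real data).
exposed_cases, unexposed_cases = 40, 60
exposed_controls, unexposed_controls = 20, 80

odds_of_exposure_in_cases = exposed_cases / unexposed_cases
odds_of_exposure_in_controls = exposed_controls / unexposed_controls
odds_ratio = odds_of_exposure_in_cases / odds_of_exposure_in_controls

# Equivalent cross-product form of the same ratio.
cross_product = (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

print(f"The odds of disease among the exposed are {odds_ratio:.2f} times "
      "those among the non-exposed.")
```

An OR above 1 reads as higher odds in the exposed, and an OR below 1 as lower odds.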
What are the assumptions when doing linear regression
y (strictly, the residuals) follows a normal distribution (check with a histogram or box plot)
The relationship between y and x is linear (check with a scatterplot)
The variance of the outcome is constant across different values of x (check with a residual plot)
What is the beta coefficient
Represents the average change in y for every one-unit increase in x (the slope of the line), holding any other covariates constant.
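For a single predictor, the beta coefficient is the least-squares slope. A sketch with made-up data, computed by hand rather than with a statistics package:

```python
from statistics import mean

# Hypothetical paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

x_bar, y_bar = mean(x), mean(y)

# Least-squares slope: sum of cross-deviations divided by sum of squared x deviations.
beta = (
    sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    / sum((xi - x_bar) ** 2 for xi in x)
)
intercept = y_bar - beta * x_bar

print(round(beta, 2), round(intercept, 2))
```

Here beta is read as: for each one-unit increase in x, y goes up by about beta units on average.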
How do we select significant variables
Run the regression model with all variables first.
Identify the least significant covariate (e.g. the one with the largest p value above 0.05) and drop it.
Refit the model and repeat until all remaining covariates are significant.
What does a chi square test do
Tests whether two categorical variables are associated, i.e. whether the pattern in the data is due to chance or to a real relationship between the variables.
Compares the observed frequencies with the frequencies we would expect if the null hypothesis (no association) were true.
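The observed-versus-expected comparison can be sketched for a 2x2 table. The counts are invented, no continuity correction is applied, and the p value shortcut using `math.erfc` only holds for 1 degree of freedom.

```python
from math import erfc, sqrt

# Hypothetical 2x2 table of observed counts:
# rows = exposed / unexposed, columns = disease / no disease.
observed = [[30, 70],
            [15, 85]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under H0 (no association) = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2) for j in range(2)
)

# A 2x2 table has 1 degree of freedom; this shortcut is valid for 1 df only.
p_value = erfc(sqrt(chi_square / 2))
print(round(chi_square, 2), round(p_value, 4))
```

The bigger the gap between observed and expected counts, the larger the chi square statistic and the smaller the p value.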
Why is logistic regression different to linear regression
The response variable is binary (binomial) rather than continuous.
The aim of logistic regression is to obtain an odds ratio.
Evaluating logistic regression model: chi square
The Hosmer-Lemeshow “goodness of fit” test gives a chi square statistic
low chi square = p>0.05 = good fit
high chi square = p<0.05 = poor fit
Evaluating logistic regression model: ROC
Receiver Operating Characteristic (ROC) curve and c-index
How well does the model discriminate between patients who develop and do not develop the outcome (according to its predictions)
The c-index (area under the ROC curve) measures this: 0.5 = no better than chance, 1 = perfect discrimination.
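One way to picture the c-index is as the proportion of (event, non-event) pairs in which the patient who had the event was given the higher predicted probability. The sketch below uses that pairwise definition with made-up predictions; `c_index` is a hypothetical helper, not a library function.

```python
def c_index(predicted_probs, outcomes):
    """Proportion of (event, non-event) pairs where the event has the higher prediction.

    Ties in predicted probability count as half a concordant pair.
    """
    events = [p for p, o in zip(predicted_probs, outcomes) if o == 1]
    non_events = [p for p, o in zip(predicted_probs, outcomes) if o == 0]
    concordant = 0.0
    for p_event in events:
        for p_non in non_events:
            if p_event > p_non:
                concordant += 1.0
            elif p_event == p_non:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))

# Hypothetical model predictions and observed outcomes (1 = developed the outcome).
probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]
outcome = [1, 1, 0, 1, 0, 0]
print(round(c_index(probs, outcome), 3))
```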
What is external validity
Develop the model on a creation data set, then test it on a separate validation data set to ensure the model works.
Sometimes a model performs well in the creation data set but is not valid in the other data set.
In that case the model is not appropriate for wider use.
What is a Hazards Ratio (HR)
Helps measure the effect of an intervention on an outcome of interest over time.
The ratio of the hazard (the risk of the event at a particular time point) for an individual who received the intervention to that of an individual in the control group.
= Hazard in intervention / hazard in control
- HR > 1: Factor increases risk of event.
- HR < 1: Factor decreases risk of event (i.e. protective).