basic data Flashcards
part 2 stats
types of data
Nominal Data
This describes categorical data without an order. Examples include blood groups (O, A, B, AB), eye colour and marital status.
Ordinal Data
Ordinal data are also categorical, but in this case categories have an order and can be ranked. Examples include stages of breast cancer. Importantly the “distances” between the different groups can be variable. For example, Likert responses may have the options “strongly agree”, “agree”, “neither agree nor disagree”, “disagree” and “strongly disagree”. Clearly these can be ordered, so this is an example of ordinal data, but the difference in agreement between “agree” and “strongly agree” may not be the same as that between “agree” and “neither agree nor disagree”.
Binary data
Binary, or dichotomous, data have only two possible outcomes. Common examples are Yes/No or True/False responses, but they could also include other common epidemiological outcomes, such as “survived” and “not survived”.
Numeric data
Numeric data can be discrete or continuous. Discrete data have fixed values. Examples include shoe size or number of people. Continuous data can take any value, frequently within a given range. Examples include weight and length (where the range would be from zero to, theoretically, infinity).
types of data scales
(i) A nominal scale uses numbers purely as labels and there is no intrinsic order to the values, for example, ethnic group. A nominal variable is used for mutually exclusive, but not ordered, categories. For example, a study might compare five different countries. You can code the five countries with numbers, but the numerical order is arbitrary.
(ii) Ordinal scales are qualitative and ordered, but without any mathematical relationship between the points, for example, social class. An ordinal variable is one where the order matters but not the difference between values.
(iii) Interval scales are ordered, and the intervals between consecutive points on the scale are equal. That is, interval scales are where the difference between two values is meaningful (e.g. temperature in centigrade or Fahrenheit).
(iv) Ratio scales are interval scales but with a true zero, e.g. weight. That is, ratio scales have all the properties of interval scales, and also have a clear definition of zero (e.g. height or weight).
https://www.fph.org.uk/media/1223/june-2011-final.pdf
how can you measure the spread of data
- range
- IQR
- variance/ SD
- coefficient of variation (SD/mean)
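A minimal sketch of the four measures above, using only the Python standard library. The function name `spread_summary` and the example values are illustrative assumptions, and quartiles are computed with the default method of `statistics.quantiles` (other quartile conventions give slightly different IQRs).

```python
import statistics

def spread_summary(values):
    """Return common measures of spread for a list of numbers (hypothetical helper)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    sd = statistics.stdev(values)                  # sample SD (n - 1 denominator)
    mean = statistics.mean(values)
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.variance(values),   # sample variance = SD squared
        "sd": sd,
        "cv": sd / mean,                           # coefficient of variation = SD / mean
    }

# Made-up example data
summary = spread_summary([2, 4, 4, 4, 5, 5, 7, 9])
print(summary)
```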
what is variance and standard deviation?
in formula sheet
what is standard error of the mean?
standard deviation of the sampling distribution of the sample mean (SD/√n)
in formula sheet
95% of sample means will fall within 1.96 SEM of the population mean, so the population mean lies within 1.96 SEM of the sample mean 95% of the time
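As a rough illustration of the 1.96 SEM rule, the sketch below computes the SEM as SD/√n and a 95% confidence interval for the mean. The helper name `sem_and_ci` and the sample values are assumptions made up for the example; for small samples a t multiplier would normally replace 1.96.

```python
import math
import statistics

def sem_and_ci(sample, z=1.96):
    """Mean, standard error of the mean and z-based CI (hypothetical helper)."""
    n = len(sample)
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(n)  # SEM = SD / sqrt(n)
    return mean, sem, (mean - z * sem, mean + z * sem)

# Made-up example data
mean, sem, ci = sem_and_ci([4.8, 5.1, 5.0, 4.9, 5.2, 5.0])
print(mean, sem, ci)
```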
what is the normal distribution
symmetrical around the mean (mean = median = mode), bell shaped
what is the p value
the probability of getting the observed result, or one that is more extreme, if the null hypothesis were true.
unpaired z test
parametric test for a difference between 2 independent groups when n is large
1) The data must be normally distributed.
2) All data points must be independent.
3) For each sample the variances must be equal.
- A z-score of 1.96 is equivalent to a two-tailed p-value of 0.05; therefore, a z-score >1.96 can be considered statistically significant at the 5% level
- for proportions the SE is calculated by
- se ≈ √( p(1−p)/n1 + p(1−p)/n2 )
where p = average (pooled) proportion for the two groups
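A sketch of the unpaired z test for two proportions, assuming p is the pooled proportion (all events divided by all subjects) and a two-tailed p value taken from the normal distribution. The function name and the counts are illustrative only.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z statistic and two-tailed p value for a difference in proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                         # pooled proportion
    se = math.sqrt(p * (1 - p) / n1 + p * (1 - p) / n2)
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))        # two-tailed normal tail area
    return z, p_value

# Made-up example: 30/100 events in group 1 versus 20/100 in group 2
z, p_value = two_proportion_z(30, 100, 20, 100)
print(round(z, 3), round(p_value, 3))  # z is below 1.96, so not significant at 5%
```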
paired z test
parametric test for large n and paired data
1) The data must be normally distributed.
2) All data points must be independent.
3) For each sample the variances must be equal.
z = (d − D) / √(σ²/n)
where d = mean of the differences between the samples,
D = hypothesised mean of the differences (usually this is zero),
n = the sample size and
σ² = the population variance of the differences.
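A small sketch of the paired z statistic, assuming the usual form z = (d − D) / √(σ²/n) with the population variance of the differences treated as known. The function name `paired_z` and the numbers are hypothetical.

```python
import math

def paired_z(differences, sigma2, D=0.0):
    """Paired z statistic and two-tailed p value (sigma2 assumed known)."""
    n = len(differences)
    d_bar = sum(differences) / n                 # mean of the paired differences
    z = (d_bar - D) / math.sqrt(sigma2 / n)
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-tailed normal tail area
    return z, p_value

# Made-up example: four paired differences, assumed population variance of 4
z_paired, p_paired = paired_z([2, 0, 1, 1], sigma2=4)
print(z_paired, round(p_paired, 3))
```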
unpaired t test
if small n (<30 normally)
parametric test
1) The data must be normally distributed.
2) All data points must be independent.
3) For each sample the variances must be equal.
ANOVA
parametric test to compare the mean outcome between 2+ exposure groups
can do two-way or multi-way ANOVA if more than one exposure
assumptions:
- outcome normally distributed,
- SD same for each exposure group
linear regression
normal distribution, linear relationship
can calculate a Pearson's correlation coefficient (parametric)
what is Bayes' theorem
P(A|B) = P(A ∩ B) / P(B)
P(A|B) = P(B|A) x P(A) / P(B)
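A worked sketch of the second form, with P(B) expanded by the law of total probability. The diagnostic-test framing (A = has disease, B = tests positive) and the numbers are assumptions chosen for illustration.

```python
def bayes_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """P(A|B) = P(B|A) x P(A) / P(B), with P(B) from total probability."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

# Made-up example: P(positive | disease) = 0.9, prevalence = 0.01,
# P(positive | no disease) = 0.05
posterior = bayes_posterior(0.9, 0.01, 0.05)
print(round(posterior, 3))  # about 0.154: most positives are false positives
```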
chi squared test
test for independence
large sample size (expected count of at least 5 in each cell)
to test whether the row and column variables of an r x c table are independent or associated
H0: variable 1 and variable 2 are independent.
H1: not independent.
for 2 x2 (1df) chi squared > 3.84 for p<0.05
how to calculate:
1. create 2x2 table
2. calculate expected values ((row sum * column sum) / table sum)
3. use the chi squared formula, sum of (observed − expected)² / expected, to work out the statistic
4. if it is >3.84, reject H0: they are associated.
use Fisher's exact test if n is small
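The calculation steps can be sketched as below. This is a plain Pearson chi squared without a continuity correction (an assumption, since the notes do not say which version is expected), and the 2x2 counts are made up.

```python
def chi_squared(table):
    """Pearson chi squared statistic for an r x c table of observed counts."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total  # (row sum * column sum) / table sum
            stat += (observed - expected) ** 2 / expected
    return stat

# Made-up 2x2 table: rows = exposed / unexposed, columns = outcome yes / no
stat = chi_squared([[30, 70], [20, 80]])
print(round(stat, 3), "reject H0" if stat > 3.84 else "fail to reject H0")
```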
chi squared test for trend
ordered categorical exposure variables. It tests the null hypothesis that there is no linear increase in the log odds per exposure group.
(e.g. menarche and small/medium/large fold test)
McNemar's Test
- used when have paired data
- to see if the outcome and exposure are independent
look at discordant pairs!!
o Assumption 1: You have one categorical dependent variable with two categories (i.e., a dichotomous variable) and one categorical independent variable with two related groups.
o Assumption 2: The two groups of your dependent variable are mutually exclusive.
o Assumption 3: The cases are a random sample from the population of interest.
for 1 df (χ² distribution)
χ² > 3.84, p <0.05!!
create 2 x 2 table of pairs:
       +    -
  +    a    r
  -    s    b
(r and s are the discordant pairs)
in formula sheet
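A sketch using the discordant cells r and s from the table. It assumes the common uncorrected form of the statistic, (r − s)² / (r + s); the formula sheet may use a continuity-corrected version instead. The counts are invented.

```python
def mcnemar_statistic(r, s):
    """Uncorrected McNemar chi squared (1 df) from the two discordant counts."""
    return (r - s) ** 2 / (r + s)

# Made-up example: 15 pairs discordant one way, 5 the other way
x2 = mcnemar_statistic(15, 5)
print(x2, "p < 0.05" if x2 > 3.84 else "p >= 0.05")
```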
Direct standardisation
way to adjust for age if you have age specific rates for the study population
procedure:
1. identify standard population
2. age specific rate from study population x standard population number for that stratum
3. sum all of these up
4. sum (ASR from study pop x standard pop number) / total standard population = Age standardised rate
check that the pattern of change in rates across the strata is the same
if 2 populations, can calculate comparative mortality ratio: just divide one age standardised rate by the other
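The procedure above can be sketched as follows. The three age strata, the rates and the standard population sizes are all made-up numbers, and the function name is hypothetical.

```python
def direct_standardised_rate(study_rates, standard_pop):
    """Age standardised rate: study age specific rates applied to a standard population."""
    expected_events = sum(rate * pop for rate, pop in zip(study_rates, standard_pop))
    return expected_events / sum(standard_pop)

standard_pop = [5000, 3000, 2000]     # standard population per age stratum
rates_a = [0.001, 0.005, 0.020]       # age specific rates, study population A
rates_b = [0.002, 0.004, 0.030]       # age specific rates, study population B

rate_a = direct_standardised_rate(rates_a, standard_pop)
rate_b = direct_standardised_rate(rates_b, standard_pop)
cmr = rate_b / rate_a                 # comparative mortality ratio
print(round(rate_a, 4), round(rate_b, 4), round(cmr, 3))
```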
Indirect standardisation
for when you DON'T have age specific rates for the study population
procedure:
1. identify standard population
2. apply standard population age specific rates to the study population to get EXPECTED number of deaths
3. SMR = observed/ expected
how much more/less likely to (die) compared to someone of the same age/sex in the standard population
(if SMR = 1, same as the standard population)
don't compare different SMRs as they may have different underlying populations
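A minimal sketch of the SMR calculation, again with invented strata, rates and counts. Only the total observed deaths in the study population are needed, which is the point of the indirect method.

```python
def smr(observed_deaths, standard_rates, study_pop):
    """Standardised mortality ratio = observed deaths / expected deaths."""
    expected = sum(rate * n for rate, n in zip(standard_rates, study_pop))
    return observed_deaths / expected

standard_rates = [0.001, 0.005, 0.020]   # age specific rates in the standard population
study_pop = [2000, 1000, 500]            # study population size per age stratum

ratio = smr(25, standard_rates, study_pop)  # 25 observed deaths, about 17 expected
print(round(ratio, 2))
```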
Wilcoxon signed rank
non parametric
similar to paired t test
null: median of differences between paired observations = 0
W > critical value: fail to reject the null
opposite to everything else, where a value bigger than the critical value would mean p is even lower than the threshold
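To show where W comes from, here is a sketch that ranks the absolute differences (average ranks for ties, zero differences dropped) and takes the smaller of the positive and negative rank sums. That definition of W is one common convention and is an assumption here, as are the example data; the critical value still has to be looked up in a table.

```python
def signed_rank_w(before, after):
    """Wilcoxon signed rank statistic W = min(sum of + ranks, sum of - ranks)."""
    diffs = [a - b for b, a in zip(before, after) if a != b]  # drop zero differences
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # extend j over a run of tied absolute differences
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1        # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return min(w_plus, w_minus)

# Made-up paired measurements
w = signed_rank_w([10, 12, 9, 14, 11], [12, 15, 8, 18, 16])
print(w)  # a small W is evidence against the null
```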
Wilcoxon rank sum/ Mann-Whitney U
non parametric
similar to unpaired t test
H0: difference between the medians will be 0
opposite to everything else, where a value bigger than the critical value would mean p is even lower than the threshold
bootstrapping
take repeated samples from the original sample, with replacement
if you do this 1000s of times you can create a CI
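A sketch of a percentile bootstrap confidence interval for the mean. The percentile method is only one of several ways to turn resamples into a CI, and the function name, number of resamples, seed and data are all assumptions for the example.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a statistic (hypothetical helper)."""
    rng = random.Random(seed)                      # fixed seed so the run is repeatable
    estimates = sorted(
        stat(rng.choices(sample, k=len(sample)))   # resample with replacement
        for _ in range(n_resamples)
    )
    lower = estimates[round(alpha / 2 * n_resamples)]
    upper = estimates[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Made-up example data
data = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
low, high = bootstrap_ci(data)
print(low, high)
```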
systematic review
the application of scientific strategies that limit bias by the systematic assembly, critical appraisal and synthesis of all relevant studies on a specific topic.
Likelihood ratio (+ve)
sensitivity / (1 - specificity)
P(test positive | have disease) / P(test positive | don't have disease)
post test probability
Post-test probability = post-test odds / (post-test odds + 1)
Post-test odds = pre-test odds * LR
Pre-test odds = pre-test probability / (1 - pre-test probability) (for population screening the pre-test probability is the PREVALENCE OF DISEASE)
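The three steps chain together as in this sketch, which uses the positive likelihood ratio, sensitivity / (1 - specificity). The prevalence, sensitivity and specificity values are invented for illustration.

```python
def post_test_probability(pre_test_prob, sensitivity, specificity):
    """Probability of disease after a positive test, via odds and LR+."""
    lr_positive = sensitivity / (1 - specificity)
    pre_test_odds = pre_test_prob / (1 - pre_test_prob)
    post_test_odds = pre_test_odds * lr_positive
    return post_test_odds / (post_test_odds + 1)

# Made-up screening example: prevalence 1%, sensitivity 90%, specificity 95%
prob = post_test_probability(0.01, 0.90, 0.95)
print(round(prob, 3))  # about 0.154
```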
ROC curve
axis and uses
x: 1 - specificity (false positive rate)
y: sensitivity (true positive rate)
Uses:
- to set a cut-off value for a test result (for continuous diagnostic variables)
- to compare the performance of different tests measuring the same outcome (test validation)
Area under ROC (AUROC): larger = better test
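A sketch of how the curve and its area can be built by hand: each distinct test value is tried as a cut-off ("positive if score >= cut-off", an assumed convention), giving one (1 - specificity, sensitivity) point, and the area is summed with the trapezium rule. Scores and labels are made up.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) at every cut-off, strictest first."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auroc(points):
    """Area under the ROC curve by the trapezium rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Made-up test results: label 1 = diseased, 0 = not diseased
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
area = auroc(roc_points(scores, labels))
print(round(area, 3))  # 0.889; 0.5 would be no better than chance
```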
What type of regression analysis should be used to assess the difference in survival time?
Cox regression
Kruskal-Wallis
It is a non-parametric test
It is a rank-based test
it is used to test whether two or more independent groups differ.
It is the non-parametric version of one-way independent ANOVA