basic data Flashcards
part 2 stats
types of data
Nominal Data
This describes categorical data without an order. Examples include blood groups (O, A, B, AB), eye colour and marital status.
Ordinal Data
Ordinal data are also categorical, but the categories have an order and can be ranked. Examples include stages of breast cancer. Importantly, the “distances” between the different groups can vary. For example, Likert responses may have the options “strongly agree”, “agree”, “neither agree nor disagree”, “disagree” and “strongly disagree”. These can clearly be ordered, so they are an example of ordinal data, but the difference in agreement between “agree” and “strongly agree” may not be the same as that between “agree” and “neither agree nor disagree”.
Binary data
Binary, or dichotomous, data have only two possible outcomes. Common examples are Yes/No or True/False responses, but they could also include other common epidemiological outcomes, such as “survived” and “not survived”.
Numeric data
Numeric data can be discrete or continuous. Discrete data can take only certain fixed values. Examples include shoe size or number of people. Continuous data can take any value, frequently within a given range. Examples include weight and length (where the range would be from zero to, theoretically, infinity).
types of data scales
(i) A nominal scale uses numbers purely as labels and there is no intrinsic order to the values, for example, ethnic group. A nominal variable is used for mutually exclusive, but not ordered, categories. For example, a study might compare five different countries. You can code the five countries with numbers, but the numerical order is arbitrary.
(ii) Ordinal scales are qualitative and ordered, but without any mathematical relationship between the points, for example, social class. An ordinal variable is one where the order matters but not the difference between values.
(iii) Interval scales are ordered, and the intervals between consecutive points on the scale are equal. That is, interval scales are where the difference between two values is meaningful (e.g. temperature in centigrade or Fahrenheit).
(iv) Ratio scales are interval scales but with a true zero, e.g. weight. That is, ratio scales have all the properties of interval scales, and also have a clear definition of zero (e.g. height or weight).
https://www.fph.org.uk/media/1223/june-2011-final.pdf
how can you measure the spread of data
- range
- IQR
- variance/ SD
- coefficient of variation (SD/mean)
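The measures of spread listed above can be sketched in a few lines of Python using only the standard library. The helper name `spread_summary` and the data are made up for illustration. The sketch assumes sample (n - 1) formulas for variance and SD, and uses the default quartile method of `statistics.quantiles`; other quartile conventions give slightly different IQRs.

```python
import statistics

def spread_summary(values):
    """Summarise the spread of a list of numbers (illustrative helper)."""
    # quantiles(n=4) returns the three quartile cut points (Q1, median, Q3)
    q1, _median, q3 = statistics.quantiles(values, n=4)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (divides by n - 1)
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.variance(values),  # sample variance
        "sd": sd,
        "cv": sd / mean,  # coefficient of variation = SD / mean
    }

# Made-up data for illustration only
values = [2, 4, 4, 4, 5, 5, 7, 9]
summary = spread_summary(values)
print(summary)
```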
what is variance and standard deviation?
in formula sheet
what is standard error of the mean?
standard deviation of the sampling distribution of the sample mean (SD/√n)
in formula sheet
95% of sample means will fall within 1.96 SEM of the population mean, so the population mean lies within 1.96 SEM of the sample mean 95% of the time (the 95% confidence interval)
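As a rough sketch of the SEM and the 95% interval around a sample mean: the function name `sem_and_ci` and the data are hypothetical, and the SEM is taken as sample SD divided by √n, which relies on the large-sample normal approximation.

```python
import math
import statistics

def sem_and_ci(sample, z=1.96):
    """Return (mean, SEM, 95% CI) using SEM = SD / sqrt(n)."""
    n = len(sample)
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(n)  # sample SD over root n
    ci = (mean - z * sem, mean + z * sem)          # mean +/- 1.96 SEM
    return mean, sem, ci

# Made-up data for illustration only
sample = [2, 4, 4, 4, 5, 5, 7, 9]
mean, sem, ci = sem_and_ci(sample)
print(mean, sem, ci)
```

With a sample this small a t multiplier would normally replace 1.96; the value is kept here only to mirror the card.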
what is the normal distribution
symmetrical around the mean (mean = median = mode), bell shaped
what is the p value
the probability of getting the observed value, or one that is more extreme, if the null hypothesis were correct.
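For a test statistic that follows the standard normal distribution, the definition above can be turned into a number. This is a minimal sketch assuming a two-tailed test; the function name `two_tailed_p_from_z` is made up for illustration.

```python
import math

def two_tailed_p_from_z(z):
    """Two-tailed p value for a z-score under the standard normal distribution."""
    # erfc(|z| / sqrt(2)) equals P(|Z| >= |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))

p = two_tailed_p_from_z(1.96)
print(p)
```

A z-score of 1.96 gives a p value of approximately 0.05, which is where the usual 5% cut-off comes from.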
unpaired z test
parametric test for a difference between 2 independent groups when n is large
1) The data must be normally distributed.
2) All data points must be independent.
3) For each sample the variances must be equal.
- A z-score of 1.96 is equivalent to a two-tailed p-value of 0.05; therefore, a z-score >1.96 can be considered statistically significant at the 5% level
- for proportions, the SE is calculated by
- SE ≈ √( p(1−p)/n1 + p(1−p)/n2 )
where p = average (pooled) proportion for the two groups
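The z test for two proportions can be sketched as follows. The function name and the counts are hypothetical, and p is assumed to be the pooled proportion, (x1 + x2)/(n1 + n2), which is one common reading of "average proportion".

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Return (z, se) for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)  # pooled proportion (assumption, see lead-in)
    se = math.sqrt(p * (1 - p) / n1 + p * (1 - p) / n2)
    z = (p1 - p2) / se
    return z, se

# Made-up counts: 30/100 events in group 1, 15/100 in group 2
z, se = two_proportion_z(30, 100, 15, 100)
print(z, se, abs(z) > 1.96)
```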
paired z test
parametric test for paired data when n is large
1) The data must be normally distributed.
2) All data points must be independent.
3) For each sample the variances must be equal.
z = (d − D) / √(σ²/n)
where d = mean of the differences between the samples,
D = hypothesised mean of the differences (usually this is zero),
n = the sample size and
σ² = the population variance of the differences.
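A minimal sketch of the paired z statistic, assuming the standard form z = (d − D) / √(σ²/n) with the symbols defined above. The function name and the data are made up, and the tiny sample is only there to make the arithmetic easy to follow.

```python
import math

def paired_z(before, after, sigma2, D=0.0):
    """z statistic for paired data, given the population variance of the differences."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    d = sum(diffs) / n  # mean of the paired differences
    return (d - D) / math.sqrt(sigma2 / n)

# Made-up paired measurements; differences are 2, 1, 3, 2 (mean 2)
before = [10, 12, 9, 11]
after = [12, 13, 12, 13]
z = paired_z(before, after, sigma2=4.0)
print(z)
```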
unpaired t test
used when n is small (normally <30)
parametric test
1) The data must be normally distributed.
2) All data points must be independent.
3) For each sample the variances must be equal.
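The card does not give the formula, so the following is only a sketch of the usual pooled-variance (Student) t statistic, which matches the equal-variance assumption listed above. The function name and data are hypothetical.

```python
import math
import statistics

def unpaired_t(a, b):
    """Return (t, df) for two independent samples, assuming equal variances."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return t, n1 + n2 - 2

# Made-up data for illustration only
t, df = unpaired_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t, df)
```

The t value is then compared with the critical value from a t table for the given degrees of freedom.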
ANOVA
parametric test to compare the mean outcome between 2+ groups of one exposure
can do 2-way or multi-way ANOVA if more than one exposure
assumptions:
- outcome normally distributed,
- SD the same in each exposure group
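A one-way ANOVA boils down to the ratio of between-group to within-group variability. The sketch below computes that F statistic by hand; the function name and the three groups are made up for illustration.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA: between-group MS over within-group MS."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group size times squared distance of group mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: squared distance of each value from its group mean
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up data: three exposure groups with means 2, 3 and 6
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat)
```

The F value is compared with an F table on (groups − 1, N − groups) degrees of freedom.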
linear regression
assumes normal distribution, linear relationship
can calculate a Pearson's correlation coefficient (parametric)
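As an illustration, the least-squares slope and intercept and Pearson's r can all be computed from the same sums of squares. The function name and the data are hypothetical.

```python
import math

def fit_line_and_r(x, y):
    """Return (slope, intercept, r) for simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson's correlation coefficient
    return slope, intercept, r

# Made-up data for illustration only
slope, intercept, r = fit_line_and_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(slope, intercept, r)
```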
what is Bayes' theorem
P(A|B) = P(A ∩ B) / P(B)
P(A|B) = P(B|A) x P(A) / P(B)
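A worked example may help: the probability of disease given a positive test. The numbers (1% prevalence, 90% sensitivity, 5% false positive rate) are invented, and P(B) is expanded with the law of total probability, which the card does not state explicitly.

```python
def posterior(prior, p_b_given_a, p_b_given_not_a):
    """P(A|B) via Bayes' theorem, with P(B) from the law of total probability."""
    p_b = p_b_given_a * prior + p_b_given_not_a * (1 - prior)
    return p_b_given_a * prior / p_b

# A = has disease, B = tests positive (made-up figures)
p_disease_given_positive = posterior(prior=0.01, p_b_given_a=0.90, p_b_given_not_a=0.05)
print(p_disease_given_positive)
```

Even with a sensitive test, the posterior is only about 15% here because the disease is rare.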
chi squared test
test for independence
large sample size (expected count >5 in each cell)
to test whether the rows and columns of an r x c table are independent or associated
H0: variable 1 and variable 2 are independent.
H1: not independent.
for a 2 x 2 table (1 df), chi squared > 3.84 gives p<0.05
how to calculate:
1. create 2x2 table
2. calculate expected values ((row sum * column sum) / table sum)
3. use the chi squared formula, the sum of (O − E)²/E over all cells, to work out the statistic
4. If it is >3.84, reject H0: they are associated.
use Fisher's exact test if n is small
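The calculation steps above can be sketched as follows, assuming the usual Pearson statistic (sum of (O − E)²/E, no continuity correction). The function name and the table are made up for illustration.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row sum * column sum / table sum
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Made-up 2x2 table: rows = exposed / unexposed, columns = outcome yes / no
table = [[20, 30], [30, 20]]
chi2 = chi_squared(table)
print(chi2, chi2 > 3.84)  # compare with 3.84, the 5% critical value on 1 df
```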
chi squared test for trend
used for ordered categorical exposure variables. It tests the null hypothesis that there is no linear increase in the log odds per exposure group.
e.g. menarche and small/medium/large fold test