Statistics Flashcards
What different types of data are there?
Quantitative - ‘values you can do maths on’, which is further subdivided into Continuous and Discrete:
Continuous data values depend on the unit and precision of measurement. E.g. height, BP, etc.
Discrete data values are whole numbers (counts). E.g. number of children, length of stay in days
Quantitative data can also be described as:
interval data, where zero is an arbitrary point on the scale and does not mean ‘none of the quantity’, e.g. temperature in °C
vs
ratio data, where zero means none of the quantity is present, so ratios between values are meaningful, e.g. heart rate, height, etc.
Qualitative - ‘values you can’t do maths on’, which is further subdivided into Ordinal and Nominal:
Ordinal data values can be arranged in a particular order or ranking, but the increments between groups are not equal, not known, or not measurable. E.g. ASA grade
Nominal data has groups with different labels, but no order or ranking. E.g. Gender
How do you choose which statistical test to use to analyse data?
In order to analyse data with appropriate statistical tests, we must first look at the characteristics of the data.
- Nature of the data (Quantitative, either continuous or discrete vs Qualitative, either ordinal or nominal)
- Distribution of the data (Normal/Parametric vs. Non-normal/Non-parametric)
- Number of groups (2 vs more than 2 groups)
- Paired vs unpaired groups
(Paired data is when comparison is made of the same group under two separate scenarios, whereas unpaired data is when comparison is made of two unrelated or independent groups)
What is normal/parametric and non-normal/non-parametric data?
These terms refer to the distribution of data within a dataset.
Qualitative data is always non-normal/non-parametric.
For quantitative data, a distribution curve can be created by plotting observed values on the x axis, and frequency on the y axis. If data is normally distributed, the curve is symmetrical and bell-shaped, and the mean, median and mode are the same.
If the data is non-normally distributed it is asymmetrical (such as skew in either direction) or not bell-shaped (such as bimodal distribution).
With skew, if the tail extends to the right this is rightward or positive skew, and if the tail extends to the left this is leftward or negative skew. For skewed data the mode, median and mean will not be the same. The mode is the most frequently occurring value and is the peak of the curve. The median has equal numbers of values above and below it, and is pulled towards the tail of the skewed data. The mean is the average of all the recorded values and is pulled furthest towards the tail, because extreme values affect it the most.
How would you describe data?
Quantitative data can be described according to its:
Central tendency (i.e. mean, median, and mode)
Mean (x̄) = Σx/n; Median = middle value; Mode = most common/frequent value.
For parametric data, we usually use the mean. For non-parametric data, we usually use the median value.
Spread (i.e. variance and standard deviations) of parametric data
Variance = Σ(x-x̄)^2 / (n-1)
[in original units squared]
SD = √[Σ(x-x̄)^2 / (n-1)] [in original units]; mean ± 1 SD includes about 68% of the data, mean ± 2 SD about 95% (i.e. 2.5% at either end excluded), and mean ± 3 SD about 99.7%.
Range (the lowest and highest values recorded) and interquartile range (IQR) for non-parametric data.
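As a rough illustration of the central tendency measures and the range described above, here is a minimal sketch using Python's standard statistics module. The length-of-stay values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import statistics

# Hypothetical lengths of stay in days (illustrative values only).
stays = [2, 3, 3, 4, 5, 6, 19]

mean = statistics.mean(stays)          # sum of values / number of values
median = statistics.median(stays)      # middle value once sorted
mode = statistics.mode(stays)          # most frequent value
data_range = (min(stays), max(stays))  # lowest and highest values recorded

print("mean:", mean, "median:", median, "mode:", mode, "range:", data_range)
```

In this made-up sample the single long stay pulls the mean above the median, which is one reason the median is usually preferred for skewed data.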
How would you display qualitative data in graphical form?
Data is non-numerical, and each group has a label. Data can first be displayed in a frequency table, and then in a bar or pie chart. Each category can be given as a percentage of the total observations.
How can you transform non-parametric data into parametric data?
Transformation of data refers to the application of mathematical functions to a dataset in an attempt to transform non-parametric data into parametric data, allowing more powerful statistical tests to be applied.
Right skew - √ or Log
Left skew - x^2
Exponential - Log or 1/x
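A minimal sketch of the right-skew case, assuming a small hypothetical dataset: after a log transform the mean and median move together, which is what we would expect of a more symmetrical distribution. The values are invented for illustration and the gap between mean and median is only a crude indicator of skew.

```python
import math
import statistics

# Hypothetical right-skewed values (illustrative only).
raw = [1, 10, 100, 1000, 10000]

# Apply a base-10 log transform to every value.
logged = [math.log10(x) for x in raw]

# Gap between mean and median before and after the transform.
raw_gap = statistics.mean(raw) - statistics.median(raw)
logged_gap = statistics.mean(logged) - statistics.median(logged)

print("raw:   mean - median =", raw_gap)
print("log10: mean - median =", logged_gap)
```

Any result calculated on transformed data is in transformed units, so it needs converting back before it is reported.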
How do you calculate variance?
Variance = Σ(x-x̄)^2 / (n-1)
[in original units squared]
Calculate the mean. Subtract the mean from each measured value and square each result so that all the values are positive. Add up these squared differences and divide by the degrees of freedom (n-1). The answer will be in the original units squared.
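The steps above can be sketched in a few lines of Python. The data values are hypothetical, and the final line simply cross-checks the hand calculation against the standard library's statistics.variance, which also divides by n-1.

```python
import statistics

# Hypothetical measurements (illustrative only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n                                   # step 1: the mean
squared_deviations = [(x - mean) ** 2 for x in values]   # step 2: (x - mean)^2
degrees_of_freedom = n - 1                               # step 3: n - 1
variance = sum(squared_deviations) / degrees_of_freedom  # step 4: sum / (n - 1)

print("variance by hand:", variance)
print("statistics.variance:", statistics.variance(values))
```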
How do you calculate standard deviation?
The standard deviation of parametric data is calculated as the square root of the variance.
SD = √[Σ(x-x̄)^2 / (n-1)] [in original units]
Mean ± 1 SD includes about 68% of the data, mean ± 2 SD about 95% (i.e. 2.5% at either end excluded), and mean ± 3 SD about 99.7%.
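A short sketch, assuming a perfectly normal distribution, of where the 68%, 95% and 99.7% figures come from. It uses statistics.NormalDist from the Python standard library (Python 3.8 or later); the sample values are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical measurements (illustrative only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Standard deviation is the square root of the variance.
sd = math.sqrt(statistics.variance(values))

# Proportion of a normal distribution lying within k SDs of the mean.
standard_normal = NormalDist(mu=0, sigma=1)
coverage = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

# Number of SDs that encloses exactly 95% of a normal distribution.
z_for_95 = standard_normal.inv_cdf(0.975)

print("SD:", sd)
for k, proportion in coverage.items():
    print("within", k, "SD:", round(proportion, 4))
print("SDs for exactly 95%:", round(z_for_95, 2))
```

The percentages are rounded: 2 SD covers slightly more than 95%, and exactly 95% corresponds to about 1.96 SD.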
What is meant by the standard error of the mean?
The standard error of the mean indicates how precisely the mean of the sample estimates the mean of the whole population.
It is inherent that the larger the sample size, the more likely the mean of the sample will reflect the mean of the whole population.
Also if the standard deviation is small, and hence the variance around the mean is small, then again you can be more confident that the mean of the sample is close to the mean of the whole population.
The standard error of the mean is calculated by dividing the standard deviation by the square root of the sample size (n).
SE = SD / √n = √[Σ(x-x̄)^2 / (n-1)] / √n
The standard error of the mean can be thought of as the standard deviation of the sample means, so about 68% of sample means will lie within 1 SE of the true population mean, about 95% within 2 SE, and about 99.7% within 3 SE.
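A minimal sketch of the calculation, using the conventional definition SE = SD / √n. The function name standard_error and the data are hypothetical. The larger sample is just the small one repeated, purely to show the effect of n while keeping the spread roughly the same.

```python
import math
import statistics

def standard_error(values):
    """SE of the mean: sample SD divided by the square root of n."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Hypothetical measurements (illustrative only).
small_sample = [98, 102, 95, 105, 100, 104, 96, 100]
# Same values repeated four times: similar spread, four times the sample size.
large_sample = small_sample * 4

se_small = standard_error(small_sample)
se_large = standard_error(large_sample)

print("n =", len(small_sample), "SE =", round(se_small, 3))
print("n =", len(large_sample), "SE =", round(se_large, 3))
```

Quadrupling the sample size roughly halves the SE, which matches the point that larger samples give a more reliable estimate of the population mean.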
What are confidence limits?
Confidence limits are used to describe the range over which the likely true answer will fall OR the range of plausible values based on the observed sample. [It is NOT a range that has a 95% probability of containing the true value: the true value is fixed, and any one interval either contains it or does not]
Confidence limits are related to the SE of the mean. The range from 2 SE below the sample mean to 2 SE above it (more precisely 1.96 SE) is called the 95% confidence interval (CI), and the values at either end are the confidence limits.
Strictly speaking a 95% confidence interval means that if we were to take 100 different samples and compute a 95% confidence interval for each sample, then approximately 95 of the 100 confidence intervals will contain the true mean value (μ).
The confidence limits are in the same units as the data measurements, which makes them much easier to interpret.
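A rough sketch of a confidence interval for a mean, assuming a normal approximation (mean ± about 1.96 SE, with SE = SD / √n). The function name confidence_interval and the blood pressure readings are hypothetical. For small samples a t-distribution multiplier is normally used instead, which this sketch does not do.

```python
import math
import statistics
from statistics import NormalDist

def confidence_interval(values, level=0.95):
    """Normal-approximation CI for the mean: mean +/- z * SE."""
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return mean - z * se, mean + z * se

# Hypothetical systolic blood pressure readings in mmHg (illustrative only).
systolic_bp = [118, 122, 125, 130, 121, 127, 119, 124, 128, 126]

lower, upper = confidence_interval(systolic_bp)
lower_99, upper_99 = confidence_interval(systolic_bp, level=0.99)

print("95% CI:", round(lower, 1), "to", round(upper, 1), "mmHg")
print("99% CI:", round(lower_99, 1), "to", round(upper_99, 1), "mmHg")
```

Asking for a higher confidence level gives a wider interval, and both limits come out in mmHg, the units of the original readings.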
Can you use the SE of the mean for non-parametric data?
No!
For data that is skewed, the standard deviation does not accurately reflect the variation of data around the mean. The SE of the mean is derived from the standard deviation, so it is not a reliable summary of skewed data either.
Instead, for non-parametric data we tend to quote the median, the range, and the interquartile range, within which the middle 50% of the results lie (i.e. between the first and third quartiles).
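A minimal sketch of these summaries using statistics.quantiles from the Python standard library (Python 3.8 or later). The data are hypothetical. Quartile conventions differ slightly between software packages, so other tools may report marginally different cut points for the same data.

```python
import statistics

# Hypothetical skewed lengths of stay in days (illustrative only).
stays = [1, 2, 3, 4, 5, 6, 40]

# n=4 returns the three cut points: first quartile, median, third quartile.
q1, median, q3 = statistics.quantiles(stays, n=4)
iqr = q3 - q1
data_range = (min(stays), max(stays))

print("median:", median)
print("IQR:", q1, "to", q3, "(width", iqr, ")")
print("range:", data_range)
print("mean for comparison:", round(statistics.mean(stays), 1))
```

The one extreme value stretches the range and the mean, while the median and interquartile range are unaffected by it.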
What do we mean by the p value?
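The p value is compared with the alpha level, which is the type 1 error rate accepted in advance (conventionally 0.05). The two are not the same thing.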
The p value is the probability of getting the observed data, or something more extreme, when the null hypothesis is true.
In other words: how odd/surprising is the result I have observed if the null hypothesis is true?
If the result is sufficiently odd, we would assess the size of the difference and the reproducibility of the findings in other studies, and try to determine whether we should reject the null hypothesis.
[It is NOT the probability the null hypothesis is false - if you compared two identical interventions and there was a natural variation in observed results, the p value would clearly not be the probability that the difference was caused by the observed intervention]
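To make "the observed data, or something more extreme" concrete, here is a small sketch of an exact two-sided p value. The scenario is hypothetical: 10 patients each try two treatments and 9 prefer treatment A, with the null hypothesis that either preference is equally likely. The function name binomial_two_sided_p is made up for this example.

```python
from math import comb

def binomial_two_sided_p(successes, trials):
    """Exact two-sided p value under H0: P(success) = 0.5.

    Adds up the probability of every outcome at least as far from the
    expected count (trials / 2) as the observed count.
    """
    observed_distance = abs(successes - trials / 2)
    ways = 0
    for k in range(trials + 1):
        if abs(k - trials / 2) >= observed_distance:
            ways += comb(trials, k)
    return ways / 2 ** trials

# Hypothetical result: 9 of 10 patients prefer treatment A.
p = binomial_two_sided_p(9, 10)
print("p =", round(p, 4))
```

Here the outcomes counted as "at least as extreme" are 0, 1, 9 and 10 preferences for A, which together account for 22 of the 1024 equally likely patterns under the null hypothesis.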
What is the null hypothesis?
When comparing groups, the null hypothesis states that there is no difference between them with respect to a particular variable. A research study then tests whether the data provide enough evidence to reject the null hypothesis.
What is a type 1 error?
A type 1 error, also referred to as an alpha error, is:
- a false positive
- the null hypothesis is wrongly rejected
- a difference is found when there is none
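A small simulation sketch of the idea above: when two groups are drawn from the same population, a test at the 5% level still "finds" a difference in roughly 5% of studies. Everything here is an assumption made for illustration: the group size, the number of simulated studies, the fixed random seed, and the use of a simple z test with a known SD of 1.

```python
import math
import random
import statistics

random.seed(1)       # fixed seed so the run is repeatable
CRITICAL_Z = 1.96    # two-sided 5% threshold
N_PER_GROUP = 20
N_STUDIES = 2000

def false_positive_rate():
    """Fraction of simulated studies that wrongly reject a true null."""
    false_positives = 0
    # SE of the difference in means when both groups have a known SD of 1.
    se_diff = math.sqrt(2 / N_PER_GROUP)
    for _ in range(N_STUDIES):
        # Both groups come from the same population, so the null is true.
        group_a = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
        group_b = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
        z = (statistics.fmean(group_a) - statistics.fmean(group_b)) / se_diff
        if abs(z) > CRITICAL_Z:
            false_positives += 1
    return false_positives / N_STUDIES

rate = false_positive_rate()
print("false positive rate:", rate)
```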
What is a type 2 error?
A type 2 error, also referred to as a beta error, is:
- a false negative
- the null hypothesis is not rejected when there is actually a difference between groups
The risk of a type 2 error is higher when the sample size is small, when variation in the study population is large, and when the difference that is clinically important is small.
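A rough sketch of how these three factors interact, expressed as power (1 minus the type 2 error rate). It assumes a two-sided z test comparing two equal-sized groups with a known common SD, and ignores the negligible chance of rejecting in the wrong direction. The function name approximate_power and all the numbers are hypothetical.

```python
import math
from statistics import NormalDist

def approximate_power(difference, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group z test (known SD)."""
    z = NormalDist()
    se_diff = sd * math.sqrt(2 / n_per_group)  # SE of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)          # about 1.96 when alpha = 0.05
    return z.cdf(abs(difference) / se_diff - z_crit)

baseline = approximate_power(difference=5, sd=10, n_per_group=64)
fewer_patients = approximate_power(difference=5, sd=10, n_per_group=16)
more_variation = approximate_power(difference=5, sd=20, n_per_group=64)
smaller_difference = approximate_power(difference=2, sd=10, n_per_group=64)

for label, power in [
    ("baseline", baseline),
    ("fewer patients", fewer_patients),
    ("more variation", more_variation),
    ("smaller difference", smaller_difference),
]:
    print(label, "power:", round(power, 2), "type 2 error rate:", round(1 - power, 2))
```

Each change from the baseline (fewer patients, more variation, a smaller difference to detect) lowers the power and so raises the chance of a type 2 error.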