descriptive statistics Flashcards
Categorical Data (factor, nominal)
- No particular relationship between the different possibilities
- Example: what prison sentence does someone have?
- Answers might be suspended, determinate, indeterminate
- Can’t average them or do maths with them
- Doesn’t make sense to talk about “average” prison sentence
- But you could talk about the most / least frequently occurring prison sentence
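A minimal sketch of counting categories in Python. The list of sentences is made-up data for illustration only, and the variable names are hypothetical:

```python
from collections import Counter

# Hypothetical sentence types for ten people (not real data)
sentences = [
    "determinate", "suspended", "determinate", "indeterminate", "determinate",
    "suspended", "determinate", "suspended", "determinate", "indeterminate",
]

# Count how often each category occurs, most frequent first
counts = Counter(sentences)
ordered = counts.most_common()

most_frequent = ordered[0]    # (category, count) with the highest count
least_frequent = ordered[-1]  # (category, count) with the lowest count

print(most_frequent)
print(least_frequent)
```

Counting is the only arithmetic that makes sense here: the categories themselves cannot be added or averaged.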
Continuous data (interval)
- Goes in a specific order
- Example: what is a patient’s weight today?
- Can do maths with interval data
- It would make sense to talk about average weight or weight increase or decrease
- It’s meaningful to say someone who is 60kg is 10kg heavier than someone who is 50kg
Ordinal Data (ordered categorical, ordered factor, Likert scale)
• Like categorical but there is an order to the sequence
• Example: how tired are you feeling right now? Pick one of the following options
1. Very tired
2. Tired
3. Alert
4. Very alert
• Can’t do maths with ordinal data
• Like categorical data, we can talk about the most chosen and least chosen options, but not the average tiredness
Mode
• The score / value / number / response that happens most often
• You can have more than one mode
• One mode = unimodal
• Two modes = bimodal
• Can take modes of continuous data too
What is the mode of the variable bdi.8m, shown on the right? Interpret the result in the context of the data
A score of 0 is the mode (it has 7 appearances or “counts” in the data)
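As a rough illustration, the mode can be found with Python's standard library. The scores below are invented for the example and are not the bdi.8m values:

```python
from statistics import multimode

# Hypothetical depression scores (not the real bdi.8m data)
scores = [0, 0, 0, 2, 3, 3, 5, 9]

# multimode returns every value tied for the highest count
modes = multimode(scores)
print(modes)  # one mode: unimodal

# A second made-up dataset with two equally common values
two_peaks = [1, 1, 4, 4, 7]
bimodal_modes = multimode(two_peaks)
print(bimodal_modes)  # two modes: bimodal
```

`multimode` is used rather than `mode` because it reports all modes when there is a tie.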
median
- The middle number
- Useful for continuous data, and can also be used for ordinal data because it only relies on the order of the values
- Sort data in ascending order
- Find the number in the middle of the dataset
- You cannot have more than one median
- If there is an even number of values there is no single middle value, so you take the average of the two middle values
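The steps above can be sketched as a small Python function. The function name and the example numbers are illustrative assumptions:

```python
def median(values):
    """Return the middle value of a list of numbers."""
    ordered = sorted(values)          # sort data in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd count: a single middle value
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median([7, 1, 4, 9, 3])   # sorted: 1, 3, 4, 7, 9
even_result = median([7, 1, 4, 9])     # sorted: 1, 4, 7, 9

print(odd_result)
print(even_result)
```

The standard library's `statistics.median` follows the same approach and could be used instead.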
mean
What most people mean when they say “average”
Only useful for continuous data
It is the sum (total) of all the values divided by the number of values
Would need to know more about the measurement scale used
Let’s formalise that in a formula: $\bar{X} = \frac{\sum x}{n}$
The mean is affected by outliers
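A short sketch of the mean as "sum divided by count", using made-up weights in kg. The second list adds one extreme value to show how strongly a single outlier can pull the mean:

```python
# Hypothetical patient weights in kg (illustrative only)
weights = [50, 60, 70]

# Sum of all the values divided by the number of values
mean_weight = sum(weights) / len(weights)
print(mean_weight)

# The same data with one extreme value added
weights_with_outlier = weights + [220]
mean_with_outlier = sum(weights_with_outlier) / len(weights_with_outlier)
print(mean_with_outlier)
```

Three of the four weights are 70kg or below, yet the mean jumps from 60 to 100 once the extreme value is included.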
variability
- So far we have talked about centres or “averageness” in the data
- Another type of statistic we need to calculate to understand our data: measures of variability
- How spread out the data are
- How far away from the mean or median do the datapoints tend to be
- Bdi.pre mean was 23.33 – how near to this value do most of the patients’ depression scores tend to be?
range
- Simply the highest value minus the lowest value
- 14 - 0 = 14
- Know the boundaries of our data
- Useful to detect outliers or data input errors
- But doesn’t tell you how common really high or low numbers are
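A minimal sketch of the range calculation. The scores are invented, chosen only so that the lowest and highest values match the 14 - 0 example:

```python
# Hypothetical scores whose minimum is 0 and maximum is 14
scores = [0, 2, 3, 3, 5, 9, 14]

lowest = min(scores)
highest = max(scores)

# Range = highest value minus lowest value
data_range = highest - lowest
print(lowest, highest, data_range)
```

Printing the minimum and maximum alongside the range is a quick way to spot a value that falls outside what the measurement scale allows.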
interquartile range
- Split our data into quarters
- Each quarter contains 25% of the datapoints
- To do that we need to find the quartiles
- The three points that split the data into the quarters
interpreting the IQR
- The first 13 people in the variable bdi.8m have a depression score IQR of 9
- The IQR is the range (max-min) of the middle 50% of the data
- IQR plays a key role in data visualization (boxplots)
- IQR is useful as it is less affected by outliers than the measures that follow (variance and standard deviation)
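A sketch of finding the quartiles and the IQR with Python's `statistics.quantiles`. The data are made up, and note that different software packages use slightly different rules for locating quartiles, so results can differ a little between tools:

```python
from statistics import quantiles

# Hypothetical scores (illustrative only)
scores = [1, 3, 4, 6, 8, 10, 12, 15, 20, 25, 30]

# n=4 returns the three cut points that split the data into quarters
q1, q2, q3 = quantiles(scores, n=4)
iqr = q3 - q1
data_range = max(scores) - min(scores)

# Replace the largest value with an extreme outlier
with_outlier = scores[:-1] + [300]
q1_out, _, q3_out = quantiles(with_outlier, n=4)
iqr_out = q3_out - q1_out
range_out = max(with_outlier) - min(with_outlier)

print(iqr, data_range)
print(iqr_out, range_out)
```

In this example the outlier changes the range a great deal but leaves the IQR untouched, because the IQR only looks at the middle 50% of the data.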
variance
- How far numbers are spread out from the mean
- Big number that isn’t useful on its own
- Feeds into other statistics that we’ll use lots
interpreting the variance
- Variance is
- A really big number
- Not in the original units
- It is difficult to interpret – this is a problem with using the measure
- Not useful to say that the spread of depression scores before treatment was 89.12 around the mean
- So we need to ‘undo’ the squaring that we did earlier
- This means that the result is interpretable in the same units as the data
- Which gives us the standard deviation...
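To show where the squaring comes from, here is a step-by-step sketch of the variance on made-up numbers. It assumes the sample variance (dividing by n - 1), which is what `statistics.variance` computes; the notes do not say which version is intended:

```python
import statistics

# Hypothetical scores (illustrative only)
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean_score = sum(scores) / len(scores)

# Distance of each value from the mean, squared so negatives do not cancel out
squared_deviations = [(x - mean_score) ** 2 for x in scores]
sum_of_squares = sum(squared_deviations)

# Assumption: sample variance, dividing by n - 1
variance = sum_of_squares / (len(scores) - 1)

print(mean_score, sum_of_squares, variance)
print(statistics.variance(scores))  # library result for comparison
```

Because every distance is squared, the result is in squared units, which is why it is hard to interpret on its own.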
standard deviation
Just the square root of the variance
interpreting standard deviation
- Roughly the average distance between the values in the dataset and the mean of that dataset
- Most often used to understand the variability in continuous / ordinal data
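As a quick worked example, taking the square root of the pre-treatment variance quoted earlier (89.12) gives a standard deviation back in the original score units. The variable names are illustrative:

```python
import math

# Variance of the pre-treatment depression scores, as quoted in the notes
variance = 89.12

# The square root undoes the squaring and returns to the original units
standard_deviation = math.sqrt(variance)

print(round(standard_deviation, 2))
```

A standard deviation of roughly 9.44 points is far easier to read against a mean of 23.33 than a variance of 89.12.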
Calculating skew: Pearson’s coefficient of skewness
- Negative number means data are negatively skewed
- Positive number means data are positively skewed
- Symmetrical data has skew of zero
μ = mean, ν = median, σ = standard deviation
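The formula itself is not shown in these notes. A common form that uses exactly these three symbols is Pearson's second (median) skewness coefficient, 3(μ - ν) / σ, and the sketch below assumes that version. The function name and all three datasets are made up, and the sample standard deviation is an assumption too:

```python
from statistics import mean, median, stdev


def pearson_median_skew(values):
    """Assumed formula: 3 * (mean - median) / standard deviation."""
    return 3 * (mean(values) - median(values)) / stdev(values)


# Hypothetical datasets (illustrative only)
right_tail = [1, 2, 2, 3, 3, 3, 4, 10]   # long tail of high values
left_tail = [1, 7, 8, 8, 9, 9, 9, 10]    # long tail of low values
symmetric = [1, 2, 3, 4, 5]

positive_skew = pearson_median_skew(right_tail)
negative_skew = pearson_median_skew(left_tail)
zero_skew = pearson_median_skew(symmetric)

print(positive_skew, negative_skew, zero_skew)
```

The sign follows from the mean being pulled towards the long tail while the median stays near the bulk of the data.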