lecture 4 - visualising data beyond summary statistics Flashcards
summary of data visualisation
- Only for unimodal and symmetrical distributions do all the “averages” we have considered genuinely “fit” the central tendency of the distribution.
- Consider bimodal data: the mean and the median are each a single value, but bimodal data has more than one place where the data lumps together. So neither can sit at both modes, and they often fall between the modes, where there is little data.
- Visualising only summary statistics (either when exploring the data yourself, or when presenting it to others) hides information about the full distribution. In turn, this means (among other problems):
- We can’t see whether or not the summary statistics truly do a good job of representing the whole set of data.
- We are “ignoring” information about individual differences.
- We are “ignoring” whether the distribution fits any assumptions we might want to make about it.
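A minimal sketch of the bimodal point, using Python's standard `statistics` module. The scores are made up for illustration and are not from the lecture.

```python
import statistics

# Hypothetical bimodal sample: scores lump together around 2 and around 8.
scores = [1, 2, 2, 2, 3, 7, 8, 8, 8, 9]

mean = statistics.mean(scores)        # a single value
median = statistics.median(scores)    # a single value
modes = statistics.multimode(scores)  # every most-common value, in order of appearance

print(f"mean={mean}, median={median}, modes={modes}")
```

With these made-up scores the mean and median both come out at 5, a score that nobody in the sample actually has, while the two modes are 2 and 8. Reporting only the mean or median would hide both lumps.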
problem with summary stats
- The key “problem” with the summary-stats-only approach to data visualisation is that it presents only a subset of the information in the data.
- So, the key principle behind more advanced/modern methods of presentation is to include as much of the information in the original dataset as is possible (without creating a “can’t see the forest for the trees” problem).
There are also lots of other things modern visualisation methods can bring – but for today I’ll be concentrating on ways to give a “richer” view of the data.
summary statistics representation
Plotting the mean with SEM (remember: SEM = SD / √n, where n is the sample size).
Looking at it in a bar chart, the mean appears to be the same for all samples, with only very minor differences in variability.
So, from a “classic” summary statistics approach, the 6 sets of data look very similar. A raincloud plot is a better representation.
SEM (standard error of the mean) represents the variability of the sample mean, i.e. how much the mean would be expected to vary across different samples drawn from the same population.
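The SEM formula can be sketched in a few lines of Python. This assumes SD means the sample standard deviation (n − 1 in the denominator), which is what `statistics.stdev` computes. The function name `sem` and the scores are illustrative only.

```python
import math
import statistics

def sem(sample):
    """Standard error of the mean: sample SD divided by the square root of n."""
    n = len(sample)
    sd = statistics.stdev(sample)  # sample SD (assumption: n - 1 denominator)
    return sd / math.sqrt(n)

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores
print(statistics.mean(sample), round(sem(sample), 3))
```

Because n is in the denominator, a larger sample with the same spread gives a smaller SEM.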
the ‘raincloud’ plot
works for ratio and interval data as needed for summary statistics.
Three “elements” (from the bottom up):
The “rain” or individual data (often “jittered” to help show the density of scores when they overlap);
A “summary” plot – here a mean with SEM as before (but more commonly a boxplot);
And the “cloud” – a smoothed representation of the distribution eg normal, unimodal, bimodal, skewed
It’s a type of ‘rich’ data representation.
It does a good job of representing the whole sample, as we can see the underlying data and not just the summary statistics.
The plot can also be rotated.
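To make the three elements concrete, here is a rough sketch that computes (but does not draw) each layer using only the Python standard library. The function name `raincloud_elements`, the jitter size, and the rule-of-thumb bandwidth for the smoothed “cloud” are all assumptions of this sketch. It is not how JASP or any particular package builds its plots.

```python
import math
import random
import statistics

def raincloud_elements(data, jitter=0.1, bandwidth=None, grid_points=50, seed=0):
    """Compute the rain, summary and cloud layers for one group of scores."""
    rng = random.Random(seed)
    n = len(data)
    sd = statistics.stdev(data)

    # "Rain": each score keeps its value; a small random offset on the
    # other axis stops overlapping scores from hiding one another.
    rain = [(x, rng.uniform(-jitter, jitter)) for x in data]

    # "Summary": mean with SEM (a boxplot is the more common choice).
    summary = (statistics.mean(data), sd / math.sqrt(n))

    # "Cloud": Gaussian kernel density estimate on an evenly spaced grid.
    # The default bandwidth is a common rule of thumb (assumption).
    if bandwidth is None:
        bandwidth = 1.06 * sd * n ** (-1 / 5)
    lo = min(data) - 3 * bandwidth
    hi = max(data) + 3 * bandwidth
    step = (hi - lo) / (grid_points - 1)
    norm = n * bandwidth * math.sqrt(2 * math.pi)
    cloud = []
    for i in range(grid_points):
        g = lo + i * step
        density = sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in data) / norm
        cloud.append((g, density))
    return rain, summary, cloud

scores = [1, 2, 2, 2, 3, 7, 8, 8, 8, 9]  # hypothetical bimodal sample
rain, summary, cloud = raincloud_elements(scores)
print(summary)
```

The three results could then be handed to any plotting tool: `rain` as a scatter of points, `summary` as a point with an error bar, and `cloud` as a filled curve. Swapping which axis carries the scores is what rotating the plot amounts to.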
graphs
on word doc
interim summary
- Only unimodal and symmetrical distributions are reasonably fully represented by summary statistics alone.
- BUT, even if the data is unimodal and symmetrical, we would not know that if we only ever looked at summary statistics.
- The “raincloud” is one plot that seeks to overcome this problem by representing summary statistics PLUS individual data and distribution.
- There are many other ways to achieve similar ends – but they all share the property of representing summary statistics plus the distribution and/or individual data.
- Boxplots are a “classic” plot that also shows some distributional information, but not as richly as the raincloud.
So, boxplots can be a good start at richer data representation (and perhaps easier to produce than rainclouds or the like), but are not quite as informative.
Boxplot
“Classic” boxplot shows:
Median (thick central bar).
1st and 3rd quartiles (bottom and top of box) – the difference is the IQR.
Range of data within 1.5 IQR of 1st and 3rd quartiles (the “error bars”, usually called whiskers).
Any outliers (i.e. outside the range noted above).
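The ingredients of the classic boxplot can be computed directly. This is a sketch using Python's `statistics.quantiles`. The helper name `boxplot_stats` and the data are hypothetical, and the `"inclusive"` quartile method is an assumption: different programs use slightly different quartile conventions, so their boxes may not match exactly.

```python
import statistics

def boxplot_stats(data):
    """Return the values a classic boxplot draws for one group."""
    # Quartile convention is an assumption; other software may differ.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = sorted(x for x in data if x < low_fence or x > high_fence)
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        # Whiskers reach the most extreme scores still inside the fences.
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data with one extreme score
print(boxplot_stats(scores))
```

With these made-up scores, 30 lies more than 1.5 IQR above the 3rd quartile, so it is reported as an outlier and the upper whisker stops at 9.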
notes on making raincloud
- Neither SPSS nor Excel makes raincloud plots.
- But SPSS does histograms and violin plots (the “violin” is another way of representing the same distribution information as the “cloud” part of a raincloud plot); and lots of programs do boxplots.
- I happened to make these plots using “JASP”.
- This is a free, open-source, tool for statistical analysis and visualisation (“Jamovi” is another free, open-source, tool).
- Both JASP and Jamovi are actually nice user front-ends for the programming language R (again, free and open-source).
- R (as well as Python – another free and open-source language) has many good statistical and visualisation tools/packages.
- BUT – we do not expect you to use anything beyond SPSS or Excel (although feel free to do so if you like).
What matters is understanding the limitations of summary-only presentation, and how one might address that problem.
making inferences about data
- Is there a difference between groups?
- E.g. does psychoanalysis help depressed patients? Does anxiety differ between groups taking different amounts of exercise?
- But how do we make decisions when we only have a sample of all the possible data?
- We have just seen examples where there is a suggestion of a difference between groups, but also enough overlap in the data to question whether that suggestion is “real”.
- Need something better than “looks like a difference to me”…
Statistical inference is about formalising this sort of comparison, essentially asking about whether the data fits the assumption that two groups are the same or not.
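One way to see what “formalising the comparison” can look like is a permutation test, sketched below. This technique is swapped in purely for illustration and is not named in these notes. The function name `permutation_p_value`, the number of shuffles, and the data are all assumptions. The idea: if the two groups really are the same, the group labels are arbitrary, so we shuffle the labels many times and count how often a difference at least as large as the observed one turns up.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_shuffles=2000, seed=0):
    """Share of label shuffles giving a mean difference >= the observed one."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign scores to groups at random
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    return count / n_shuffles

low_exercise = [11, 12, 13, 14, 15, 16]  # hypothetical anxiety scores
high_exercise = [1, 2, 3, 4, 5, 6]       # hypothetical anxiety scores
print(permutation_p_value(low_exercise, high_exercise))
```

A small result means the observed difference would rarely arise if the groups were the same. A result near 1 means the difference is entirely unremarkable under that assumption.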
analogy to statistics
Want to maximise ‘correct’ and minimise ‘wrong’ decisions.
Get around not knowing the ‘truth’ by introducing a ‘reasonable doubt’ criterion.
This criterion marks out events that are possible but not probable.
Hypothesis testing (an example of formal statistical inference) is a direct analogy to the court-case we imagined.
There are many statistical methods to help in deciding what is not probable (in the long run).
Much of the rest of the course will be about this sort of statistical inference.