Statistical models 1 Flashcards
Why summarise data?
- We can make general statements beyond specific observations.
- Typically done using tables or graphs.
Summarising data using tables
- Frequency distributions
- Cumulative distributions
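As a rough illustration of the two table types, the sketch below builds a frequency distribution and a cumulative distribution. The `scores` list is made-up example data, not something from the course.

```python
from collections import Counter
from itertools import accumulate

scores = [2, 3, 3, 4, 4, 4, 5, 5, 6]  # made-up example data

# Frequency distribution: how many times each value occurs, in value order.
freq = dict(sorted(Counter(scores).items()))

# Cumulative distribution: running total of the frequencies, in value order.
cumulative = dict(zip(freq.keys(), accumulate(freq.values())))

print("value  frequency  cumulative")
for value in freq:
    print(f"{value:>5}  {freq[value]:>9}  {cumulative[value]:>10}")
```

The last cumulative count always equals the number of observations.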
Summarising data using graphs
- Histograms
What is a distribution?
A description of how often each value (or range of values) occurs in the data you have for one variable.
Properties of distributions
* What the central tendency is (mean, median or mode).
* How symmetrical the data is either side of the mean (skew).
* How variable the data is (e.g. range, standard deviation).
* How peaked or flat the data is around the centre (kurtosis).
* If it’s a “normal distribution”.
Central tendency (the average)
- Mean: (sum of values) divided by (number of values).
- Median: middle value in a list ordered from smallest to largest. 50th percentile.
- Mode: most frequently occurring value on the list.
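The three averages can be computed with Python's standard `statistics` module. This is a minimal sketch on made-up data. The single extreme score (15) is included on purpose to show that it pulls the mean upwards but leaves the median and mode alone.

```python
import statistics

scores = [2, 3, 3, 4, 4, 4, 5, 5, 15]  # made-up data with one extreme score

mean = statistics.mean(scores)      # (sum of values) / (number of values)
median = statistics.median(scores)  # middle value of the sorted list
mode = statistics.mode(scores)      # most frequently occurring value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```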
Skew (symmetry of distribution)
Positive skew: the tail points to the right (towards higher values).
Negative skew: the tail points to the left (towards lower values).
A normal distribution is symmetrical (no skew).
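One common way to put a number on skew is the moment-based coefficient: the mean cubed deviation divided by the standard deviation cubed. The sketch below assumes that definition (other definitions exist), and the helper name `skewness` and the data sets are made up for illustration.

```python
import statistics

def skewness(values):
    """Moment-based skewness: mean cubed deviation / (population SD cubed)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((x - mean) ** 3 for x in values) / (len(values) * sd ** 3)

right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]   # long tail towards high values
left_tailed = [-x for x in right_tailed]   # mirror image: tail towards low values
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed))  # positive
print(skewness(left_tailed))   # negative
print(skewness(symmetric))     # zero
```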
Kurtosis
Positive kurtosis
Leptokurtic: tall, sharp centre with heavy tails
Negative kurtosis
Platykurtic: low, flat centre with light tails
Normal distribution
Mesokurtic: normal bell curve
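Kurtosis is often reported as "excess kurtosis": the mean fourth-power deviation divided by the squared variance, minus 3, so that a normal distribution scores 0. The sketch below assumes that definition. The helper name `excess_kurtosis` and both data sets are made up for illustration.

```python
def excess_kurtosis(values):
    """Excess kurtosis: m4 / m2**2 - 3 (about 0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # population variance
    m4 = sum((x - mean) ** 4 for x in values) / n  # mean fourth-power deviation
    return m4 / m2 ** 2 - 3

peaked = [-5, 0, 0, 0, 0, 0, 0, 0, 0, 5]  # most scores at the centre, two far out
flat = [1, 2, 3, 4, 5]                    # scores spread evenly

print(excess_kurtosis(peaked))  # positive: leptokurtic
print(excess_kurtosis(flat))    # negative: platykurtic
```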
Normal distribution
Symmetrical bell curve where the mean, median and mode are equal (approximately equal in real data).
Variability
How spread out a set of data is.
Range
The range of a variable is the biggest value minus the smallest value. Vulnerable to extreme scores.
Interquartile range
The interquartile range (IQR) is like the range, but instead of the difference between the biggest and smallest values, it is the difference between the 75th percentile and the 25th percentile. It is less affected by extreme scores than the range. Used a lot.
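A small sketch comparing the range and the IQR on made-up data, with and without one extreme score. It uses `statistics.quantiles` with its default method. Other software may use a slightly different quartile convention and give slightly different values. The helper names `value_range` and `iqr` are illustrative.

```python
import statistics

def value_range(values):
    """Biggest value minus smallest value."""
    return max(values) - min(values)

def iqr(values):
    """75th percentile minus 25th percentile (default quantile method)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

typical = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]  # one extreme score

print(value_range(typical), iqr(typical))
print(value_range(with_outlier), iqr(with_outlier))
```

The extreme score changes the range a great deal but leaves the IQR unchanged.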
Mean absolute deviation
Mean absolute deviation is the mean of all of the absolute deviation scores of a data set. An absolute deviation is the absolute value (the size, ignoring the sign) of the difference between a score and the mean. Used sometimes.
Variance
The variance is the mean of the squared deviation scores, where a deviation is the difference between a score and the mean. Not used much on its own because it is in squared units.
Standard deviation
The square root of the variance. Used the most.
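The mean absolute deviation, variance and standard deviation are all built from the deviations of each score from the mean. The sketch below works through them on made-up data. It uses the population versions (dividing by the number of values). The sample versions divide by one less than the number of values and give slightly larger results.

```python
import math

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up example data
n = len(scores)
mean = sum(scores) / n

deviations = [x - mean for x in scores]

# Mean absolute deviation: mean of the deviations with their signs dropped.
mean_abs_dev = sum(abs(d) for d in deviations) / n

# Variance: mean of the squared deviations.
variance = sum(d ** 2 for d in deviations) / n

# Standard deviation: square root of the variance (back in the original units).
std_dev = math.sqrt(variance)

print(mean_abs_dev, variance, std_dev)  # 1.5 4.0 2.0
```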
For normally distributed data, you should expect about 68% of the data to fall within 1 standard deviation of the mean, about 95% within 2 standard deviations of the mean, and about 99.7% within 3 standard deviations of the mean.
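These percentages can be checked against a theoretical normal distribution using `statistics.NormalDist`. This is a sketch. The helper name `share_within` is made up, and the mean of 0 and standard deviation of 1 are arbitrary defaults, since the shares are the same for any normal distribution.

```python
from statistics import NormalDist

def share_within(k, mean=0.0, sd=1.0):
    """Share of a normal distribution lying within k SDs of the mean."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```

The figures only apply when the data are roughly normal. Heavily skewed data can depart from them noticeably.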
Standard score
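(raw score minus mean) divided by (standard deviation). Also called a z-score: it says how many standard deviations a score is above or below the mean.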
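A minimal sketch of the standard score calculation. The function name `standard_score` and the example numbers (mean 5, standard deviation 2) are made up for illustration.

```python
def standard_score(raw, mean, sd):
    """Number of standard deviations a raw score lies from the mean."""
    return (raw - mean) / sd

# Hypothetical variable with mean 5 and standard deviation 2.
print(standard_score(9, mean=5, sd=2))  # 2.0: two SDs above the mean
print(standard_score(4, mean=5, sd=2))  # -0.5: half an SD below the mean
```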