lecture 2 - descriptive (summary) statistics Flashcards
what is central tendency?
An average; the easiest way to summarise data, i.e. it is where most of the scores are.
The most common measure of central tendency is the mean.
mean
mean = sum of all scores / number of scores
x̄ = ∑X / N
∑X means ‘add up’ (total, or sum) all of X
Use x̄ to signify the mean of a sample
E.g. the mean height of people in this class
Use µ to signify the mean of a whole population
E.g. the mean height of everybody in Cardiff uni.
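A minimal sketch of x̄ = ∑X / N in Python. The heights are made-up values for illustration, not data from the lecture.

```python
from statistics import mean

# Hypothetical sample: heights in cm (illustrative values only)
heights = [160, 172, 168, 181, 169]

# x̄ = ∑X / N: add up all the scores, then divide by the number of scores
x_bar = sum(heights) / len(heights)

# The standard library's mean() should agree with the hand calculation
assert x_bar == mean(heights)
print(x_bar)  # 170.0
```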
calculating means
The mean is unique and very sensitive to the data: change any score and the mean changes too.
The mean only makes sense with interval or ratio measurement, as adding things up needs equal intervals.
The arithmetic mean doesn’t need to be a value on the scale the data were taken from (e.g. the mean of whole-number scores can be a fraction).
The mean doesn’t fully describe the data
mode
Most frequent score. Not unique: there could be 2 or more scores with the same, highest frequency (called multimodal data). The mode is not very sensitive to changes in data.
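A small sketch of multimodal data, using `statistics.multimode` from the Python standard library (Python 3.8+). The scores are illustrative, not from the lecture.

```python
from statistics import multimode

# Illustrative scores: 2 and 3 both occur twice, more often than any other score
scores = [1, 2, 2, 3, 3, 4]

# multimode() returns every score tied for the highest frequency
modes = multimode(scores)
print(modes)  # [2, 3]

# Changing a score that is not a mode leaves the modes unchanged
changed = [1, 2, 2, 3, 3, 9]
print(multimode(changed))  # [2, 3]
```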
median
Middle score: the score above and below which 50% of the data points lie. It is at the (N + 1)/2 position (when scores are arranged in order); when N is even, this falls halfway between the two middle scores. The median is unique but not very sensitive to changes in data.
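A sketch of the (N + 1)/2 position rule. The function name `median_by_position` is made up for this example, and it assumes a non-empty list of numbers.

```python
import math

def median_by_position(scores):
    """Median via the (N + 1)/2 position rule (hypothetical helper)."""
    ordered = sorted(scores)              # arrange the scores in order
    position = (len(ordered) + 1) / 2     # 1-based position of the median
    # If the position is a whole number, lower and upper are the same score;
    # otherwise the median is halfway between the two neighbouring scores.
    lower = ordered[math.floor(position) - 1]
    upper = ordered[math.ceil(position) - 1]
    return (lower + upper) / 2

print(median_by_position([3, 1, 2]))      # position 2   -> 2.0
print(median_by_position([1, 2, 3, 4]))   # position 2.5 -> 2.5
```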
central tendency and measurement
Because of its link to inferential statistics and the normal distribution (more later in the course), the mean is the most common measure of central tendency for interval & ratio scales.
But the median & mode are also fine, depending on what information is being conveyed.
With ordinal scales the mean can’t be used, so the median is most common (but the mode can also be used).
With nominal scales neither the median nor the mean can be used, so the mode is the most common measure of central tendency.
variability
The degree of ‘spread’ about an average
Mode - no associated measure of variability
Median - interquartile range
Mean - variance and standard deviation
interquartile range
The difference between the 1st quartile (the score that has 25% of data below it and 75% above) and the 3rd quartile (the score that has 75% of data below it and 25% above), so it spans the middle 50% of the data.
* For small samples it can be easier to take the median and then find the middle of the lower & upper halves.
* Order data (low to high)
7,1,2,6,3,4,6,3,4,5,1,8 becomes 1,1,2,3,3,4,4,5,6,6,7,8
* Divide data into two groups using median (“median split”)
1,1,2,3,3,4,4,5,6,6,7,8 becomes 1,1,2,3,3,4 and 4,5,6,6,7,8
* Find median of lower-rank group: 1st quartile (Q1)
Median of 1,1,2,3,3,4 is 2.5
* Find median of high-rank group: 3rd quartile (Q3)
Median of 4,5,6,6,7,8 is 6
* Interquartile range = Q3 – Q1
IQR is 6 – 2.5 = 3.5
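The median-split steps above can be sketched in Python as follows. The function name `iqr_median_split` is made up for this example. It assumes at least two scores, and for an odd N it leaves the middle score out of both halves, which is only one of several conventions. Other quartile methods can give slightly different values.

```python
from statistics import median

def iqr_median_split(scores):
    """Return (Q1, Q3, IQR) using the median-split method (hypothetical helper)."""
    ordered = sorted(scores)      # order data, low to high
    half = len(ordered) // 2
    lower = ordered[:half]        # lower-rank group
    upper = ordered[-half:]       # higher-rank group (middle score dropped if N is odd)
    q1 = median(lower)            # 1st quartile
    q3 = median(upper)            # 3rd quartile
    return q1, q3, q3 - q1

q1, q3, iqr = iqr_median_split([7, 1, 2, 6, 3, 4, 6, 3, 4, 5, 1, 8])
print(q1, q3, iqr)  # 2.5 6.0 3.5
```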
variance
The average squared deviation.
The variance is the average squared distance of scores from the mean. It is the sum of squares divided by the number of scores. It tells us how widely dispersed scores are around the mean.
On average, how far away from the mean is each score?
Two steps-
1 - find out how far away each score is from the mean, i.e. how deviant is the score?
2 - what is the average deviation?
* Finding the average deviation works the same as finding any other average.
* Add up all the deviation scores and divide by the number of scores
* Problem: the sum of any set of deviation scores (X - x̄) is ZERO, because positive and negative deviations cancel out
* Solution: square the deviation scores so that none are negative
* The average squared deviation is the variance
* Divide the sum of your squared deviation scores by the total number of scores
So, on average, the scores are spread out 5.06 squared therapy sessions around the mean
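A sketch of the two steps in Python. The scores below are made up for illustration; they are not the therapy-session data behind the 5.06 figure.

```python
from statistics import pvariance

# Illustrative scores (hypothetical, not the lecture's data)
scores = [2, 4, 4, 6, 9]
n = len(scores)
x_bar = sum(scores) / n                      # mean = 5.0

# Step 1: how far is each score from the mean?
deviations = [x - x_bar for x in scores]     # [-3.0, -1.0, -1.0, 1.0, 4.0]
print(sum(deviations))                       # 0.0, so a plain average is no use

# Step 2: square the deviations, then average them (divide by N)
sum_of_squares = sum(d ** 2 for d in deviations)   # 28.0
variance = sum_of_squares / n

# pvariance() also divides by N, so it should agree
assert abs(variance - pvariance(scores)) < 1e-12
print(variance)
```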
standard deviation
The standard deviation is the square root of the variance. It is the variance converted back to the original units of measurement of the scores used to compute it. Large standard deviations relative to the mean suggest data are widely spread around the mean, whereas small standard deviations suggest data are closely packed around the mean.
* Need to get the scores back into the original units
* Take the square root of the variance
* The square root of the variance is called the standard deviation
* The standard deviation is a measure of the average deviation of the scores from the mean, in the original units of the scale
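A sketch of getting back to the original units by taking the square root of the variance. The scores are made up for illustration, and the variance here divides by N.

```python
import math
from statistics import pstdev

# Illustrative scores (hypothetical values)
scores = [2, 4, 4, 4, 5, 5, 7, 9]
x_bar = sum(scores) / len(scores)                             # 5.0

# Variance: average squared deviation, in squared units
variance = sum((x - x_bar) ** 2 for x in scores) / len(scores)  # 4.0

# Standard deviation: square root of the variance, in the original units
sd = math.sqrt(variance)
print(sd)  # 2.0

# pstdev() uses the same N divisor, so it should agree
assert abs(sd - pstdev(scores)) < 1e-12
```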
estimating the population SD from a sample
The sample mean is a good estimate of the population mean.
But the SD taken from a sample using this formula (dividing by N) is not the best estimate of the population SD.
Let’s look at an example:
Population is 1, 2, 3
Population mean = 2, Population SD = 0.816
Take 3 samples of size 1 from the population: 1, 2 and 3.
Sample means are 1, 2 and 3, so the average of the sample means is 2, the same as the population mean.
Sample SDs are all 0. This underestimates the population SD!
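The example above can be checked in Python. `statistics.pstdev` divides by N, so it is used here both for the population and for each sample.

```python
from statistics import mean, pstdev

population = [1, 2, 3]
print(mean(population))               # 2
print(round(pstdev(population), 3))   # 0.816

# Three samples of size 1, as in the example
samples = [[1], [2], [3]]

sample_means = [mean(s) for s in samples]
print(mean(sample_means))             # 2, same as the population mean

# The N-divisor SD of a single score is always 0
sample_sds = [pstdev(s) for s in samples]
print(sample_sds)
```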
population variance
σ²ₓ = ∑(X - µ)² / N
where
X are the data
µ is the population mean
N is the number of data points that make up the population
sample variance
s²ₓ = ∑(X - x̄)² / (N - 1)
where
X are the data
x̄ is the sample mean
N is the number of data points that make up the sample
Divide by N - 1 to get the sample variance, as this gives an unbiased estimate of the population variance.
∑(X - x̄)² is sometimes known as the sum of squares (SS)
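A sketch of why dividing by N - 1 gives an unbiased estimate. It reuses the population 1, 2, 3 and, as an assumption made for this illustration, takes every possible sample of size 2 drawn with replacement. On average the N divisor comes out too small, while the N - 1 divisor matches the population variance.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [1, 2, 3]
pop_var = pvariance(population)        # ∑(X - µ)² / N = 2/3

# Assumption for this sketch: all 9 ordered samples of size 2, with replacement
samples = list(product(population, repeat=2))

# Average sample variance when dividing by N (pvariance)
avg_div_n = mean([pvariance(s) for s in samples])

# Average sample variance when dividing by N - 1 (variance)
avg_div_n_minus_1 = mean([variance(s) for s in samples])

print(pop_var, avg_div_n, avg_div_n_minus_1)
```

In this setup the N divisor averages to half the population variance, and the N - 1 divisor averages to the population variance itself.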
standard deviation - population and sample
- Variance has ‘squared’ units
- if data are heights in metres, variance is in metres-squared (i.e. it is an area, not a height!)
- So, take the square root
- new measure of variability now has the same units as the original data
The square root of the variance is the Standard Deviation
population standard deviation
σₓ = √(σ²ₓ)