Measures of centrality, spread/dispersion, correlation Flashcards
What are the measurement levels?
- Nominal / ordinal (categorical)
- Interval / ratio (continuous and numerical)
What is a measure of centre?
The point around which most of the data is concentrated.
What are the ways to measure centrality?
Mode
–> nominal
Mean
–> interval/ratio (continuous) –> no outliers
Median
–> interval/ratio (continuous) –> with outliers
–> ordinal
How do you find the mode?
Count how many times each value appears in the data set and choose the one that occurs the most.
Strawberry: 15 = the mode
Chocolate: 10
Pistachio: 3
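A minimal sketch of the counting step, assuming the three counts above come from a list of individual answers (the list `flavours` is made up to match those counts):

```python
from collections import Counter

# Hypothetical raw answers, built to match the counts on the card.
flavours = ["Strawberry"] * 15 + ["Chocolate"] * 10 + ["Pistachio"] * 3

# Count how many times each value appears.
counts = Counter(flavours)

# The mode is the value with the highest count.
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # Strawberry 15
```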
How do you find the mean?
What is the formula?
• Add up all the values and divide the sum by the number of values.
x̄ (or μ) = ∑x / n
x̄ (x with a bar on top) or μ = the mean
∑ = sum up
x = a value
∑x = sum of all the values
n = the number of values
Conclusion:
Mean = sum of values/ number of values
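A small worked example of the same calculation, using made-up values that are not from the cards:

```python
# Made-up example values (an assumption for illustration only).
values = [4, 8, 6, 5, 7]

total = sum(values)  # sum of all the values: 30
n = len(values)      # the number of values: 5
mean = total / n     # 30 / 5

print(mean)  # 6.0
```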
How do you find the median?
FACT: the median is the middle value
Odd number of values (e.g. 9)
- order all values from low to high
- count the number of values
- divide the number of values by 2 and round up
- count from the beginning to the position found in step 3; that is the middle value
Even number of values (e.g. 8)
FACT: with an even number of values, the median is the mean (average) of the middle two values.
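The two cases can be sketched as one small function. The function name `median` and both example lists (one with 9 values, one with 8) are made up for illustration:

```python
import math

def median(values):
    """Return the middle value of a list of numbers."""
    ordered = sorted(values)  # order all values from low to high
    n = len(ordered)          # count the number of values
    if n % 2 == 1:
        # 9 values: 9 / 2 = 4.5, rounded up = 5, so take the 5th value.
        position = math.ceil(n / 2)
        return ordered[position - 1]  # Python lists start counting at 0
    # 8 values: take the mean of the 4th and 5th values.
    upper = n // 2
    return (ordered[upper - 1] + ordered[upper]) / 2

nine_values = [7, 1, 9, 3, 5, 8, 2, 6, 4]
eight_values = [7, 1, 3, 5, 8, 2, 6, 4]

print(median(nine_values))   # 5
print(median(eight_values))  # 4.5
```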
What is spread (dispersion)?
Spread designates how much data values differ from each other and from the measure of centre.
Short:
How much values differ:
• from each other
• from the measure of centre (mode, mean, median)
What are the 4 measures of spread?
- Range (and interquartile range)
- Mean absolute deviation (MAD)
- Variance
- Standard deviation
What is range?
FACT: range is highly influenced by outliers
Range is the difference between the highest value and the lowest value.
Range = maximum - minimum
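A quick sketch of how a single outlier changes the range. The values and the helper name `value_range` are made up for illustration:

```python
def value_range(data):
    """Difference between the highest and the lowest value."""
    return max(data) - min(data)

# Made-up example values.
values = [2, 3, 3, 4, 5]
with_outlier = values + [40]  # same data plus one extreme value

print(value_range(values))        # 3
print(value_range(with_outlier))  # 38
```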
What are quartiles?
What is the Interquartile range?
FACT: the IQR is a good measure to use when there are outliers.
Quartiles =
The values that split the ordered data into four equal parts (25% of the data per part).
If a quartile falls between two values (e.g. with an even number of values):
Take the mean of those two values.
Interquartile range =
The range of the middle 50% of the data.
IQR = Q3 - Q1
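One possible way to compute this is sketched below. It assumes the "median of each half" convention for quartiles; textbooks and software use several slightly different conventions, so other tools may give slightly different quartiles. The helper name `quartiles` and the data are made up:

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # lower half of the data
    upper = ordered[half + (n % 2):]   # upper half (middle value skipped if n is odd)
    return median(lower), median(ordered), median(upper)

# Made-up example values.
data = [1, 2, 3, 4, 5, 6, 7, 8]

q1, q2, q3 = quartiles(data)
iqr = q3 - q1

print(q1, q2, q3)  # 2.5 4.5 6.5
print(iqr)         # 4.0
```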
What is the Mean Absolute Deviation?
What are the disadvantages of the MAD?
The average of the absolute deviations from the mean.
(deviation = difference between the value and the mean)
Disadvantages =
- Mathematically difficult to optimise
- Not enough emphasis on extreme values
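A short worked example of the MAD, using made-up values:

```python
# Made-up example values.
values = [2, 4, 6, 8, 10]

mean = sum(values) / len(values)  # 6.0

# Absolute deviation = distance between each value and the mean, sign ignored.
absolute_deviations = [abs(x - mean) for x in values]  # [4.0, 2.0, 0.0, 2.0, 4.0]

# MAD = the average of those absolute deviations.
mad = sum(absolute_deviations) / len(values)

print(mad)  # 2.4
```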
What is variance?
What are the disadvantages of variance?
What do you use for the calculation?
The measure of spread based on squared deviations.
The variance is the mean squared deviation of the values from the mean.
–> When the variance is high, the spread is also high
(because the formula is built on the differences between each value and the mean)
Disadvantages =
• The unit of the variance is the squared unit of the variable, which makes it harder to interpret
Calculation:
For the variance we use the mean as the measure of centre, because the mean includes every value of the data.
- calculate the mean
- subtract the mean from each value (value - mean)
- square the differences
- add up the squared differences –> SS (sum of squares)
- divide SS by the number of values/observations (n) –> variance
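The steps above can be sketched as follows, using made-up values. Dividing by n, as on the card, gives what Python's `statistics.pvariance` calls the population variance, so that function is used here only as a cross-check:

```python
from statistics import pvariance

# Made-up example values.
values = [2, 4, 6, 8, 10]
n = len(values)

mean = sum(values) / n                    # mean: 6.0
differences = [x - mean for x in values]  # value - mean
squared = [d ** 2 for d in differences]   # square the differences
ss = sum(squared)                         # SS (sum of squares): 40.0
variance = ss / n                         # SS / number of values

print(variance)           # 8.0
print(pvariance(values))  # same result from the standard library
```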
FACT: standard deviation is the STANDARD measure of spread.
Why?
- It is mathematically nice to work with squared differences in optimisation
- Squared differences give more emphasis to extreme values
- Easy to interpret because the unit of the standard deviation is the same as the unit of the original variable
How do you calculate the Standard Deviation?
All the steps from the variance:
- calculate the mean
- subtract the mean from each value (value - mean)
- square the differences
- add up the squared differences –> SS (sum of squares)
- divide SS by the number of values/observations (n) –> variance
+
- take the square root of the variance
Standard deviation = √variance = √(SS/n)
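A minimal worked example of the full calculation, with made-up values chosen so the numbers come out round. `statistics.pstdev` (which also divides by n) is used only to cross-check the result:

```python
import math
from statistics import pstdev

# Made-up example values.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)

mean = sum(values) / n                    # 5.0
ss = sum((x - mean) ** 2 for x in values) # sum of squares: 32.0
variance = ss / n                         # 4.0
std_dev = math.sqrt(variance)             # root of the variance

print(std_dev)         # 2.0
print(pstdev(values))  # same result from the standard library
```

The result is in the same unit as the values themselves, unlike the variance.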