Exploring Data - Topic 3: Numerical Summaries Flashcards
What are the major features that can be summarised numerically?
Minimum
Maximum
Centre (mean, median)
Spread (standard deviation, range, IQR)
What are numerical summaries? Advantages and disadvantages?
Numerical summaries are a collection of measures that try to describe as much as possible about the data set in as few numbers as possible.
An advantage is that they allow for easy communication and comparison.
A disadvantage is that a lot of information about the data is lost in the process.
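As a rough illustration of "as few numbers as possible", the sketch below condenses a small data set into a handful of summaries using Python's standard statistics module. The function name summarise and the sample values are made up for illustration.

```python
import statistics

def summarise(data):
    """Condense a data set into a few numerical summaries."""
    return {
        "minimum": min(data),
        "maximum": max(data),
        "range": max(data) - min(data),      # a simple measure of spread
        "mean": statistics.mean(data),       # a measure of centre
        "median": statistics.median(data),   # another measure of centre
    }

sample = [4, 8, 15, 16, 23, 42]  # hypothetical data
summary = summarise(sample)
print(summary)
```

Six values are reduced to five summaries here, so the saving only becomes obvious on larger data sets; the shape of the data (for example, gaps or clusters) is not visible in the output.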
What is the mean?
It is the average of ALL the data. It is the point where the data is balanced.
I.e. the gaps of the higher readings above the mean and the gaps of the lower readings below it cancel each other out.
How do you calculate mean?
Mean = Sum of the data / size of the data
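A minimal sketch of this formula in Python, assuming the data is a non-empty list of numbers. The function name mean and the sample values are illustrative only.

```python
def mean(data):
    """Return the sum of the data divided by the size of the data."""
    if len(data) == 0:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(data) / len(data)

readings = [2, 4, 6, 8]  # hypothetical data
print(mean(readings))    # (2 + 4 + 6 + 8) / 4 = 5.0
```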
What is the median?
It is the middle datapoint when the data is ordered from smallest to largest
How do you calculate median?
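For odd sized data: It is the middle number, i.e. the value at position (n+1)/2 in the ordered data
For even sized data: It is the average of the 2 middle points, i.e. the values at positions n/2 and n/2 + 1 in the ordered data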
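The two cases can be sketched in Python as follows. This is one possible implementation, not the only one; the function name median is illustrative, and note that Python lists are indexed from 0, so position (n+1)/2 corresponds to index n // 2.

```python
def median(data):
    """Return the middle value of the data once it is ordered."""
    ordered = sorted(data)  # order from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd size: single middle value (position (n+1)/2, index n // 2)
        return ordered[mid]
    # Even size: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 5]))     # ordered: 1, 5, 7 -> 5
print(median([7, 1, 5, 3]))  # ordered: 1, 3, 5, 7 -> (3 + 5) / 2 = 4.0
```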
What situations would a median be preferred over a mean?
In skewed datasets or datasets with outliers. This is because the mean uses every value, so a few extreme values can pull it well away from where most of the data lies.
The median depends only on the middle of the ordered data, so it gives a better measure of the centre when outliers are present.
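A small made-up example of this effect, using Python's standard statistics module. The values are hypothetical; the point is only how each summary reacts when one extreme value is added.

```python
import statistics

typical = [30, 32, 35, 38, 40]   # hypothetical data, no outliers
with_outlier = typical + [400]   # same data plus one extreme value

mean_before = statistics.mean(typical)
median_before = statistics.median(typical)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)  # both 35
print(mean_after, median_after)    # mean jumps to about 95.8, median is 36.5
```

The single outlier moves the mean by about 60 units but moves the median by only 1.5.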
What does robustness/robust mean? Give an example
A summary is robust when it is not strongly affected by outliers or extreme values, so it changes little if a few unusual values are added to the data.
I.e. the median is said to be robust, which makes it a good summary of centre for skewed data
What is a left skewed dataset?
Also known as a negatively skewed dataset, the tail is headed towards the negative side.
Here, we expect the mean to be smaller than the median
What is a right skewed dataset?
Also known as a positively skewed dataset, the tail is headed towards the positive side
Here, we expect the median to be smaller than the mean
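The expected ordering of mean and median can be checked on two small made-up datasets. The values are hypothetical and chosen so that each has one long tail; this illustrates the usual pattern rather than proving it for every skewed dataset.

```python
import statistics

# Hypothetical data with a long tail towards large values (right skewed)
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
# Hypothetical data with a long tail towards small values (left skewed)
left_skewed = [1, 17, 18, 18, 19, 19, 19, 20]

for name, data in [("right", right_skewed), ("left", left_skewed)]:
    print(name, statistics.mean(data), statistics.median(data))

# right: mean 4.75, median 3.0   (median smaller than mean)
# left:  mean 16.375, median 18.5 (mean smaller than median)
```

In both cases the mean is pulled towards the tail, while the median stays near the bulk of the data.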
Which measure of central tendency is best for describing the centre of a skewed distribution?
Median
Mean for a symmetrical distribution
Can there be cases where neither the mean nor the median is helpful in describing the centre?
Yes. This typically occurs with bimodal data
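A made-up bimodal example: the data below forms two separate clusters, and both the mean and the median land in the empty gap between them. The values are hypothetical, and statistics.multimode requires Python 3.8 or later.

```python
import statistics

# Hypothetical bimodal data: one cluster near 2, another near 18
bimodal = [1, 2, 2, 3, 17, 18, 18, 19]

centre_mean = statistics.mean(bimodal)      # 80 / 8 = 10
centre_median = statistics.median(bimodal)  # (3 + 17) / 2 = 10.0
peaks = statistics.multimode(bimodal)       # most frequent values: [2, 18]

print(centre_mean, centre_median, peaks)
```

No observation is anywhere near 10, so reporting either summary as "the centre" would be misleading here; describing the two peaks separately is more informative.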
What are limitations of the mean?
It is sensitive to outliers, which can pull it away from where most of the data lies
Potential for misinterpretation
Market could have changed
What is a limitation of the median?
It does not use the size of every data point, only the middle value(s) of the ordered data
What is the wrong way to measure spread?
We could calculate the gap between each data value and the mean, and then average these gaps. However, this average will always be zero: the mean is the balancing point of the data, so the positive gaps (values above the mean) and the negative gaps (values below the mean) cancel out exactly.
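A quick check of this on made-up numbers. The data values and variable names are hypothetical; with other data, floating-point rounding may give a result that is extremely close to zero rather than exactly zero.

```python
data = [2, 4, 9]  # hypothetical data

mean = sum(data) / len(data)        # 15 / 3 = 5.0
gaps = [x - mean for x in data]     # [-3.0, -1.0, 4.0]
mean_gap = sum(gaps) / len(gaps)    # (-3 - 1 + 4) / 3 = 0.0

print(gaps, mean_gap)
```

The gaps themselves are not zero, so the data clearly has spread, yet the average gap reports none. That is why this approach fails as a measure of spread.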