Exploring Data - Topic 3: Numerical Summaries Flashcards
What are the major features that can be summarised numerically?
Minimum
Maximum
Centre (mean, median)
Spread (standard deviation, range, IQR)
What are numerical summaries? Advantages and disadvantages?
Numerical summaries are a collection of measures that try to describe as much as possible about the data set in as few numbers as possible.
An advantage is that they allow for easy communication and comparison
A disadvantage is that a lot of information about the data is lost in the process
What is the mean?
It is the average of ALL the data. It is the point where the data is balanced
I.e. the higher readings and lower readings all cancel each other out
How do you calculate mean?
Mean = Sum of the data / size of the data
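The formula can be sketched in a few lines of Python. The values in `data` are hypothetical and chosen only for illustration; `statistics.mean` from the standard library is used as a cross-check.

```python
import statistics

# Hypothetical data values, for illustration only.
data = [2, 4, 4, 5, 10]

# Mean = sum of the data / size of the data.
mean = sum(data) / len(data)
print(mean)  # 5.0

# Cross-check against the standard library.
print(mean == statistics.mean(data))  # True
```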
What is the median?
It is the middle datapoint when the data is ordered from smallest to largest
How do you calculate median?
For odd sized data: It is the middle number, i.e. the value at position (n+1)/2 in the ordered data
For even sized data: It is the average of the 2 middle points
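A minimal sketch of the odd and even cases, assuming a non-empty list of numbers. The function name `median` and the example values are illustrative choices, not part of the original cards.

```python
def median(values):
    """Return the middle value of the data once it is ordered."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd size: position (n + 1) / 2 when counting from 1,
        # which is index n // 2 when counting from 0.
        return ordered[mid]
    # Even size: average of the 2 middle points.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 3]))     # 3
print(median([7, 1, 3, 5]))  # 4.0
```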
What situations would a median be preferred over a mean?
For datasets that are skewed or contain outliers. The mean is easily pulled towards extreme values, so it can end up far from where most of the data lies.
The median is barely affected by outliers, so in these cases it provides a better measure of the centre than the mean
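To see the effect of a single outlier, here is a small sketch using made-up prices (in thousands of dollars). The numbers are hypothetical; only the comparison between the two summaries matters.

```python
import statistics

# Hypothetical prices in $1000s, for illustration only.
prices = [500, 520, 540, 560, 580]
with_outlier = prices + [5000]  # add one extreme sale

for label, values in (("without outlier", prices), ("with outlier", with_outlier)):
    mean = statistics.mean(values)
    median = statistics.median(values)
    print(f"{label}: mean = {mean:.1f}, median = {median:.1f}")
# without outlier: mean = 540.0, median = 540.0
# with outlier: mean = 1283.3, median = 550.0
```

The single extreme value more than doubles the mean, while the median moves by only 10.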
What does robustness/robust mean? Give an example
A summary is robust when it is not strongly affected by outliers, which makes it a good summary for skewed data
I.e. the median is said to be robust, whereas the mean is not
What is a left skewed dataset?
Also known as a negatively skewed dataset, the long tail extends towards the negative (left) side
Here, we expect the mean to be smaller than the median
What is a right skewed dataset?
Also known as a positively skewed dataset, the long tail extends towards the positive (right) side
Here, we expect the median to be smaller than the mean
Which measure of central tendency is best for describing the centre of a skewed distribution?
Median
Mean for a symmetrical distribution
Can there be cases when neither the mean nor the median is helpful in describing the centre?
Yes. This typically occurs with bimodal data, where the values cluster around two separate peaks
What are limitations of the mean?
Sensitive to outliers, which can pull the mean away from the typical value
Potential for misinterpretation
Market could have changed
What is a limitation of the median?
It does not use the values of all the data points, only the middle one(s)
What is the wrong way to measure spread?
We could calculate the gap between each data value and the mean, and then average these gaps. However, this average is always zero: the mean is the balancing point of the data, so the positive and negative gaps cancel out
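A quick check of why this fails, using hypothetical data values chosen only for illustration:

```python
# Hypothetical data values, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)      # 5.0
gaps = [x - mean for x in data]   # gap between each value and the mean
mean_gap = sum(gaps) / len(gaps)  # average of the gaps

print(gaps)      # [-3.0, -1.0, -1.0, -1.0, 0.0, 0.0, 2.0, 4.0]
print(mean_gap)  # 0.0
```

The negative gaps (total -6) exactly offset the positive gaps (total +6), so the average gap says nothing about spread.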
Thus, what is the new, better way to measure spread?
A better alternative to averaging the gaps (data minus mean) is the standard deviation, which is the RMS (root mean square) of the gaps
The RMS measures the average size of a set of numbers, regardless of sign
What are the steps to RMS?
Square the numbers (here, each data point minus the mean), take the mean of the results, then take the square root
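The three steps can be sketched as a small function. The name `rms` and the data values are illustrative assumptions; applying it to the gaps from the mean gives the (population) standard deviation.

```python
import math


def rms(numbers):
    """Root mean square: square, then mean, then root."""
    squares = [x ** 2 for x in numbers]        # 1. square
    mean_square = sum(squares) / len(squares)  # 2. mean
    return math.sqrt(mean_square)              # 3. root


# Sign does not matter: -3 and 3 have the same size.
print(rms([-3, 3]))  # 3.0

# Hypothetical data values, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)
gaps = [x - mean for x in data]
print(rms(gaps))  # 2.0
```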
What does the population standard deviation measure, and what is the equation for it?
The spread of a whole population about its mean, where N is the population size
Root ((Sum of (x - average)^2) / N)
What does the sample standard deviation measure, and what is the equation for it?
The spread of a sample about its mean, used as an estimate of the population s.d., where N is the sample size
Root ((Sum of (x - average)^2) / (N-1))
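The two formulas differ only in the divisor. A minimal sketch with hypothetical data values, cross-checked against `statistics.pstdev` (population) and `statistics.stdev` (sample) from the Python standard library:

```python
import math
import statistics

# Hypothetical data values, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n
sum_of_squares = sum((x - mean) ** 2 for x in data)  # 32.0

population_sd = math.sqrt(sum_of_squares / n)    # divide by N
sample_sd = math.sqrt(sum_of_squares / (n - 1))  # divide by N - 1

print(population_sd)        # 2.0
print(round(sample_sd, 3))  # 2.138

# Cross-check against the standard library.
print(math.isclose(population_sd, statistics.pstdev(data)))  # True
print(math.isclose(sample_sd, statistics.stdev(data)))       # True
```

The sample s.d. is always slightly larger than the population s.d. for the same numbers, because it divides by a smaller number.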
How do you know when to use the sample s.d. and when to use the population s.d.?
Population SD is used when you have every data point in the population of interest; sample SD is used when you only have some of them (with reference to the research question)
It is very much dependent on what the research is trying to achieve
For example:
Assume the ‘data’ is Newtown property prices during April-June 2017
If we are only looking at ‘Newtown property prices during April-June 2017’, then the ‘data’ is the whole population
If we want to look at property prices in general, then the ‘data’ is considered a sample
For roughly normal (bell-shaped) data, what % of the data lies within 1, 2 and 3 s.d. of the mean?
68% of data –> within 1 s.d. of the mean
95% of data –> within 2 s.d. of the mean
99.7% of data –> within 3 s.d. of the mean
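These percentages come from the normal (bell-shaped) curve, so they are only approximate for real data and can be far off for strongly skewed data. As a sketch, the figures can be reproduced with `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, s.d. 1.
standard_normal = NormalDist(mu=0, sigma=1)

shares = {}
for k in (1, 2, 3):
    # Area under the curve between -k and +k s.d. from the mean.
    shares[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} s.d.: {shares[k]:.1%}")
# within 1 s.d.: 68.3%
# within 2 s.d.: 95.4%
# within 3 s.d.: 99.7%
```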
What are standard units?
They are also known as z scores, and measure how many standard deviations a data point is above or below the mean
To compare 2 data points, we can compare them in standard units
How is a standard unit (z score) calculated?
(Data point - mean) / SD
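A small sketch of using standard units to compare 2 data points. The function name `z_score` and the exam marks, means and SDs are all hypothetical.

```python
def z_score(value, mean, sd):
    """How many standard deviations `value` is above (+) or below (-) the mean."""
    return (value - mean) / sd


# Hypothetical marks from two different tests.
maths = z_score(80, mean=70, sd=5)      # 10 marks above a mean with SD 5
english = z_score(85, mean=75, sd=10)   # 10 marks above a mean with SD 10

print(maths)    # 2.0
print(english)  # 1.0
```

Both marks are 10 above their mean, but in standard units the maths mark is further above its mean than the English mark.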
What is the Interquartile range (IQR)?
It is the range of the middle 50% of the data. More formally, IQR = Q3 - Q1, where Q1 is the 25th percentile and Q3 is the 75th percentile
What are quartiles?
Quartiles divide the data into quarters
How is an outlier defined mathematically?
A data point is an outlier if it lies below the lower threshold or above the upper threshold. The thresholds sit a distance of 1.5IQR below Q1 and above Q3
Lower threshold = Q1 - 1.5IQR
Upper threshold = Q3 + 1.5IQR
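A sketch of the thresholds in Python, using hypothetical data values. Note that software packages use slightly different conventions for computing quartiles; the choice of `statistics.quantiles` with `method="inclusive"` here is an assumption, and other conventions can give slightly different thresholds.

```python
import statistics

# Hypothetical data values, for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 30]

# Quartiles split the ordered data into quarters (assumed convention: "inclusive").
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_threshold = q1 - 1.5 * iqr
upper_threshold = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_threshold or x > upper_threshold]

print(q1, q3, iqr)                       # 3.0 7.0 4.0
print(lower_threshold, upper_threshold)  # -3.0 13.0
print(outliers)                          # [30]
```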
What are the pairs of measures of central tendency and variability that are reported together?
Mean and standard deviation
Median and IQR
What is the coefficient of variation (CV)?
It is used to compare data dispersion between distinct series of data
What is the formula for the coefficient of variation?
SD / Mean
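A sketch comparing the dispersion of two series on very different scales. The function name, the data values and the use of the sample SD (`statistics.stdev`) are assumptions made for illustration; the cards do not say which SD to use.

```python
import statistics


def coefficient_of_variation(values):
    """CV = SD / mean (sample SD assumed here)."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical series, for illustration only.
series_a = [10, 12, 14, 16, 18]
series_b = [1000, 1020, 1040, 1060, 1080]

cv_a = coefficient_of_variation(series_a)
cv_b = coefficient_of_variation(series_b)

print(f"A: SD = {statistics.stdev(series_a):.2f}, CV = {cv_a:.3f}")  # A: SD = 3.16, CV = 0.226
print(f"B: SD = {statistics.stdev(series_b):.2f}, CV = {cv_b:.3f}")  # B: SD = 31.62, CV = 0.030
```

Series B has the larger SD, but relative to its mean it varies much less than series A, which is what the CV captures.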
What are examples of what CV is used for?
Analytical chemistry
Engineering + physics
Economics for determining volatility of securities