Topic 3: Numerical Summaries Flashcards
What are the main features of numerical summaries?
- Max & min
- Centre (mean, median)
- Spread (standard deviation, range, interquartile range)
Why do we need to pair centre features with spread?
If only a measure of centre is reported, it can lead to misleading interpretations and hasty assumptions about the dataset.
Why do we need numerical summaries?
Numerical summaries reduce all of the data to a single number. Although a lot of information is lost, this makes communication and comparison much easier.
What is mean and how is it calculated?
The mean is the balancing point of the dataset and takes every data point into account. The gaps of the readings above and below the mean cancel each other out.
Mean = sum/size
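A minimal Python sketch of the formula, using made-up readings (the function name and data are illustrative only). It also checks the "balancing point" idea: the gaps from the mean add up to zero.

```python
def mean(values):
    """Mean = sum / size."""
    return sum(values) / len(values)

data = [2, 4, 4, 6, 9]            # made-up readings
m = mean(data)                    # 25 / 5 = 5.0
gaps = [x - m for x in data]      # [-3.0, -1.0, -1.0, 1.0, 4.0]
total_gap = sum(gaps)             # gaps above and below the mean cancel out
```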
What is median and how is it calculated?
The median is the middle point of the ordered dataset; it depends on only the 1 or 2 central points.
If the dataset has an odd number of readings, the median is unique (the middle reading).
If the dataset has an even number of readings, the median can be anywhere between the 2 middle points (by convention, take their average).
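The odd and even cases can be sketched in Python as follows. The function name and readings are made up, and the even case uses the "average of the 2 middle points" convention.

```python
def median(values):
    """Middle value of the sorted data; average of the 2 middle values if even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                      # odd size: unique middle reading
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even size: average

odd_median = median([7, 1, 3])          # sorted: 1, 3, 7
even_median = median([7, 1, 3, 5])      # sorted: 1, 3, 5, 7
```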
When to use mean vs median?
The mean is used for fairly symmetric data.
The median is used for skewed data and for data with outliers.
If the distribution is bimodal, neither one is suitable.
What is standard deviation and how is it calculated?
Standard deviation (SD) measures how spread out the data are around the mean.
SD = RMS of gaps from the mean = sqrt[mean of (gaps from the mean)^2]
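A small Python sketch of the RMS formula, with made-up readings. This is the population SD (divide by the size); the sample SD divides by size - 1 instead, so `statistics.stdev` would give a slightly larger value than `statistics.pstdev`.

```python
import math

def sd_rms(values):
    """Population SD: root-mean-square of the gaps from the mean."""
    m = sum(values) / len(values)
    squared_gaps = [(x - m) ** 2 for x in values]
    return math.sqrt(sum(squared_gaps) / len(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up readings, mean = 5
sd = sd_rms(data)                 # sqrt(32 / 8) = 2.0
```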
What percentage of the data lies within 1 SD, 2 SD, and 3 SD of the mean?
For roughly normal (bell-shaped) data:
1 SD: about 68%
2 SD: about 95%
3 SD: about 99.7%
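These percentages come from the normal curve, so they are only approximate for real data. A quick check in Python, assuming an exactly normal distribution (the helper name is illustrative):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)   # standard normal curve

def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

within_1sd = share_within(1)   # roughly 0.68
within_2sd = share_within(2)   # roughly 0.95
within_3sd = share_within(3)   # roughly 0.997
```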
What is the IQR and how is it calculated?
What does it represent in a boxplot?
The IQR is the interquartile range, i.e. the range of the middle 50% of the data.
IQR = Q3 - Q1 = 75th percentile - 25th percentile
The IQR is the length of the box in a boxplot.
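A short Python sketch with made-up readings. Quartile conventions differ between textbooks and software; this one assumes the "inclusive" method of `statistics.quantiles`, so other tools may give slightly different values of Q1 and Q3.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # made-up readings

# quantiles(..., n=4) returns the 3 cut points Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                         # range of the middle 50%
```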
How does the mean compare to the median in different datasets?
In symmetric data, the mean is close to the median.
In left-skewed data, the smaller data points drag the mean down
-> mean < median
In right-skewed data, the larger data points drag the mean up
-> mean > median
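A small illustration in Python with made-up data: one long right tail, and its mirror image as a left-skewed example.

```python
import statistics

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]           # long tail of high values
left_skewed = [21 - x for x in right_skewed]        # mirror image: long low tail

right_mean = statistics.fmean(right_skewed)         # 4.75, dragged up by 20
right_median = statistics.median(right_skewed)      # 3.0
left_mean = statistics.fmean(left_skewed)           # 16.25, dragged down by 1
left_median = statistics.median(left_skewed)        # 18.0
```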
What is the difference between using (mean, SD) and (median, IQR)?
(Median, IQR) is more robust than (mean, SD): it is barely affected by outliers, so it is better suited to skewed data.
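To see the robustness, the sketch below replaces one made-up reading with an extreme value and recomputes both pairs. It assumes the population SD and the "inclusive" quartile method; the data are invented for illustration.

```python
import statistics

def iqr(values):
    """IQR = Q3 - Q1, using the 'inclusive' quartile convention."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

clean = [10, 11, 12, 13, 14, 15, 16, 17, 18]
with_outlier = [10, 11, 12, 13, 14, 15, 16, 17, 180]   # 18 replaced by 180

mean_shift = statistics.fmean(with_outlier) - statistics.fmean(clean)      # 18.0
sd_ratio = statistics.pstdev(with_outlier) / statistics.pstdev(clean)      # much greater than 1
median_shift = statistics.median(with_outlier) - statistics.median(clean)  # 0
iqr_shift = iqr(with_outlier) - iqr(clean)                                 # 0.0
```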
What are standard units and how are they calculated?
Standard units measure how many SDs a data point lies above or below the mean.
Standard units = (data point - mean)/SD
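A minimal Python sketch of this conversion, assuming the population SD and made-up readings (the function name is illustrative):

```python
import statistics

def standard_units(x, values):
    """(data point - mean) / SD, using the population SD."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return (x - m) / sd

data = [2, 4, 4, 4, 5, 5, 7, 9]      # made-up readings: mean = 5, SD = 2
z_high = standard_units(9, data)     # (9 - 5) / 2 = 2.0, i.e. 2 SDs above the mean
z_low = standard_units(2, data)      # (2 - 5) / 2 = -1.5, i.e. 1.5 SDs below
```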
What is the coefficient of variation and how is it calculated?
The coefficient of variation is a relative measure of spread: it combines the SD and the mean into 1 summary.
CoV = SD/mean
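A short Python sketch with made-up readings, assuming the population SD. The ratio is only meaningful when the mean is not zero (typically for positive-valued data).

```python
import statistics

def coefficient_of_variation(values):
    """CoV = SD / mean (population SD; the mean must be non-zero)."""
    return statistics.pstdev(values) / statistics.fmean(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]          # made-up readings: mean = 5, SD = 2
cov = coefficient_of_variation(data)     # 2 / 5 = 0.4
```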
What features are included in a boxplot?
- Q2: median
- Q1, Q3: 25th and 75th percentiles
- Lower threshold: LT = Q1 - 1.5*IQR
- Upper threshold: UT = Q3 + 1.5*IQR
- Data points lying outside the thresholds are outliers.
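The features listed above can be computed with a sketch like this one. The function name and data are made up, and the quartiles assume the "inclusive" method of `statistics.quantiles`; plotting software may use a slightly different convention.

```python
import statistics

def boxplot_summary(values):
    """Quartiles, IQR, 1.5*IQR thresholds, and the points flagged as outliers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr               # LT
    upper = q3 + 1.5 * iqr               # UT
    outliers = [x for x in values if x < lower or x > upper]
    return {"Q1": q1, "Q2": q2, "Q3": q3, "IQR": iqr,
            "LT": lower, "UT": upper, "outliers": outliers}

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]      # made-up readings with one extreme value
summary = boxplot_summary(data)          # 30 lies above UT, so it is flagged
```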
Describe the differences between quantile and quartile.
A set of q-quantiles divides the data into q equal-sized groups (in terms of the percentage of data), using (q-1) cut points.
Quartiles are the 4-quantiles: 3 cut points (Q1, Q2, Q3) divide the data into quarters.
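A quick Python check of the counting, using the numbers 1 to 100 as made-up data. `statistics.quantiles(data, n=q)` returns the cut points; `group_sizes` is a hypothetical helper written for this illustration.

```python
import statistics
from bisect import bisect_right

def group_sizes(values, cuts):
    """Count how many values fall into each group formed by the cut points."""
    ordered = sorted(values)
    edges = [0] + [bisect_right(ordered, c) for c in cuts] + [len(ordered)]
    return [hi - lo for lo, hi in zip(edges, edges[1:])]

data = list(range(1, 101))                           # 1, 2, ..., 100

quartile_cuts = statistics.quantiles(data, n=4)      # 3 cut points -> 4 groups
decile_cuts = statistics.quantiles(data, n=10)       # 9 cut points -> 10 groups

quartile_groups = group_sizes(data, quartile_cuts)   # 25 values in each quarter
decile_groups = group_sizes(data, decile_cuts)       # 10 values in each tenth
```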