Numerical Summaries Flashcards
Histogram
visualise quantitative results
Highlights the frequency of data in one class interval compared to another
density scale
Height of each block = proportion in the block/length of the class interval
The area of the whole histogram on the density scale is one (or, as a percentage, 100%)
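A minimal sketch of the density-scale rule above, assuming left-closed, right-open class intervals. The function name `density_heights`, the `ages` values, and the interval edges are illustrative assumptions, not part of the original cards.

```python
def density_heights(data, edges):
    """Height of each block = (proportion in the block) / (length of the class interval)."""
    n = len(data)
    heights = []
    for left, right in zip(edges, edges[1:]):
        # Count the values that fall in [left, right)
        count = sum(1 for x in data if left <= x < right)
        proportion = count / n
        heights.append(proportion / (right - left))
    return heights


ages = [3, 8, 15, 18, 21, 24, 30, 42, 55, 61]  # hypothetical data
edges = [0, 18, 25, 65]                        # hypothetical class intervals

heights = density_heights(ages, edges)

# Area of each block = height * width; the areas should add up to 1
total_area = sum(h * (right - left)
                 for h, left, right in zip(heights, edges, edges[1:]))
print(heights)
print(total_area)
```

The narrow interval [18, 25) gets the tallest block even though it holds no more points than [0, 18), which is what the density scale is meant to show.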
Simple box plot
Graphic display of numerical summaries
Shows the five-number summary of a data set: the box holds the middle 50% of the data, the whiskers extend to the expected minimum and maximum, and any points beyond the whiskers are flagged as outliers.
comparative box plot
splits up a quantitative variable by a qualitative variable.
Scatter plot
Examines the relationship between 2 quantitative variables.
Heat map
useful when a contingency table is not practical due to too many different values.
end point convention
If each interval includes its left endpoint but excludes its right endpoint, then an 18-year-old is counted in [18, 25), not [0, 18).
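One way to sketch this convention in code is with a binary search over the interval edges. The helper name `class_interval` and the edge values are assumptions made for illustration.

```python
from bisect import bisect_right


def class_interval(value, edges):
    """Return the (left, right) interval containing value, using [left, right)."""
    # bisect_right places a value equal to an edge to the RIGHT of that edge,
    # so an exact edge value belongs to the interval that starts there.
    i = bisect_right(edges, value) - 1
    if i < 0 or i >= len(edges) - 1:
        raise ValueError(f"{value} lies outside the class intervals")
    return edges[i], edges[i + 1]


edges = [0, 18, 25, 65]  # hypothetical class intervals

print(class_interval(18, edges))    # (18, 25)
print(class_interval(17.9, edges))  # (0, 18)
```

Under this convention the final right endpoint (65 here) is excluded as well, so it raises an error instead of being placed in the last interval.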
crowding
high density within a class interval
Advantages of numerical summaries
A numerical summary reduces all the data to one simple number (“statistic”)
Precise number, less disagreement
Sample mean
unique point at which the data is balanced.
i.e. the numbers to the left of the mean are balanced by the numbers to the right of the mean.
Sample median
the middle data point, when the observations are ordered from smallest to largest.
Robust
Sample median is said to be robust and is a good summary for skewed data as it is not affected by outliers
Comparing sample mean and median
The difference between the sample mean and the sample median can be an indication of the shape of the data.
For symmetric data, we expect the sample mean
to be the same as the sample median
For left skewed data, we expect the sample mean
to be smaller than the sample median
For right skewed data, we expect the sample mean
to be larger than the sample median
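The three cases above can be checked on small made-up data sets. The lists below are hypothetical and chosen only so that the skew is obvious. The last two lines also illustrate robustness: making the outlier more extreme moves the mean but leaves the median alone.

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 4, 20]         # long tail to the right
left_skewed = [-20, -4, -3, -3, -2, -2, -1]   # long tail to the left

for name, data in [("symmetric", symmetric),
                   ("right skewed", right_skewed),
                   ("left skewed", left_skewed)]:
    print(f"{name}: mean = {mean(data)}, median = {median(data)}")

# Robustness: push the outlier further out and compare the two summaries
more_extreme = [1, 2, 2, 3, 3, 4, 200]
print(mean(more_extreme), median(more_extreme))
```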
Limitations of sample mean and median
need to be paired with a measure of spread.
Root Mean Square (RMS).
measures the average size of a set of numbers, regardless of their signs: square the numbers, take the mean of the squares, then take the square root.
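A short sketch of the RMS calculation, read from right to left as Square, Mean, Root. The function name `rms` and the example numbers are illustrative assumptions.

```python
from math import sqrt


def rms(numbers):
    """Root mean square: the square root of the mean of the squares."""
    squares = [x * x for x in numbers]
    return sqrt(sum(squares) / len(squares))


gaps = [-1, 1, -1, 1]  # hypothetical numbers with mixed signs

print(sum(gaps) / len(gaps))  # the plain mean cancels to 0.0
print(rms(gaps))              # the RMS ignores the signs and gives 1.0
```

This is why the plain average is a poor measure of size for signed numbers such as gaps from the mean: the positives and negatives cancel.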
Standard deviation
measures the spread of the data
(roughly, the average size of the gaps between the data points and the mean)
Population standard deviation
RMS of the gaps (deviations) from the mean
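As a sketch, the population SD can be computed by taking the gaps from the mean and then their RMS. The helper names and the data are assumptions for illustration; `statistics.pstdev` from the Python standard library is used only as a cross-check.

```python
from math import sqrt
from statistics import mean, pstdev


def rms(numbers):
    """Root mean square of a list of numbers."""
    return sqrt(sum(x * x for x in numbers) / len(numbers))


def population_sd(data):
    """Population SD = RMS of the gaps between each data point and the mean."""
    m = mean(data)
    gaps = [x - m for x in data]
    return rms(gaps)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5

print(population_sd(data))  # 2.0
print(pstdev(data))         # cross-check with the standard library
```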
Standard units
= (data point - mean) / SD
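A minimal sketch of converting data to standard units. It assumes that "SD" on the card means the population SD, which is an assumption; the function name and data are also illustrative.

```python
from statistics import mean, pstdev


def standard_units(data):
    """Express each data point as its number of SDs above or below the mean."""
    m = mean(data)
    sd = pstdev(data)  # assumption: population SD
    return [(x - m) / sd for x in data]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data: mean 5, population SD 2

z = standard_units(data)
print(z)
```

A value of 9 comes out as 2 standard units (two SDs above the mean), and a value of 2 comes out as -1.5.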
IQR
Range of the middle 50% of the data
Q3 - Q1
1st quartile
25th percentile
3rd quartile
75th percentile
Lower threshold on boxplot
Q1 - 1.5(IQR)
Upper threshold on boxplot
Q3 + 1.5(IQR)
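The IQR and the two thresholds can be sketched as follows. Quartile conventions differ between textbooks and software; this example assumes the "inclusive" method of `statistics.quantiles` (Python 3.8+), and the data is made up.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data with one large value

# Three cut points that split the ordered data into quarters
q1, med, q3 = quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
lower = q1 - 1.5 * iqr  # lower threshold
upper = q3 + 1.5 * iqr  # upper threshold

# Points outside the thresholds are flagged as outliers on the box plot
outliers = [x for x in data if x < lower or x > upper]

print(q1, med, q3, iqr)
print(lower, upper, outliers)
```

With a different quartile convention the thresholds shift slightly, so a borderline point may be flagged under one method and not another.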
Coefficient of variation (CV)
combines the mean and standard deviation into one summary (SD/mean)
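A brief sketch of the CV, assuming the population SD is used and the mean is not zero. The function name and both data sets are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(data):
    """CV = SD / mean (assumes a non-zero mean; population SD used here)."""
    return pstdev(data) / mean(data)


heights_cm = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical measurements
heights_mm = [x * 10 for x in heights_cm]  # same data in different units

print(coefficient_of_variation(heights_cm))
print(coefficient_of_variation(heights_mm))
```

Changing the units rescales the SD and the mean by the same factor, so the CV is unchanged. That makes it useful for comparing spread across data sets measured on different scales.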
quartile
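the three cut points (Q1, median, Q3) that split the ordered data into four equal-sized parts; together with the min and max they give the five-number summary:
min, Q1, median, Q3, max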
quantile
points in a distribution that relate to the rank order of values in that distribution
The set of q-quantiles divides the data into q equal size sets (in terms of percentage of data).
percentile
100-quantile
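To illustrate how quartiles and percentiles are both special cases of q-quantiles, here is a sketch using `statistics.quantiles` (Python 3.8+) with its "inclusive" method. The choice of method and the data (the whole numbers 1 to 101) are assumptions made so the cut points come out evenly.

```python
from statistics import quantiles

data = list(range(1, 102))  # hypothetical data: 1, 2, ..., 101

# q-quantiles are described by q - 1 cut points
quartiles = quantiles(data, n=4, method="inclusive")      # 3 cut points
percentiles = quantiles(data, n=100, method="inclusive")  # 99 cut points

print(quartiles)
print(len(percentiles))

# Q1 is the 25th percentile and Q3 is the 75th percentile
print(percentiles[24], percentiles[74])
```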