Numerical Descriptive Statistics Flashcards
4 measures used to describe data
Central tendency
Quartiles
Variation
Shape
4 measures of central tendency
Arithmetic mean
Median
Mode
Geometric mean
5 measures of variation
Range
Interquartile range
Variance
Standard deviation
Coefficient of variation
1 measure of shape
Skewness
What’s required to make an informed decision
To make an informed decision you need to know three things about the data: central tendency (location), spread and shape. The information is complete only when all three are present.
Arithmetic mean
The arithmetic mean is the sum of the observations divided by the number of observations.
Median and mean: sensitivity to extreme values
The median is not sensitive to extreme values; the mean is.
Sigma
Sigma (Σ) is shorthand for adding up the values.
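Written with sigma, the arithmetic mean from the earlier card can be expressed as the formula below. The notation is an assumption, not taken from the cards: X_i stands for the i-th observation, n for the number of observations, and X-bar for the mean.

```latex
\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}
```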
Median
In an ordered array, the median is the middle number (50% above and 50% below). Its main advantage over the arithmetic mean is that it is not affected by extreme values.
Location of the median
Median position = (n+1)/2 in the ranked data. This is not the value of the median, only its position in the ranked data. If the number of values in the data set is odd, the median is the middle ranked value. If the number of values is even, the median is the mean (average) of the two middle ranked values.
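The position rule can be sketched in Python as follows. This is a minimal illustration, not a library routine: the function name `median_by_position` and the sample values are made up for the example.

```python
def median_by_position(values):
    """Find the median via the (n + 1) / 2 position in the ranked data."""
    if not values:
        raise ValueError("need at least one value")
    ranked = sorted(values)
    n = len(ranked)
    position = (n + 1) / 2  # 1-based position, not the median itself
    if position.is_integer():
        # Odd n: the median is the middle ranked value.
        return ranked[int(position) - 1]
    # Even n: position ends in .5, so average the two middle ranked values.
    lower = ranked[int(position) - 1]
    upper = ranked[int(position)]
    return (lower + upper) / 2


print(median_by_position([7, 1, 5]))     # position 2 of [1, 5, 7]
print(median_by_position([7, 1, 5, 3]))  # position 2.5 of [1, 3, 5, 7]
```

The first call returns the 2nd ranked value; the second averages the 2nd and 3rd ranked values.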
Mode
A measure of central tendency. Value that occurs most often (the most frequent). Not affected by extreme values. Never use the mode by itself, always use in conjunction with median or mean. Unlike mean and median, there may be no unique (single) mode for a given data set. Used for either numerical or categorical (nominal) data.
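One way to see these properties is with `multimode` from Python's standard `statistics` module (Python 3.8 or later), which returns every most-frequent value. The data sets below are hypothetical.

```python
from statistics import multimode

# Hypothetical data sets, for illustration only.
shoe_sizes = [7, 8, 8, 9, 9, 9, 10, 30]    # numerical, with one extreme value
colours = ["red", "blue", "red", "green"]  # categorical (nominal)
two_modes = [1, 1, 2, 2, 3]                # two values tie for most frequent
all_different = [4, 5, 6]                  # every value occurs once

print(multimode(shoe_sizes))     # the extreme value 30 does not change the mode
print(multimode(colours))        # the mode works for categories too
print(multimode(two_modes))      # more than one mode
print(multimode(all_different))  # every value ties, so no single mode
```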
What measure is best to use
As the sample size gets bigger, the influence of extreme values diminishes. The mean is generally used most often, unless extreme values (outliers) exist. In that case the median is often used, since it is not sensitive to extreme values. The mode is usually the least used of the three. For example, when the data contain an obvious outlier ($2,000,000), it makes sense to use the median. Most housing prices are now reported as median housing prices in Australian newspapers because of possible outliers.
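The pull of one extreme value on the mean can be checked with a few lines of Python. The prices below are hypothetical; only the $2,000,000 outlier echoes the card.

```python
from statistics import mean, median

# Hypothetical house prices in dollars.
prices = [300_000, 320_000, 350_000, 380_000, 400_000]
with_outlier = prices + [2_000_000]  # add one extreme value

print(mean(prices), median(prices))
print(mean(with_outlier), median(with_outlier))
```

Adding the outlier moves the mean from 350,000 to 625,000, while the median only moves from 350,000 to 365,000.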
Quartiles
Quartiles split the ranked data into four segments, with an equal number of values per segment. The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger. The second quartile, Q2, is the same as the median (50% are smaller, 50% are larger). Only 25% of the observations are greater than the third quartile, Q3.
Finding the quartile
Similar to the median, we find a quartile by determining the value in the appropriate position in the ranked data:
First quartile position: Q1 = (n+1)/4
Second quartile position: Q2 = (n+1)/2 (the median)
Third quartile position: Q3 = 3(n+1)/4
where n is the number of observed values (sample size).
Quartile rule 1
If the result is an integer, then the quartile is equal to the ranked value at that position. For example, if the sample size is n = 7, the first quartile, Q1, is equal to the (7+1)/4 = 2nd ranked value.
Quartile rule 2
If the result is a fractional half (1.5, 2.5, 3.5, etc.), then the quartile is equal to the mean of the two corresponding ranked values. For example, if the sample size is n = 9, the first quartile, Q1, is equal to the (9+1)/4 = 2.5 ranked value, halfway between the second and the third ranked values.
Quartile rule 3
If the result is neither an integer nor a fractional half, round the result to the nearest integer and select that ranked value. For example, if the sample size is n = 10, the first quartile, Q1, is equal to the (10+1)/4 = 2.75 ranked value. Round 2.75 to 3 and use the third ranked value.
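The position formulas and the three rules can be combined into one small function. This is a sketch under assumptions: the name `quartile`, the argument `k` (1, 2 or 3) and the sample data are made up for illustration, and the function needs at least two values. Library routines such as `statistics.quantiles` interpolate between ranked values instead of rounding, so their results may differ from these rules.

```python
def quartile(values, k):
    """Return quartile k (1, 2 or 3) using the position k(n + 1) / 4."""
    if k not in (1, 2, 3):
        raise ValueError("k must be 1, 2 or 3")
    ranked = sorted(values)
    n = len(ranked)
    if n < 2:
        raise ValueError("need at least two values")
    position = k * (n + 1) / 4  # 1-based position in the ranked data
    if position.is_integer():
        # Rule 1: integer position, take that ranked value.
        return ranked[int(position) - 1]
    if position % 1 == 0.5:
        # Rule 2: fractional half, average the two neighbouring ranked values.
        lower = ranked[int(position) - 1]
        upper = ranked[int(position)]
        return (lower + upper) / 2
    # Rule 3: otherwise round to the nearest integer position.
    return ranked[round(position) - 1]


seven = [10, 20, 30, 40, 50, 60, 70]
nine = [10, 20, 30, 40, 50, 60, 70, 80, 90]
ten = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

print(quartile(seven, 1))  # position 2: second ranked value
print(quartile(nine, 1))   # position 2.5: mean of second and third
print(quartile(ten, 1))    # position 2.75: rounds to third ranked value
```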
Measures of variation
Measures of variation give information on the spread or variability of the data values
Interquartile range
Like the median, Q1 and Q3, the IQR is a resistant summary measure (resistant to the presence of extreme values). It avoids outlier problems because the highest- and lowest-valued observations are left out of the calculation. IQR = Q3 - Q1 (third quartile minus first quartile).
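A small comparison of the range and the IQR shows what "resistant" means. The sketch is deliberately limited to n = 7, where the quartile positions (n+1)/4 = 2 and 3(n+1)/4 = 6 are integers, so Q1 and Q3 are simply the 2nd and 6th ranked values. The function name and data are hypothetical.

```python
def range_and_iqr(values):
    """Return (range, IQR) for exactly seven values."""
    ranked = sorted(values)
    if len(ranked) != 7:
        raise ValueError("this sketch only handles n = 7")
    data_range = ranked[-1] - ranked[0]  # largest minus smallest
    q1, q3 = ranked[1], ranked[5]        # 2nd and 6th ranked values
    return data_range, q3 - q1


data = [12, 15, 18, 20, 22, 25, 28]
with_outlier = [12, 15, 18, 20, 22, 25, 280]  # largest value replaced

print(range_and_iqr(data))
print(range_and_iqr(with_outlier))
```

The range jumps from 16 to 268 when the outlier appears, while the IQR stays at 10.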
Sample variance
Measures the average scatter of the values around the mean. Each deviation from the mean is squared because some deviations are negative and some are positive, so without squaring they would cancel out. The sample variance is the sum of the squared deviations from the mean divided by n-1, which makes it approximately the average squared deviation. Its units are the square of the original units.
Sample standard deviation
Most commonly used measure of variation. Shows variation about the mean. Has the same units as the original data. It can be considered a measure of uncertainty.
Advantages of variance and standard deviation
Each value in the data set is used in the calculation. Values far from the mean are given extra weight as deviations from the mean are squared.
Disadvantages of variance and standard deviation
Sensitive to extreme values (outliers). Both are measures of absolute variation, not relative variation.
Differences between sample and population in regards to standard deviation and variance
When calculating the variance and standard deviation for a sample, divide by n-1. When calculating them for a population, divide by N.
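The two divisors can be compared on the same numbers. In this sketch the data are hypothetical, and treating the same values once as a sample and once as a population is only for comparison. The hand calculation is checked against `variance` and `pvariance` from Python's standard `statistics` module.

```python
from math import sqrt
from statistics import mean, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
n = len(data)
centre = mean(data)

# Sum of squared deviations from the mean.
sum_sq = sum((x - centre) ** 2 for x in data)

sample_variance = sum_sq / (n - 1)  # sample: divide by n - 1
population_variance = sum_sq / n    # population: divide by N

# The standard deviation is the square root of the variance,
# so it is back in the original units.
sample_sd = sqrt(sample_variance)
population_sd = sqrt(population_variance)

print(sample_variance, variance(data))
print(population_variance, pvariance(data))
print(sample_sd, population_sd)
```

Dividing by n-1 always gives a slightly larger result than dividing by N for the same data.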
Coefficient of variation
Measures relative variation, i.e. shows variation relative to the mean. Can be used to compare two or more sets of data measured in different units. Always expressed as a percentage (%).
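The coefficient of variation is the standard deviation divided by the mean, multiplied by 100%. A minimal Python sketch, assuming sample data and using a made-up function name and made-up heights, shows why it suits comparisons across units: converting the units leaves it unchanged.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    centre = mean(values)
    if centre == 0:
        raise ValueError("undefined when the mean is zero")
    return stdev(values) / centre * 100


# Hypothetical heights, first in centimetres and then in metres.
heights_cm = [150, 160, 170, 180, 190]
heights_m = [h / 100 for h in heights_cm]

print(stdev(heights_cm), stdev(heights_m))  # absolute variation depends on units
print(coefficient_of_variation(heights_cm))
print(coefficient_of_variation(heights_m))  # same relative variation
```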