Numerical Descriptive Statistics Flashcards
4 measures used to describe data
Central tendency
Quartiles
Variation
Shape
4 measures of central tendency
Arithmetic mean
Median
Mode
Geometric mean
5 measures of variation
Range
Interquartile range
Variance
Standard deviation
Coefficient of variation
1 measure of shape
Skewness
What’s required to make an informed decision
Central tendency (location), spread and shape must all be known. Only with all three do you have complete information and can make an informed decision.
Arithmetic mean
The arithmetic mean is the sum of the observations divided by the number of observations.
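As a minimal sketch of this definition, the Python function below sums the observations and divides by their count. The function name and the data are hypothetical, chosen only for illustration.

```python
def arithmetic_mean(values):
    """Sum the observations and divide by the number of observations."""
    if not values:
        raise ValueError("need at least one observation")
    return sum(values) / len(values)


# Hypothetical data: the sum is 25 and there are 5 observations.
sample = [2, 4, 4, 5, 10]
sample_mean = arithmetic_mean(sample)
print(sample_mean)
```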
Median and mean with extreme values
The median is not sensitive to extreme values, whereas the mean is.
Sigma
Sigma (Σ) is shorthand for adding up the values.
Median
In an ordered array, the median is the middle number (50% above and 50% below). Its main advantage over the arithmetic mean is that it is not affected by extreme values.
Location of the median
Median position = (n+1)/2 ranked value. This is not the value of the median, only the position of the median in the ranked data. If the number of values in the data set is odd, the median is the middle ranked value. If the number of values in the data set is even, the median is the mean (average) of the two middle ranked values.
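The position rule can be sketched in Python as follows. This is an illustrative sketch only: the function name and the sample values are hypothetical, and ranks are counted from 1 as on the card.

```python
def median_from_position(values):
    """Find the median by locating the (n + 1) / 2 ranked value."""
    if not values:
        raise ValueError("need at least one observation")
    ranked = sorted(values)
    n = len(ranked)
    position = (n + 1) / 2  # 1-based rank, may end in .5 when n is even
    if position.is_integer():
        # Odd n: the median is the middle ranked value.
        return ranked[int(position) - 1]
    # Even n: average the two middle ranked values.
    lower = ranked[int(position) - 1]
    upper = ranked[int(position)]
    return (lower + upper) / 2


odd_median = median_from_position([7, 1, 3])       # position 2
even_median = median_from_position([4, 1, 3, 2])   # position 2.5
```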
Mode
A measure of central tendency. Value that occurs most often (the most frequent). Not affected by extreme values. Never use the mode by itself, always use in conjunction with median or mean. Unlike mean and median, there may be no unique (single) mode for a given data set. Used for either numerical or categorical (nominal) data.
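To illustrate that a data set may have one mode, several modes, or no single most frequent value, the sketch below uses `statistics.multimode` from the Python standard library (available from Python 3.8). The data are hypothetical.

```python
from statistics import multimode

# One clear mode: 4 occurs most often.
single_mode = multimode([1, 4, 4, 4, 7])

# Two values tie for most frequent, so there is no unique mode.
two_modes = multimode([1, 2, 2, 3, 3, 4])

# The mode also works for categorical (nominal) data.
colour_mode = multimode(["red", "blue", "red", "green"])

# Every value occurs once, so all values tie.
all_tied = multimode([1, 2, 3])
```

`multimode` returns the tied values in the order they first appear, so a result with more than one entry signals that no unique mode exists.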
What measure is best to use
As the sample size gets bigger, the influence of extreme values diminishes. The mean is generally used most often, unless extreme values (outliers) exist. The median is often used instead, since it is not sensitive to extreme values. The mode is usually the least used of the three. Where there is an obvious outlier (for example $2,000,000), it makes sense to use the median. Most housing prices are now reported as median housing prices in Australian newspapers due to possible outliers.
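The effect of an outlier on the mean and the median can be sketched with the standard `statistics` module. Only the $2,000,000 figure comes from the card; the other house prices are assumptions made up for illustration.

```python
from statistics import mean, median

# Hypothetical house prices without the outlier.
typical_prices = [300_000, 320_000, 350_000, 380_000, 400_000]
# The same prices plus the 2,000,000 outlier mentioned on the card.
with_outlier = typical_prices + [2_000_000]

mean_before, median_before = mean(typical_prices), median(typical_prices)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

With these assumed prices the mean jumps from 350,000 to 625,000, while the median only moves from 350,000 to 365,000.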
Quartiles
Quartiles split the ranked data into four segments, with an equal number of values per segment. The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger. The second quartile, Q2, is the same as the median (50% are smaller, 50% are larger). The third quartile, Q3, is the value for which 75% of the observations are smaller and only 25% are larger.
Finding the quartile
Similar to the median, we find a quartile by determining the value in the appropriate position in the ranked data, where n is the number of observed values (sample size):
First quartile position: Q1 = (n+1)/4
Second quartile position: Q2 = (n+1)/2 (the median)
Third quartile position: Q3 = 3(n+1)/4
Quartile rule 1
If the result is an integer, then the quartile is equal to the ranked value at that position. For example, if the sample size is n = 7, the first quartile, Q1, is equal to the (7+1)/4 = 2nd ranked value.
Quartile rule 2
If the result is a fractional half (1.5, 2.5, 3.5, etc.), then the quartile is equal to the mean of the corresponding ranked values. For example, if the sample size is n = 9, the first quartile, Q1, is equal to the (9+1)/4 = 2.5 ranked value, halfway between the second and the third ranked values.
Quartile rule 3
If the result is neither an integer nor a fractional half, round the result to the nearest integer and select that ranked value. For example, if the sample size is n = 10, the first quartile, Q1, is equal to the (10+1)/4 = 2.75 ranked value. Round 2.75 to 3 and use the third ranked value.
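The three quartile rules can be sketched as one Python function. The function name and the test data are hypothetical; the sample sizes n = 7, 9 and 10 mirror the examples on the cards. Note that library routines often interpolate instead of rounding, so they may not match rule 3.

```python
def quartile(values, k):
    """Return quartile k (1, 2 or 3) using the position k(n + 1) / 4."""
    if k not in (1, 2, 3):
        raise ValueError("k must be 1, 2 or 3")
    ranked = sorted(values)
    n = len(ranked)
    if n < 2:
        raise ValueError("need at least two observations")
    position = k * (n + 1) / 4  # 1-based rank in the ranked data

    if position.is_integer():
        # Rule 1: integer position, take that ranked value.
        return ranked[int(position) - 1]
    if (position * 2).is_integer():
        # Rule 2: fractional half, average the two neighbouring ranked values.
        lower_rank = int(position)
        return (ranked[lower_rank - 1] + ranked[lower_rank]) / 2
    # Rule 3: otherwise round to the nearest integer rank.
    nearest_rank = int(position + 0.5)
    return ranked[nearest_rank - 1]


q1_n7 = quartile([10, 20, 30, 40, 50, 60, 70], 1)                     # position 2
q1_n9 = quartile([10, 20, 30, 40, 50, 60, 70, 80, 90], 1)             # position 2.5
q1_n10 = quartile([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], 1)       # position 2.75
```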
Measures of variation
Measures of variation give information on the spread or variability of the data values
Interquartile range
Like the median, Q1 and Q3, the IQR is a resistant summary measure (resistant to the presence of extreme values). It avoids outlier problems because the highest- and lowest-valued observations are left out of the calculation. IQR = 3rd quartile - 1st quartile, i.e. IQR = Q3 - Q1.
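The resistance of the IQR can be sketched by comparing it with the range on hypothetical data. This sketch assumes `statistics.quantiles` with `method="exclusive"`, which uses (n+1)-based positions but interpolates linearly between ranked values rather than rounding; n = 7 is chosen here so the quartile positions are whole numbers and the two approaches agree.

```python
from statistics import quantiles


def interquartile_range(values):
    """IQR = Q3 - Q1, using (n + 1)-based quartile positions."""
    q1, _median, q3 = quantiles(values, n=4, method="exclusive")
    return q3 - q1


def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


# Hypothetical data; the second list replaces the largest value with an outlier.
data = [1, 2, 3, 4, 5, 6, 7]
data_with_outlier = [1, 2, 3, 4, 5, 6, 700]

print(data_range(data), interquartile_range(data))
print(data_range(data_with_outlier), interquartile_range(data_with_outlier))
```

The range grows from 6 to 699 once the outlier is present, while the IQR stays at 4 in both cases.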
Sample variance
Measures the average scatter around the mean, in squared units. The deviations from the mean are squared because some are negative and some are positive, so they would otherwise cancel out. The sample variance is the sum of the squared deviations from the mean divided by n-1, which is approximately the average squared deviation.
Sample standard deviation
Most commonly used measure of variation. Shows variation about the mean. Has the same units as the original data. It can be considered a measure of uncertainty.
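The sample variance and sample standard deviation can be sketched step by step in Python. The function names and data are hypothetical; the divisor n-1 follows the sample formulas described on these cards.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


def sample_std_dev(values):
    """Square root of the sample variance, in the original units."""
    return math.sqrt(sample_variance(values))


# Hypothetical data: mean 5, squared deviations 9, 1, 1, 0, 25 (sum 36).
data = [2, 4, 4, 5, 10]
variance = sample_variance(data)   # 36 / 4
std_dev = sample_std_dev(data)
```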
Advantages of variance and standard deviation
Each value in the data set is used in the calculation. Values far from the mean are given extra weight as deviations from the mean are squared.
Disadvantages of variance and standard deviation
Sensitive to extreme values (outliers). Measures of absolute variation not relative variation.
Differences between sample and population in regards to standard deviation and variance
When calculating the variance and standard deviation for a sample, the divisor is n-1; when calculating them for a population, the divisor is N.
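The difference between the two divisors can be seen with the standard `statistics` module, where `variance` and `stdev` divide by n-1 and `pvariance` and `pstdev` divide by N. The data are hypothetical.

```python
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical data: sum of squared deviations from the mean is 36.
data = [2, 4, 4, 5, 10]

sample_var = variance(data)       # 36 / (5 - 1)
population_var = pvariance(data)  # 36 / 5
sample_sd = stdev(data)
population_sd = pstdev(data)

print(sample_var, population_var)
print(sample_sd, population_sd)
```

For the same data, the sample versions are always at least as large as the population versions because the divisor is smaller.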
Coefficient of variation
Measures relative variation, i.e. shows variation relative to the mean. Can be used to compare two or more sets of data measured in different units. Always expressed as a percentage (%).
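A minimal sketch of the coefficient of variation, assuming the usual form CV = (sample standard deviation / mean) × 100%. The function name and both data sets are hypothetical; they are chosen so that the standard deviations are equal but the relative variation differs.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("the mean is zero, so the CV is undefined")
    return stdev(values) / m * 100


# Hypothetical measurements in different units.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
print(round(cv_heights, 2), round(cv_weights, 2))
```

Both lists have the same standard deviation (about 15.8), but relative to their means the weights vary more (about 22.6%) than the heights (about 9.3%).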
The Z score
The difference between a given observation and the mean, divided by the standard deviation. A Z score of 2.0 means that a value is 2.0 standard deviations from the mean. A value with a Z score above 3.0 or below -3.0 is considered an outlier.
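The Z score and the outlier rule can be sketched as two small Python functions. The names and the numbers (mean 100, standard deviation 15) are hypothetical.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies from the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / std_dev


def is_outlier(z):
    """Apply the rule on the card: Z above 3.0 or below -3.0."""
    return z > 3.0 or z < -3.0


# Hypothetical mean and standard deviation.
z_typical = z_score(130, mean=100, std_dev=15)   # two standard deviations above
z_extreme = z_score(160, mean=100, std_dev=15)   # four standard deviations above
```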
The shape of a distribution
Describes how data are distributed. The shape of a distribution is either symmetric or skewed.
Left skewed and right skewed
When the data are left (negatively) skewed, the distance between Q1 and Q2 is greater than the distance between Q2 and Q3. The reverse applies for right (positively) skewed data. If the data are symmetric, the two distances are the same.
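The quartile-distance comparison can be sketched as a small Python helper. The function name and the quartile values are hypothetical, and the exact equality test is a simplification: with real data the two distances are rarely identical, so this gives only a rough indication of shape.

```python
def shape_from_quartiles(q1, q2, q3):
    """Compare the Q1-to-Q2 distance with the Q2-to-Q3 distance."""
    lower_distance = q2 - q1
    upper_distance = q3 - q2
    if lower_distance > upper_distance:
        return "left-skewed"
    if upper_distance > lower_distance:
        return "right-skewed"
    return "symmetric"


# Hypothetical quartiles for three data sets.
print(shape_from_quartiles(10, 30, 40))  # lower half more spread out
print(shape_from_quartiles(10, 20, 40))  # upper half more spread out
print(shape_from_quartiles(10, 20, 30))  # equal distances
```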
What does a box and whisker plot show
A box and whisker plot shows location, spread and shape.
Numerical measures for a population
Population summary measures are called parameters. The population mean is the sum of the values in the population divided by the population size, N
Population variance
The average of the squared deviations of the values from the population mean.
Population standard deviation
Shows variation about the mean. Is the square root of the population variance. Has the same units as the original data.
Arithmetic mean equation
Photo 1
Example of mean, median and mode
Photo 2
Quartile example
Photo 3
Measures of variation example
Photo 4
Range example
Photo 5
Range disadvantages
Photo 6
Sample variance equation
Photo 7
Sample standard deviation equation
Photo 8
Sample standard deviation example
Photo 9
Sample standard deviation graphed example
Photo 10
Comparing standard deviations
Photo 11
Coefficient of variation equation
Photo 12
Coefficient of variation example
Photo 13
The 3 shapes of a distribution
Photo 14
Using Excel for descriptive statistics
Photos 15-16
Population mean equation
Photo 17
Empirical rule
Photos 18-19
Box and whisker plot
Photo 20
Distribution shape box and whisker plot
Photo 21
Covariance
The sample covariance measures how two numerical variables vary together linearly. Its sign shows the direction of the relationship. Its size is affected by the units of measurement, so on its own it is hard to interpret as a measure of strength. No causal effect is implied.
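A minimal sketch of the sample covariance, assuming the usual form: the sum of the products of paired deviations from each mean, divided by n-1. The function name and the data are hypothetical. The second call uses the same amounts in cents instead of dollars to show the effect of units.

```python
def sample_covariance(xs, ys):
    """Sum of (x - x_bar)(y - y_bar) over all pairs, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two pairs of observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


# Hypothetical paired data.
hours = [1, 2, 3, 4, 5]
pay_dollars = [2, 4, 6, 8, 10]
pay_cents = [200, 400, 600, 800, 1000]

cov_dollars = sample_covariance(hours, pay_dollars)
cov_cents = sample_covariance(hours, pay_cents)
cov_negative = sample_covariance(hours, pay_dollars[::-1])
```

The relationship is identical in both currencies, yet the covariance is 5 in dollars and 500 in cents; only the sign (positive here, negative for the reversed list) is directly comparable.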
Covariance equation
Photo 22
Correlation
Measures the relative strength of the linear relationship between two variables
Correlation equation
Photo 23
Features of correlation coefficient
Also called standardised covariance, i.e. invariant to units of measure. Ranges between -1 and 1. The closer to -1, the stronger the negative linear relationship. The closer to 1, the stronger the positive linear relationship. The closer to 0, the weaker the linear relationship.
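These features can be checked with a short Python sketch of the sample correlation coefficient, assuming the usual form: covariance divided by the product of the two standard deviations. The n-1 divisors cancel, so the sketch works directly with sums of squares. The function name and data are hypothetical.

```python
import math


def sample_correlation(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    sum_yy = sum((y - y_bar) ** 2 for y in ys)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    # The (n - 1) divisors in the covariance and variances cancel out.
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
r_perfect = sample_correlation(xs, [2, 4, 6, 8, 10])
r_rescaled = sample_correlation(xs, [200, 400, 600, 800, 1000])
r_negative = sample_correlation(xs, [10, 8, 6, 4, 2])
r_moderate = sample_correlation(xs, [2, 1, 4, 3, 5])
```

Rescaling the second variable (dollars to cents, say) leaves the coefficient unchanged, which is what "invariant to units of measure" means in practice.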
Scatter Plots of Data with Various Correlation Coefficients
Photo 24
Pitfalls and ethical issues
Data Analysis is objective
Data interpretation is subjective
Objective
Should report the summary measures that best meet the assumptions about the data set
Subjective
Should be done in a fair, neutral and transparent manner. Should document both good and bad results. Results should be presented in a fair, objective and neutral manner. Should not use inappropriate summary measures to distort facts. Do not fail to report pertinent findings, even if such findings do not support the original argument.
IQR Example
Photo 25
Population variance and standard deviation equations
Photo 26
5 number summary
Numerical data summarised by five values: Xsmallest, Q1, Median, Q3, Xlargest.
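A minimal sketch of the five number summary in Python. It assumes `statistics.quantiles` with `method="exclusive"`, which uses (n+1)-based quartile positions but interpolates linearly rather than rounding; n = 7 is used so the positions are whole numbers. The function name and data are hypothetical.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (Xsmallest, Q1, Median, Q3, Xlargest)."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    q1, median, q3 = quantiles(values, n=4, method="exclusive")
    return (min(values), q1, median, q3, max(values))


# Hypothetical data with n = 7, so Q1, the median and Q3 are the
# 2nd, 4th and 6th ranked values.
summary = five_number_summary([11, 12, 13, 16, 16, 17, 18])
print(summary)
```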