Chapter 3 Flashcards
What are the measures of location
- Mean
- Weighted Mean
- Geometric Mean
- Median
- Mode
- Percentiles
- Quartiles
Describe the mean
- most important measure of location
- also called average
- provides a measure of central location
- most commonly used
Describe the weighted mean
- arithmetic average where some data values contribute more than others
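A minimal Python sketch of the weighted mean, assuming the usual formula sum(w * x) / sum(w). The function name and the cost/quantity figures are hypothetical, chosen only for illustration.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical data: cost per pound for three purchases, weighted by pounds bought.
costs = [3.00, 3.40, 2.80]
pounds = [1200, 500, 2750]
avg_cost = weighted_mean(costs, pounds)  # 13000 / 4450, about 2.92
```

The simple mean of the three costs is about 3.07; weighting by pounds pulls the result toward 2.80, the cost of the largest purchase.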
Describe the Geometric Mean
- finding the nth root of the product of n values
- often used to analyze growth rates in financial data
- in these cases, arithmetic mean will provide misleading results
What format must the number be in to calculate the Geometric mean?
- can’t be a percent; must be a growth factor (a decimal multiplier)
- if it is a percent and it is negative, 1 - the number (as a decimal)
- if it is a percent and it is positive, 1 + the number (as a decimal)
example: -40% converts to 0.40, so the growth factor = 1 - 0.40 = 0.60
then you can use the formula
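The conversion and the nth-root formula can be sketched in Python as follows. The function names and the two returns are hypothetical; the sketch assumes every growth factor is positive.

```python
import math

def to_growth_factor(percent_change):
    """Convert a percent change (e.g. -40 for -40%) to a growth factor (0.60)."""
    return 1 + percent_change / 100

def geometric_mean(growth_factors):
    """nth root of the product of n growth factors (all must be positive)."""
    if not growth_factors or any(f <= 0 for f in growth_factors):
        raise ValueError("growth factors must be a non-empty list of positive numbers")
    return math.prod(growth_factors) ** (1 / len(growth_factors))

# Hypothetical returns: -40% in year 1, then +50% in year 2.
factors = [to_growth_factor(r) for r in (-40, 50)]  # about [0.6, 1.5]
gm = geometric_mean(factors)                        # sqrt(0.6 * 1.5) = sqrt(0.9)
avg_growth_pct = (gm - 1) * 100                     # about -5.13% per year
```

The arithmetic mean of -40% and +50% is +5%, which suggests growth, yet $100 becomes $60 and then $90. The geometric mean reflects that loss.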
Describe the median
- sometimes the preferred measure b/c it is not affected by outliers
- the middle of a sorted list of data values
- arranged in ascending order
- odd # values = median is the middle number
- even # values = median is the mean of 2 central data values
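A short Python sketch of the odd/even rule above. The salary figures (in $1000s) are hypothetical and include one extreme value to show why the median can be preferred.

```python
def median(values):
    """Middle value of the sorted data; mean of the 2 central values if n is even."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)           # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                     # odd # of values
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even # of values

# Hypothetical salaries with one extreme value.
salaries = [30, 32, 35, 38, 400]
middle = median(salaries)                 # 35
average = sum(salaries) / len(salaries)   # 107.0, pulled up by the 400
```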
Describe the mode
- data value that occurs most often (greatest frequency)
- 2 modes - bimodal
- more than 2 modes - multimodal
- if there are 3 or more modes, don’t report the mode b/c listing them is not helpful in describing the location of data
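A small Python sketch using the standard library's `statistics.multimode`, which returns every value tied for the greatest frequency. The helper name and the "unimodal" label are my own additions for illustration.

```python
from statistics import multimode

def describe_modes(values):
    """Return (list of modes, label) for a non-empty data set."""
    if not values:
        raise ValueError("values must not be empty")
    modes = multimode(values)   # all values sharing the highest frequency
    label = {1: "unimodal", 2: "bimodal"}.get(len(modes), "multimodal")
    return modes, label

one_mode = describe_modes([5, 5, 1])
two_modes = describe_modes([1, 2, 2, 3, 3])
many_modes = describe_modes([1, 1, 2, 2, 3, 3])
```

Note that if every value occurs equally often, all of them come back as modes, which is the case where reporting the mode says little about location.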
Describe percentiles
- how data is spread over the interval from smallest value to largest
3 steps
1. arrange the data in ascending order
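2. compute an index i = (p/100)n, where p is the percentile and n is the # of data values
3. if i is not an integer, round up; the pth percentile is the value in that position
if i is an integer, the pth percentile is the avg. of the values in positions i & i+1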
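The 3 steps can be sketched in Python as below, assuming the index in step 2 is i = (p/100)n. Other percentile conventions exist and can give slightly different answers. The function name and data are hypothetical.

```python
import math

def percentile(values, p):
    """pth percentile (0 < p < 100) using the 3-step index method."""
    if not values or not 0 < p < 100:
        raise ValueError("need non-empty data and 0 < p < 100")
    ordered = sorted(values)            # step 1: ascending order
    i = p * len(ordered) / 100          # step 2: index i = (p/100) * n
    if i.is_integer():                  # step 3b: average positions i and i+1
        k = int(i)
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[math.ceil(i) - 1]    # step 3a: round up (1-based position)

data = [10, 20, 30, 40, 50, 60, 70, 80]
p85 = percentile(data, 85)   # i = 6.8, round up to 7th value
p50 = percentile(data, 50)   # i = 4, average of 4th and 5th values
```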
What is the formula to find the percentile of x?
(# of data points less than x / total # of data points) x 100
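A hedged Python sketch of this ratio. It assumes the count is of values strictly less than x and that the ratio is multiplied by 100 to read as a percent; both are assumptions, since the card is terse. The function name and data are hypothetical.

```python
def percentile_of(values, x):
    """Percent of data points strictly less than x (assumed convention)."""
    if not values:
        raise ValueError("values must not be empty")
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

scores = [10, 20, 30, 40, 50]
rank = percentile_of(scores, 40)   # 3 of 5 values are below 40
```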
What are the measures of variability
- range
- interquartile range
- variance
- standard deviation
- coefficient of variation
What is another name for measures of variability
Measures of dispersion
What do the measures of variability describe
spread of data
Describe Range
- Largest value - smallest value
- seldom used as the only measure b/c it is based on only 2 observations and is highly influenced by extreme values
Describe Interquartile range
IQR = Q3- Q1
- overcomes the dependency on extreme values
- difference b/w the 3rd and 1st quartile
- it is the range for the middle 50% of the data
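A Python sketch comparing the range and the IQR on two hypothetical data sets that differ only in their largest value. Quartiles are computed with the index method i = (p/100)n, which is an assumption; other quartile conventions can give slightly different values.

```python
import math

def percentile(values, p):
    """pth percentile via the index method: i = (p/100) * n."""
    ordered = sorted(values)
    i = p * len(ordered) / 100
    if i.is_integer():
        k = int(i)
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[math.ceil(i) - 1]

def data_range(values):
    """Largest value - smallest value."""
    return max(values) - min(values)

def iqr(values):
    """Q3 - Q1, the range of the middle 50% of the data."""
    return percentile(values, 75) - percentile(values, 25)

typical = [10, 20, 30, 40, 50, 60, 70, 80]
with_extreme = [10, 20, 30, 40, 50, 60, 70, 500]
```

Here the range jumps from 70 to 490 when the largest value changes, while the IQR stays at 40 in both cases.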
Describe Variance
- utilizes ALL data
- based on difference b/w EACH observation (xi) and the mean
- called deviation about the mean
Describe Standard deviation
- positive square root of variance
- easier to interpret than the variance b/c SD is measured in the same units as the data
- commonly used measure of risk associated with investing in stock and stock funds
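A minimal Python sketch of variance and SD. It assumes the sample formulas (dividing by n - 1); the cards do not say whether sample or population formulas are meant. Function names and data are hypothetical.

```python
import math

def sample_variance(values):
    """Sum of squared deviations about the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 values")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(values))

# Hypothetical class sizes: mean = 44, deviations = 2, 10, -2, 2, -12.
class_sizes = [46, 54, 42, 46, 32]
var = sample_variance(class_sizes)   # 256 / 4 = 64 (in squared units)
sd = sample_std(class_sizes)         # 8 (same units as the data)
```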
Describe the Coefficient of Variation
- used when interested in a descriptive statistic that indicates how LARGE the SD is relative to the MEAN
- usually expressed as a percent
- tells us that the sample SD is x% of the value of the sample mean
- useful for comparing the variability of variables that have different SDs and different means
What is the formula for Coefficient of Variation
(SD/Mean) x 100
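A short Python sketch of the formula, using the sample SD from the standard library's `statistics.stdev`. Both data sets are hypothetical and were picked to have very different means and SDs.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """(sample SD / mean) x 100, as a percent."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean must not be zero")
    return stdev(values) / m * 100

class_sizes = [46, 54, 42, 46, 32]                 # mean 44, SD 8
salaries = [30000, 32000, 34000, 36000, 38000]     # mean 34000, SD about 3162

cv_classes = coefficient_of_variation(class_sizes)   # about 18.2%
cv_salaries = coefficient_of_variation(salaries)     # about 9.3%
```

Although the salaries have a far larger SD in absolute terms, the class sizes vary more relative to their mean.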
What is MAE
Mean absolute error
How is MAE calculated
- sum the absolute values of the deviations of the observations about the mean & divide it by # of observations
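The calculation described on this card, sketched in Python. The function name and data are hypothetical.

```python
def mean_absolute_error(values):
    """Sum of |x_i - mean| divided by the # of observations."""
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

# Hypothetical class sizes: mean = 44, absolute deviations = 2, 10, 2, 2, 12.
class_sizes = [46, 54, 42, 46, 32]
mae = mean_absolute_error(class_sizes)   # 28 / 5 = 5.6
```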
What are the measures of distribution shape and relative location
- Skewness
- z-scores
- Chebyshev’s Theorem
- Empirical Rule
What kind of skewness is there and describe each one
- Skewed Left: skewness is negative; mean is usually less than the median
- Skewed Right: skewness is positive; mean is usually more than the median
- Symmetrical: skewness is zero; mean and median are equal
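The cards give no skewness formula, so the Python sketch below assumes one common sample version: n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations. The function name and all three data sets are hypothetical.

```python
from statistics import mean, median, stdev

def sample_skewness(values):
    """One common sample skewness formula (an assumption, not from the cards)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    m, s = mean(values), stdev(values)
    if s == 0:
        raise ValueError("all values are identical")
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)

skewed_left = [1, 8, 9, 9, 10]    # mean 7.4 < median 9
skewed_right = [1, 2, 2, 3, 10]   # mean 3.6 > median 2
symmetrical = [1, 2, 3, 4, 5]     # mean 3 = median 3
```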
Describe z-scores
- to find relative location of values w/in a data set
- how far a particular value is from the mean
- also called standardized value
- using mean and SD we can find the relative location of any observation
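A minimal Python sketch of the z-score, z = (x - mean) / SD, using the sample SD. The function name and data are hypothetical.

```python
from statistics import mean, stdev

def z_scores(values):
    """Number of sample SDs each value lies above (+) or below (-) the mean."""
    m, s = mean(values), stdev(values)
    if s == 0:
        raise ValueError("all values are identical")
    return [(x - m) / s for x in values]

# Hypothetical class sizes: mean 44, sample SD 8.
class_sizes = [46, 54, 42, 46, 32]
zs = z_scores(class_sizes)   # 54 is 1.25 SD above the mean, 32 is 1.5 SD below
```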
Describe Chebyshev’s Theorem
- allows us to make statements about the proportion of data values that must be w/in a specified # of SD of the mean
- at least 75% of the data values are w/in 2 SD of the mean
- at least 89% of the data values are w/in 3 SD of the mean
- at least 94% of the data values are w/in 4 SD of the mean
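The three percentages above follow from Chebyshev's bound: at least 1 - 1/z^2 of the data lie w/in z SD of the mean, for any z > 1. A small Python sketch (the function name is hypothetical):

```python
def chebyshev_min_proportion(z):
    """Minimum proportion of data within z SDs of the mean (requires z > 1)."""
    if z <= 1:
        raise ValueError("the theorem applies only for z > 1")
    return 1 - 1 / z ** 2

within_2 = chebyshev_min_proportion(2)   # 0.75
within_3 = chebyshev_min_proportion(3)   # about 0.89
within_4 = chebyshev_min_proportion(4)   # 0.9375, about 94%
```

The bound holds for any distribution shape, which is why it is stated as "at least".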
Describe Empirical Rule
- based on normal prob. distribution
- used for symmetrical bell shaped distribution
- to determine % of data values that must be w/in a specified # of SD from the mean
- approx. 68% of the data values will be w/in 1 SD of the mean
- approx. 95% of the data values will be w/in 2 SD of the mean
- Almost all of the data values will be w/in 3 SD of the mean
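The percentages can be checked against the standard normal distribution with Python's `statistics.NormalDist`. This sketch assumes the data are exactly normal; the function name is hypothetical.

```python
from statistics import NormalDist

def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    standard_normal = NormalDist()   # mean 0, SD 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1 = share_within(1)   # about 0.6827
within_2 = share_within(2)   # about 0.9545
within_3 = share_within(3)   # about 0.9973
```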
What can you use to detect outliers
- Z-scores - use empirical rule
- treat any data value with a z-score less than -3 or greater than +3 as an outlier
- based on 1st and 3rd Quartiles and IQR
a. compute lower and upper limits
Lower limit = Q1 - 1.5(IQR)
Upper limit = Q3 + 1.5 (IQR)
- if the value is less than the lower limit or greater than the upper limit, treat it as an outlier
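Both approaches can be sketched in Python as below. Quartiles use the index method i = (p/100)n, which is an assumption; the function names and the data set are hypothetical.

```python
import math
from statistics import mean, stdev

def percentile(values, p):
    """pth percentile via the index method: i = (p/100) * n."""
    ordered = sorted(values)
    i = p * len(ordered) / 100
    if i.is_integer():
        k = int(i)
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[math.ceil(i) - 1]

def zscore_outliers(values, cutoff=3):
    """Values whose z-score is less than -cutoff or more than +cutoff."""
    m, s = mean(values), stdev(values)
    return [x for x in values if abs((x - m) / s) > cutoff]

def iqr_outliers(values):
    """Values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)."""
    q1, q3 = percentile(values, 25), percentile(values, 75)
    spread = q3 - q1
    lower, upper = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [x for x in values if x < lower or x > upper]

data = [10, 11, 11, 12, 12, 12, 12, 13, 13, 14, 14, 60]
# Q1 = 11.5, Q3 = 13.5, IQR = 2, so the limits are 8.5 and 16.5.
```

For this data set both methods flag only the value 60.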
What is the 5 number summary
- used to summarize the data
1. Smallest value
2. First Quartile Q1
3. Second Quartile Q2 (Median)
4. Third Quartile Q3
5. Largest Value
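A Python sketch that returns the 5 values in order. Quartiles use the index method i = (p/100)n, which is an assumption; the function names and data are hypothetical.

```python
import math

def percentile(values, p):
    """pth percentile via the index method: i = (p/100) * n."""
    ordered = sorted(values)
    i = p * len(ordered) / 100
    if i.is_integer():
        k = int(i)
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[math.ceil(i) - 1]

def five_number_summary(values):
    """(smallest, Q1, median, Q3, largest)."""
    return (
        min(values),
        percentile(values, 25),
        percentile(values, 50),
        percentile(values, 75),
        max(values),
    )

data = [10, 20, 30, 40, 50, 60, 70, 80]
summary = five_number_summary(data)   # (10, 25.0, 45.0, 65.0, 80)
```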
What is a box plot useful for
- provides a convenient visual display of several characteristics of a data set
- based on 5 # summary
- need to compute IQR = Q3 - Q1
What are the advantages of a box plot
- easy to use
- few calculations
- no need to calculate mean and SD
What are the steps in constructing a box plot
1. a box is drawn with the ends of the box located at the 1st and 3rd quartiles
- this box contains the middle 50% of the data
2. a vertical line is drawn in the box at the location of the median
3. using IQR, limits are located at 1.5(IQR) below Q1 and 1.5(IQR) above Q3
- data outside these limits are outliers
4. dashed lines, called whiskers, are drawn from the ends of the box to the smallest and largest values inside the limits from step 3
5. the location of each outlier is shown with *
Note: generally upper and lower limits are not drawn on the box
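The positions needed to draw a box plot can be computed with the Python sketch below. No plotting library is assumed; it only returns the numbers. Quartiles use the index method i = (p/100)n, which is an assumption, and the function names and data are hypothetical.

```python
import math

def percentile(values, p):
    """pth percentile via the index method: i = (p/100) * n."""
    ordered = sorted(values)
    i = p * len(ordered) / 100
    if i.is_integer():
        k = int(i)
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[math.ceil(i) - 1]

def box_plot_elements(values):
    """Box ends, median line, limits, whisker ends, and outliers."""
    ordered = sorted(values)
    q1, med, q3 = (percentile(ordered, p) for p in (25, 50, 75))
    spread = q3 - q1                                   # IQR
    lower, upper = q1 - 1.5 * spread, q3 + 1.5 * spread
    inside = [x for x in ordered if lower <= x <= upper]
    return {
        "box": (q1, q3),                       # ends of the box
        "median": med,                         # line inside the box
        "limits": (lower, upper),              # usually not drawn
        "whiskers": (inside[0], inside[-1]),   # smallest/largest values inside limits
        "outliers": [x for x in ordered if x < lower or x > upper],
    }

elements = box_plot_elements([10, 20, 30, 40, 50, 60, 70, 500])
```

For this data the whiskers stop at 10 and 70, and 500 is the only value marked as an outlier.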