Week 12 - Descriptive statistics Flashcards
What are numerical measures of descriptive statistics?
measures of central tendency (location) and measures of dispersion (variability)
What are sample statistics?
Numerical measures computed for data from a sample
What are population parameters?
Numerical measures computed for data from a population
What is a sample statistic referred to as?
as the point estimator of the corresponding population parameter
What are the 7 measures of location?
- Mean
- Median
- Mode
- Weighted Mean
- Geometric Mean
- Percentiles
- Quartiles
What is the mean of a data set?
the average of all the data values
What is the sample mean?
The sample mean $\bar{x}$ is a point estimate of the population mean $\mu$
What is the mean equation?
$\bar{x} = \frac{\sum x_i}{n}$
numerator - sum of the values of the n observations
denominator - number of observations in the sample
What is the median of a data set?
The value in the middle when the data items are arranged in ascending order
When is the median the preferred measure of central location?
Whenever a data set has extreme values
For which kinds of data is the median the most often reported measure of location?
annual income and property value data
A few extremely large incomes or property values can inflate the mean
How do we calculate the median for an odd number of observations?
Say we have the following 7 observations:
Sort them in ascending order:
Median is the middle value: 19
How do we calculate the median for an even number of observations?
Say we have 8 observations:
Sort them in ascending order:
Median is the average of the middle two values: (19 + 26)/2 = 22.5
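The median rule on the two cards above can be sketched in Python. This is only an illustration: the function name `median` and both data sets are assumptions. The cards' original observations are not shown, so the values below were chosen so that the results match the medians quoted on the cards (19 and 22.5).

```python
def median(values):
    """Middle value of the sorted data, or the average of the two
    middle values when the number of observations is even."""
    ordered = sorted(values)      # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # odd count: single middle value
        return ordered[mid]
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


# Illustrative data only, not taken from the cards.
odd_sample = [26, 18, 27, 12, 14, 27, 19]        # 7 observations
even_sample = [26, 18, 27, 12, 14, 27, 30, 19]   # 8 observations

print(median(odd_sample))    # middle value of the 7 sorted items
print(median(even_sample))   # average of the 4th and 5th sorted items
```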
Where are the mean and median on a symmetrical diagram?
equal at the middle
Where are the mean and median on a left skew diagram?
The mode is at the peak; moving down the left tail, the median comes next and then the mean (mean < median < mode)
Where are the mean and median on a right skew diagram?
The mode is at the peak; moving down the right tail, the median comes next and then the mean (mode < median < mean)
What is the mode?
The mode of a data set is the value that occurs with greatest frequency.
The greatest frequency can occur at two or more different values
What is bimodal data?
If the data have exactly two modes
What is multimodal data?
If the data have more than two modes
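A small sketch of how the mode, bimodal and multimodal definitions could be checked in Python. The helper names `modes` and `describe_modality`, the label "unimodal" for a single mode, and the sample lists are all assumptions made for illustration. Note that when every value occurs once, this sketch reports every value as a mode, whereas some textbooks would say such data have no mode.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the greatest frequency.
    Assumes a non-empty list of comparable values."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def describe_modality(values):
    """Label the data by its number of modes."""
    k = len(modes(values))
    if k == 1:
        return "unimodal"      # assumed label for a single mode
    if k == 2:
        return "bimodal"       # exactly two modes
    return "multimodal"        # more than two modes


print(modes([1, 2, 2, 3]), describe_modality([1, 2, 2, 3]))
print(modes([1, 1, 2, 2, 3]), describe_modality([1, 1, 2, 2, 3]))
```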
What is the weighted mean?
When the mean is computed by giving each data value a weight that reflects its importance
When data values vary in importance, the analyst must choose the weight that best reflects the importance of each value
What is the weighted mean equation?
$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$
x_i = value of observation i
w_i = weight for observation i
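The weighted mean formula can be sketched as follows. The function name `weighted_mean` and the example scores and weights are hypothetical, and the input checks are an added assumption rather than part of the formula.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical example: the second score counts three times as much.
scores = [80, 90]
weights = [1, 3]
print(weighted_mean(scores, weights))   # (1*80 + 3*90) / 4
```

With equal weights the result reduces to the ordinary mean.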
What is a value-weighted mean?
a type of weighted mean where the weights are based on the values themselves rather than being assigned separately
What is an equal-weighted return?
The simple average of all returns, giving each asset or component the same importance regardless of size or value. This contrasts with a value-weighted return, where larger values (e.g., market capitalization) carry more weight
What is the value-weighted return equation?
value-weighted return = (value_x × r_x + value_y × r_y) / (value_x + value_y)
What is the equal-weighted return equation?
equal-weighted return = $\frac{\sum X_i}{n}$
X_i = individual returns
n = number of assets or components
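To compare the two return equations, here is a minimal sketch. The function names and the two-asset figures are hypothetical, chosen only to show how a large holding pulls the value-weighted return towards its own return.

```python
def equal_weighted_return(returns):
    """Simple average: every asset gets the same weight."""
    return sum(returns) / len(returns)


def value_weighted_return(values, returns):
    """Each return is weighted by the value of its asset."""
    return sum(v * r for v, r in zip(values, returns)) / sum(values)


# Hypothetical two-asset example.
values = [900, 100]      # amount held in assets x and y
returns = [0.10, 0.20]   # returns of assets x and y

print(equal_weighted_return(returns))           # (0.10 + 0.20) / 2
print(value_weighted_return(values, returns))   # (900*0.10 + 100*0.20) / 1000
```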
What is a portfolio return?
A portfolio return is the weighted average return of individual assets in the portfolio
It usually equals the value-weighted return
When is the geometric mean most appropriate to use?
most appropriate in situations where the data items to be summarised result from a ratio-type calculation, such as with growth rates or index numbers
calculated by multiplying all the numbers together and then taking the nth root of the product, where n is the total number of values
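The nth-root calculation described above can be sketched like this. The function name `geometric_mean` and the sample numbers are assumptions. For growth rates, the usual practice is to apply it to growth factors (1 + rate) rather than to the rates themselves, which is how the second example is set up.

```python
import math


def geometric_mean(values):
    """nth root of the product of n positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs positive values")
    return math.prod(values) ** (1 / len(values))


# Simple check: the square root of 2 * 8 = 16.
print(geometric_mean([2, 8]))

# Hypothetical growth rates of +10% and -10%, written as growth factors.
factors = [1.10, 0.90]
average_factor = geometric_mean(factors)
print(average_factor - 1)   # average growth rate per period (slightly negative)
```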
What is a percentile?
provides information about how the data are spread over the interval from the smallest value to the largest value
Admission test scores for colleges and universities are frequently reported in terms of percentiles
What is the p-th percentile of a data set?
a value such that at least p percent of the items take on this value or less and at least (100 - p) percent of the items take on this value or more.
For example, the 10th percentile of a data set is a value such that at least 10% of the items are less than or equal to it and at least 90% of the items are greater than or equal to it
How to calculate a percentile?
Arrange the Data: Sort the data set in ascending order.
Determine the Position (i):
Calculate the position using the formula:
$i = \frac{p}{100} \times n$, where p is the desired percentile and n is the number of observations
Locate the Percentile:
If i is an integer, the p-th percentile is the average of the values at positions i and i + 1.
If i is not an integer, round up to the next whole number; the p-th percentile is the value at that position.
Example of percentile calculation
Consider a data set: 7, 10, 15, 20, 25.
To find the 40th percentile:
Arrange the Data: The data is already in ascending order.
Determine the Position (i):
p=40
n=5
$i = \frac{40}{100} \times 5 = 2$
Locate the Percentile:
Since i = 2 is an integer, the 40th percentile is the average of the values at positions 2 and 3.
Values at positions 2 and 3 are 10 and 15, respectively.
40th percentile = (10 + 15)/2 = 12.5
Therefore, the 40th percentile of this data set is 12.5.
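The position rule from the cards can be sketched in Python and checked against the worked example. The function name `percentile` is an assumption, as is the restriction to 0 < p < 100, which is added only to keep positions inside the data. Library routines often use interpolation instead and may return different values.

```python
import math


def percentile(values, p):
    """p-th percentile using the position rule i = (p / 100) * n."""
    if not 0 < p < 100:
        raise ValueError("this sketch assumes 0 < p < 100")
    ordered = sorted(values)          # arrange in ascending order
    i = p * len(ordered) / 100        # position (1-based)
    if i.is_integer():
        i = int(i)
        # integer position: average the values at positions i and i + 1
        return (ordered[i - 1] + ordered[i]) / 2
    # non-integer position: round up and take the value at that position
    return ordered[math.ceil(i) - 1]


data = [7, 10, 15, 20, 25]
print(percentile(data, 40))   # positions 2 and 3 are averaged
print(percentile(data, 50))   # i = 2.5 rounds up to position 3
```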
What are quartiles?
specific percentiles
first quartile = 25th percentile
second quartile = 50th percentile = median
third quartile = 75th percentile
What do measures of variability (dispersion) help us to understand?
how data points spread out from the centre (mean or median). This is useful in decision-making, such as evaluating supplier delivery times, stock price volatility, or quality control in manufacturing.
What are the 5 main measures of variability (dispersion)?
- Range
- Interquartile Range (IQR)
- Variance
- Standard Deviation
- Coefficient of Variation (CV%)
What is the range?
The range of a data set is the difference between the largest and smallest data values.
It is the simplest measure of variability.
It is very sensitive to the smallest and largest data values.
How to calculate the range?
Range = largest value - smallest value
What is the interquartile range?
The interquartile range of a data set is the difference between the third quartile and the first quartile.
It is the range for the middle 50% of the data.
It overcomes the sensitivity to extreme data values.
How to calculate the interquartile range?
IQR = 3rd quartile - 1st quartile
How is a box plot drawn?
A box is drawn with its ends located at the 1st and 3rd quartiles.
A vertical line is drawn in the box at the location of the median (second quartile).
Dashed lines (whiskers) are drawn from the ends of the box to the smallest and largest data values inside the limits.
Data outside these limits are considered outliers.
The location of each outlier is shown with the symbol *.
How to calculate the lower and upper limits used to identify outliers in a box plot?
the lower limit is located 1.5(IQR) below Q1
the upper limit is located 1.5(IQR) above Q3
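The range, the IQR and the box plot limits can be combined in one sketch. The function names, the dictionary keys and the data are hypothetical. Quartiles are computed with the percentile position rule from the earlier cards, which is an assumption, since other quartile conventions give slightly different limits.

```python
import math


def percentile(values, p):
    """p-th percentile using the position rule i = (p / 100) * n."""
    ordered = sorted(values)
    i = p * len(ordered) / 100
    if i.is_integer():
        i = int(i)
        return (ordered[i - 1] + ordered[i]) / 2
    return ordered[math.ceil(i) - 1]


def box_plot_summary(values):
    """Range, quartiles, IQR, limits and outliers for a box plot."""
    q1 = percentile(values, 25)
    q3 = percentile(values, 75)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr      # 1.5(IQR) below Q1
    upper_limit = q3 + 1.5 * iqr      # 1.5(IQR) above Q3
    return {
        "range": max(values) - min(values),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "outliers": [v for v in values if v < lower_limit or v > upper_limit],
    }


# Hypothetical data with one unusually large value.
summary = box_plot_summary([2, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

In this example the single large value widens the range considerably while leaving the IQR unchanged, which is the sensitivity difference the cards describe.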
What is the variance?
The variance is the average of the squared differences between each data value and the mean.
The variance is a measure of variability that utilises all the data.
It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population).
What is the variance equation?
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
for a sample
$x_i$ - each individual data point
$\bar{x}$ - sample mean
$n$ - sample size
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
for a population
$x_i$ - each individual data point
$\mu$ - population mean
$N$ - total number of data points in the population
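A minimal sketch contrasting the two variance formulas. The function names and the five data values are hypothetical. The only difference between the two functions is the divisor: n - 1 for a sample, N for a population.

```python
def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def population_variance(values):
    """Sum of squared deviations from the population mean, divided by N."""
    big_n = len(values)
    mean = sum(values) / big_n
    return sum((x - mean) ** 2 for x in values) / big_n


# Hypothetical data: the mean is 44 and the squared deviations sum to 256.
data = [46, 54, 42, 46, 32]
print(sample_variance(data))       # 256 / 4
print(population_variance(data))   # 256 / 5
```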
What is the standard deviation?
The standard deviation of a data set is the positive square root of the variance.
It is measured in the same units as the data, making it more easily interpreted than the variance.
How to calculate standard deviation?
$s = \sqrt{s^2} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
for a sample
$x_i$ - each individual data point
$\bar{x}$ - sample mean
$n$ - sample size
$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
for a population
$x_i$ - each individual data point
$\mu$ - population mean
$N$ - total number of data points in the population
What is the coefficient of variation?
how large the standard deviation is in relation to the mean
How do you calculate the coefficient of variation?
$\mathrm{CV} = \frac{s}{\bar{x}} \times 100\%$
for a sample
$s$ - sample standard deviation
$\bar{x}$ - sample mean
$\mathrm{CV} = \frac{\sigma}{\mu} \times 100\%$
for a population
$\sigma$ - population standard deviation
$\mu$ - population mean
Show an example of variance, standard deviation and coefficient of variation linked together
Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2{,}996.16$
Standard deviation: $s = \sqrt{s^2} = \sqrt{2{,}996.16} = 54.74$
Coefficient of variation: $\frac{s}{\bar{x}} \times 100\% = \frac{54.74}{490.84} \times 100\% = 11.15\%$
the standard deviation is about 11% of the mean
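The arithmetic in this example can be reproduced with a short sketch. The variance (2,996.16) and the mean (490.84) are the figures quoted on the card; the underlying data are not shown, so the code starts from those two numbers. The function name `coefficient_of_variation` is an assumption.

```python
import math


def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


variance = 2996.16   # sample variance quoted on the card
mean = 490.84        # sample mean quoted on the card

std_dev = math.sqrt(variance)                       # positive square root
cv_percent = coefficient_of_variation(std_dev, mean)

print(round(std_dev, 2))      # standard deviation, same units as the data
print(round(cv_percent, 2))   # coefficient of variation in percent
```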
What are the 2 measures of association between 2 variables?
- covariance
- correlation coefficient
What is the covariance a measure of?
a measure of the linear association between two variables.
Positive values indicate a positive relationship. Negative values indicate a negative relationship.
How do you calculate the covariance?
$s_{XY} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
for samples
$x_i$, $y_i$ - individual data points for variables X and Y
$\bar{x}$, $\bar{y}$ - sample means of variables X and Y
$n$ - sample size
$\sigma_{XY} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
for populations
$\mu_x$, $\mu_y$ - population means of X and Y
$N$ - population size
What is the correlation coefficient?
Quantifies the strength and direction of the linear relationship between two variables. Correlation does not imply causation: two variables can be highly correlated without one being the cause of the other.
The coefficient can take on values between -1 and +1.
Values near -1 indicate a strong negative linear relationship.
Values near +1 indicate a strong positive linear relationship
How to calculate correlation coefficient?
$r_{XY} = \frac{s_{XY}}{s_X s_Y}$
$= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
for samples
$x_i$, $y_i$ - individual data points for variables X and Y
$\bar{x}$, $\bar{y}$ - sample means of variables X and Y
$n$ - number of data points
$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$
$= \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum (x_i - \mu_x)^2 \sum (y_i - \mu_y)^2}}$
for populations
$x_i$, $y_i$ - individual data points for variables X and Y
$\mu_x$, $\mu_y$ - population means for X and Y
$N$ - population size (number of data points)
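The sample covariance and sample correlation formulas can be sketched from paired data as follows. The function names and all data values are hypothetical. The correlation is computed from the sums of cross-products and squared deviations, which is equivalent to dividing the covariance by the product of the two sample standard deviations because the n - 1 terms cancel.

```python
import math


def sample_covariance(xs, ys):
    """Sum of cross-products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """Cross-product sum divided by the root of the product of squared-deviation sums."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(sample_covariance(xs, ys))    # positive: the variables move together
print(sample_correlation(xs, ys))   # between -1 and +1

# Points that lie exactly on a straight line give a correlation of +1 or -1.
print(sample_correlation(xs, [3, 5, 7, 9, 11]))
print(sample_correlation(xs, [10, 8, 6, 4, 2]))
```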
How are values of the correlation coefficient interpreted?
Positive Correlation: If r>0, as one variable increases, the other tends to increase.
Negative Correlation: If r<0, as one variable increases, the other tends to decrease.
No Correlation: If r=0, there is no linear relationship between the two variables.
Strength:
Strong: r near 1 or -1
Weak: r near 0
What do perfect correlation coefficients look like?
Perfect Positive Correlation (r=1): A straight line with a positive slope (both variables increase together in perfect proportion).
Perfect Negative Correlation (r=-1): A straight line with a negative slope (one variable increases as the other decreases in perfect proportion).
No Correlation (r=0): No linear pattern in the data.
Example of covariance and correlation coefficient calculation linked together
Sample covariance: $s_{XY} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-35.4}{6 - 1} = -7.08$
Sample correlation coefficient: $r_{XY} = \frac{s_{XY}}{s_X s_Y} = \frac{-7.08}{(8.2192)(0.8944)} = -0.9631$
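The two steps of this example can be checked with a short sketch. The inputs (a cross-product sum of -35.4, n = 6, and standard deviations of 8.2192 and 0.8944) are the figures quoted on the card; the underlying data are not shown. The function names are assumptions.

```python
def covariance_from_sum(sum_cross_products, n):
    """Sample covariance from the sum of cross-products of deviations."""
    return sum_cross_products / (n - 1)


def correlation_from_parts(covariance, s_x, s_y):
    """Sample correlation from the covariance and the two standard deviations."""
    return covariance / (s_x * s_y)


s_xy = covariance_from_sum(-35.4, 6)                  # -35.4 / (6 - 1)
r_xy = correlation_from_parts(s_xy, 8.2192, 0.8944)   # s_xy / (s_x * s_y)

print(round(s_xy, 2))   # sample covariance
print(round(r_xy, 4))   # sample correlation coefficient
```

A value this close to -1 indicates a strong negative linear relationship.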