Lecture 4 REVISED Flashcards
continuous variable
can take on any value in an interval
e.g., a worker’s hourly income can take on any value from 0 upwards
discrete variable
can only take on set, distinct values within an interval
e.g., the number of people who chose blue as their favourite colour can only be a whole number
what levels of measurement are required for continuous variables?
interval or ratio
rectangle in a histogram is called a…
bin
how does a discrete/continuous data distribution look on a graph?
discrete: bars
continuous: curve
frequency distribution is…
a tabular summary of a dataset showing the frequency of items in each class
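A minimal sketch of building such a table, assuming a small made-up list of survey answers (the data and the names `answers` and `frequency` are illustrative only):

```python
from collections import Counter

# Hypothetical survey answers to "what is your favourite colour?"
answers = ["blue", "red", "blue", "green", "red", "blue"]

# Counter maps each class (colour) to its frequency
frequency = Counter(answers)

# Print the table, most frequent class first
for colour, count in frequency.most_common():
    print(f"{colour:<8}{count}")
```

Each row pairs a class with how many items fall in it, and the counts add up to the size of the dataset.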
symmetric, skewness, kurtosis in frequency distributions?
symmetric: the distribution splits into two mirror-image halves about its centre
skewness: level of asymmetry, in which an elongated tail extends to one side (right tail = positive skew, left tail = negative skew)
kurtosis: degree of peakedness/steepness in a distribution
when a distribution is perfectly symmetrical, what is the relationship between the mean and median?
mean and median are the same values
when a distribution is skewed, the mean is pulled towards the tail, so the two values differ
why does the median tend to be more representative than the mean?
because if a distribution isn’t symmetrical, outliers pull the mean towards the tail, while the median is barely affected
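A small illustration of this effect, using made-up hourly wages (the numbers are assumptions chosen only to show the pattern):

```python
from statistics import mean, median

# Hypothetical hourly wages; the second list adds one extreme value
wages = [12, 13, 14, 15, 16]
wages_with_outlier = wages + [200]

# Symmetric data: mean and median agree
print(mean(wages), median(wages))

# One outlier drags the mean far more than the median
print(mean(wages_with_outlier), median(wages_with_outlier))
```

Without the outlier both measures are 14. With it, the mean jumps to 45 while the median only moves to 14.5.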
where is the mode in a frequency distribution?
the peak
what formula is used to find the position of the median value?
(n+1) / 2
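A short sketch of how the position formula can be used, assuming the data are sorted first and that a position ending in .5 means averaging the two neighbouring values (the function names are hypothetical):

```python
import math

def median_position(n):
    """1-based position of the median in a sorted dataset of n values."""
    return (n + 1) / 2

def median_value(values):
    ordered = sorted(values)
    pos = median_position(len(ordered))
    # For odd n, floor and ceil give the same position;
    # for even n, they give the two middle positions to average.
    lower = ordered[math.floor(pos) - 1]
    upper = ordered[math.ceil(pos) - 1]
    return (lower + upper) / 2

print(median_position(5), median_value([7, 1, 3, 9, 5]))
print(median_position(6), median_value([1, 3, 5, 7, 9, 11]))
```

With 5 values the median sits at position 3; with 6 values the position is 3.5, so the 3rd and 4th values are averaged.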
what is the formula to calculate standard deviation?
- subtract the mean from each value
- square all the deviations and add them together
- divide this by (n-1)
- square root this figure
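The four steps above can be sketched in code as follows. This is an illustration on made-up data, and the function name `sample_std` is hypothetical:

```python
import math

def sample_std(values):
    n = len(values)
    mean = sum(values) / n
    # Step 1: subtract the mean from each value
    deviations = [x - mean for x in values]
    # Step 2: square the deviations and add them together
    sum_of_squares = sum(d ** 2 for d in deviations)
    # Step 3: divide by (n - 1)
    variance = sum_of_squares / (n - 1)
    # Step 4: take the square root
    return math.sqrt(variance)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical dataset, mean = 5
print(sample_std(data))
```

For this dataset the squared deviations sum to 32, so the result is the square root of 32/7.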
what does standard deviation tell us about the dataset?
how far the values typically lie from the mean
small standard deviation = low amount of variability, values are close to the mean
high standard deviation = high variability, values are far from the mean
variance relationship with standard deviation?
standard deviation is the square root of the variance
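Written out for a sample, the relationship looks like this. The symbols are notation assumed here, not taken from the lecture: x_i is each value, x̄ the sample mean, n the number of values, s² the variance and s the standard deviation.

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad
s = \sqrt{s^2}
```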
density curve
an idealised description of a data distribution
describes the overall pattern of a distribution
disadvantage of variance for practical applications?
its units are the square of the variable’s units (e.g., income in dollars gives a variance in dollars squared)
this is why standard deviation, which is in the original units, is more commonly reported as a measure of dispersion
if the dataset is a sample/population, how is the standard deviation denoted and calculated?
sample: denoted s, calculated by dividing the sum of squared deviations by n-1, then taking the square root
population: denoted σ (sigma), calculated by dividing the sum of squared deviations by n, then taking the square root
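Python's standard library has one function for each case, which makes the difference easy to see. The dataset below is made up for illustration:

```python
from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical dataset

# Sample standard deviation (s): divides by n - 1
s = stdev(data)

# Population standard deviation (sigma): divides by n
sigma = pstdev(data)

print(s, sigma)
```

Dividing by the smaller number n-1 always makes the sample figure slightly larger than the population figure for the same data.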
mean absolute deviation (MAD)
measures the average absolute distance/deviation of values in a dataset from the mean
how is MAD calculated in a sample/population?
divide the sum of the absolute deviations by the number of data points (the signs must be dropped first, because the raw deviations always sum to zero)
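A minimal sketch of the calculation on made-up data (the function name `mean_absolute_deviation` is hypothetical):

```python
def mean_absolute_deviation(values):
    n = len(values)
    mean = sum(values) / n
    # Take the absolute value of each deviation before summing,
    # otherwise positive and negative deviations cancel out
    return sum(abs(x - mean) for x in values) / n

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical dataset, mean = 5
print(mean_absolute_deviation(data))
```

Here the absolute deviations are 3, 1, 1, 1, 0, 0, 2 and 4, which sum to 12, giving a MAD of 12/8 = 1.5.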
what does MAD indicate?
how spread out data is
percentile
describes the percentage of data values that fall at or below another data value
how to calculate percentiles?
(p/100)n
the percentile in question (p) divided by 100, multiplied by the number of data values (n) in the sorted dataset; the result is the position of that percentile
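A sketch of the position formula in use. The rule for turning a position into a value is an assumption here (a common textbook convention: round a fractional position up, and average the value at a whole-number position with the next one); the lecture may use a different convention. The function names are hypothetical.

```python
import math

def percentile_position(p, n):
    """Position of the p-th percentile among n sorted values: (p/100)n."""
    return p * n / 100

def percentile_value(values, p):
    """Assumed convention; valid for 0 < p < 100."""
    ordered = sorted(values)
    pos = percentile_position(p, len(ordered))
    if pos != int(pos):
        # Fractional position: round up to the next whole position
        return ordered[math.ceil(pos) - 1]
    # Whole position: average this value and the next one
    pos = int(pos)
    return (ordered[pos - 1] + ordered[pos]) / 2

scores = list(range(1, 11))  # hypothetical data: 1, 2, ..., 10
print(percentile_position(25, len(scores)), percentile_value(scores, 25))
print(percentile_position(50, len(scores)), percentile_value(scores, 50))
```

For these ten values the 25th percentile has position 2.5, which rounds up to the 3rd value, and the 50th percentile has position 5, so the 5th and 6th values are averaged.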
quartiles
specific percentiles dividing the data into four parts
first/lower quartile corresponds to the 25th percentile (Q1)
second quartile (median) corresponds to the 50th percentile (Q2)
third (upper) quartile corresponds to the 75th percentile (Q3)
fourth quartile corresponds to the maximum
interquartile range
the difference between the third and first quartile
Q3 - Q1
the range for the middle 50% of the data
overcomes the sensitivity of the full range (max - min) to extreme data values
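An illustration of that robustness on made-up data, using the standard library's `statistics.quantiles`. Its default quartile convention may differ slightly from the one used in the lecture, so exact quartile values can vary between methods; `interquartile_range` is a hypothetical helper name.

```python
from statistics import quantiles

def interquartile_range(values):
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3
    q1, _q2, q3 = quantiles(values, n=4)
    return q3 - q1

data = list(range(1, 12))            # hypothetical data: 1, 2, ..., 11
with_outlier = data[:-1] + [1000]    # replace the maximum with an extreme value

# Range (max - min) versus IQR, before and after adding the outlier
print(max(data) - min(data), interquartile_range(data))
print(max(with_outlier) - min(with_outlier), interquartile_range(with_outlier))
```

The range jumps from 10 to 999 when the outlier is added, while the IQR stays at 6 because Q1 and Q3 are unchanged.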