Topic 2 - Descriptive Statistics Flashcards
How many general ways are there to describe data (descriptive statistics) numerically for 1 variable?
2 ways
How can you describe data numerically?
1) Measures of Central Tendency- arithmetic mean, median, mode
2) Measures of Variation or Dispersion- range, interquartile range, variance, standard deviation
How do you denote the population arithmetic mean?
The population mean is denoted by μ (the Greek letter mu)
How do you denote the sample arithmetic mean?
x̄ (x bar: an x with a horizontal line over the top)
How do you denote the population size?
Capital/uppercase N
How do you denote the sample size?
Lowercase n
How do you calculate the median?
The middle value of the data once it is sorted in ascending order
What is a negative of the mean as a measure of central tendency?
It is very sensitive to outliers/extreme values
What is a negative of the median as a measure of central tendency?
It ignores all values apart from the middle value(s)- it does not take into account how much higher or lower the other values are
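Example- a minimal sketch of the two negatives above, using Python's standard `statistics` module. The salary figures are made-up values chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical salaries in thousands (illustrative values only)
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [500]   # add one extreme value

mean_before = mean(salaries)          # 175 / 5
mean_after = mean(with_outlier)       # 675 / 6
median_before = median(salaries)      # middle of 30, 32, 35, 38, 40
median_after = median(with_outlier)   # mean of the middle two: 35 and 38

print(mean_before, mean_after)        # the mean jumps from 35 to 112.5
print(median_before, median_after)    # the median only moves from 35 to 36.5
```

The mean is dragged towards the outlier, while the median barely moves because it only looks at the middle of the ordered data.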
How do you calculate the median when there are an even number of ordered data items?
Take the mean of the middle 2 values, i.e. the halfway point between them
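Example- a short sketch of the median rule for odd and even numbers of data items. The function name `median_of` is just an illustrative choice, not something from the notes.

```python
def median_of(values):
    """Return the median: the middle value of the data in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd number of items: a single middle value exists
        return ordered[mid]
    # Even number of items: take the mean of the middle 2 values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_result = median_of([7, 1, 5])       # ordered: 1, 5, 7
even_result = median_of([8, 2, 4, 6])   # ordered: 2, 4, 6, 8

print(odd_result)    # 5
print(even_result)   # 5.0 (halfway between 4 and 6)
```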
What are the advantages and disadvantages of the mode?
Advantages:
- not affected by extreme values
- can be used for either numerical (quantitative) or categorical (qualitative) data
Disadvantages:
- there may be no mode (if every value occurs equally often) or there may be several modes (if more than one value occurs most frequently)
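Example- a small sketch of these cases using `multimode` from Python's standard `statistics` module (available from Python 3.8). Note that `multimode` returns every value when all are tied, which corresponds to the "no mode" case described above.

```python
from statistics import multimode

one_mode = multimode([1, 2, 2, 3])                 # a single most frequent value
several_modes = multimode([1, 1, 2, 2, 3])         # two values tie for most frequent
all_tied = multimode([1, 2, 3])                    # every value occurs once: "no mode"
categorical = multimode(["red", "blue", "red"])    # also works for categorical data

print(one_mode)        # [2]
print(several_modes)   # [1, 2]
print(all_tied)        # [1, 2, 3]
print(categorical)     # ['red']
```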
What do measures of central tendency show?
Single value that attempts to describe a set of data by identifying the typical value within that set of data
What do measures of variation or dispersion show?
How spread out the data is- how variable it is- the dispersion of the data
What are the disadvantages of the range as a measure of dispersion/variation?
1) It ignores the way in which data is distributed e.g. the range won’t take into account whether there is an even distribution of data among all data items or whether data is concentrated in the low, middle or high end- it is only concerned with the lowest and highest value in the data set
2) It is also sensitive to outliers/extreme values- one extreme value will have a massive impact on the range
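Example- a brief sketch of both disadvantages. The three data sets are made-up values, and `value_range` is an illustrative helper name.

```python
def value_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)

evenly_spread = [1, 3, 5, 7, 9]
clustered_low = [1, 1, 1, 2, 9]    # most data sits at the low end
with_outlier = [1, 3, 5, 7, 90]    # one extreme value replaces the 9

print(value_range(evenly_spread))  # 8
print(value_range(clustered_low))  # 8 (same range, very different distribution)
print(value_range(with_outlier))   # 89 (a single outlier dominates the range)
```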
What are quartiles?
Quartiles split the ranked data into 4 segments with an equal number of values per segment
The point which marks the end of the 1st quartile is known as the lower quartile (Q1)
The point which marks the end of the 2nd quartile is known as the median (Q2)
The point which marks the end of the 3rd quartile is known as the upper quartile (Q3)
How do you calculate the interquartile range and what are its advantages?
Interquartile range (IQR) = 3rd quartile (Q3) – 1st quartile (Q1)
Advantages:
- it reduces the effect of outliers because it only takes into account the middle 50% of the data, so it is unlikely to be affected by extreme values
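Example- a sketch of the IQR using `statistics.quantiles` from the Python standard library. There are several conventions for locating quartiles; this uses the library's default ("exclusive") method, which is an assumption here and may give slightly different values from the method used in the lectures.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range = Q3 - Q1 (library default quartile method)."""
    q1, q2, q3 = quantiles(values, n=4)
    return q3 - q1

data = [1, 2, 3, 4, 5, 6, 7]
data_with_outlier = [1, 2, 3, 4, 5, 6, 700]   # the highest value becomes extreme

q1, q2, q3 = quantiles(data, n=4)
print(q1, q2, q3)               # 2.0 4.0 6.0

print(iqr(data))                # 4.0
print(iqr(data_with_outlier))   # 4.0 (unchanged by the outlier)
print(max(data_with_outlier) - min(data_with_outlier))  # 699 (the range is not)
```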
Define population variance
The population variance is the exact average (exact because you are taking data directly from the population) of the squared deviations of values from the mean
The population variance is a PARAMETER of the population
How do you calculate the population variance and the population standard deviation?
Population variance is denoted by σ² (lowercase Greek sigma, squared)- see image in notes
Population variance: σ² = Σ (x − μ)² / N, i.e. the sum over all values x of the squared deviation from the population mean μ, divided by the population size N
NOTE that σ² is the population variance itself, so the formula above gives the variance directly
Taking the square root of the population variance gives the population standard deviation, which is denoted by σ (sigma on its own)
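Example- a minimal sketch of the population formulas, checked against `pvariance` and `pstdev` from Python's standard `statistics` module. The data values and the helper name `population_variance` are illustrative assumptions.

```python
from math import sqrt, isclose
from statistics import pvariance, pstdev

def population_variance(values):
    """Sum of squared deviations from the population mean, divided by N."""
    size = len(values)                      # population size N
    mu = sum(values) / size                 # population mean
    return sum((x - mu) ** 2 for x in values) / size

population = [2, 4, 4, 4, 5, 5, 7, 9]       # treated as the whole population

sigma_squared = population_variance(population)   # 32 / 8
sigma = sqrt(sigma_squared)                       # square root of the variance

print(sigma_squared)   # 4.0
print(sigma)           # 2.0

# The standard library gives the same results
print(isclose(sigma_squared, pvariance(population)))  # True
print(isclose(sigma, pstdev(population)))             # True
```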
Define sample variance
The sample variance is approximately the average of the squared deviations of values from the sample mean (approximately, because it is calculated from a sample that is meant to represent the population rather than from the population directly)
The sample variance is a STATISTIC of the sample
How do you calculate the sample variance and the sample standard deviation?
Sample variance: s² = Σ (x − x̄)² / (n − 1), i.e. the sum over all sample values x of the squared deviation from the sample mean x̄, divided by the sample size n minus 1
Note- the sample variance is denoted by lowercase s², and its square root, s on its own, is the sample standard deviation
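Example- a minimal sketch of the sample formulas, checked against `variance` and `stdev` from Python's standard `statistics` module (both divide by n - 1). The data values and the helper name `sample_variance` are illustrative assumptions.

```python
from math import sqrt, isclose
from statistics import variance, stdev

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)                         # sample size n
    if n < 2:
        raise ValueError("at least 2 values are needed to divide by n - 1")
    x_bar = sum(values) / n                 # sample mean
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

sample = [2, 4, 4, 4, 5, 5, 7, 9]           # treated as a sample, not a population

s_squared = sample_variance(sample)         # 32 / 7
s = sqrt(s_squared)                         # sample standard deviation

print(round(s_squared, 3))   # 4.571
print(round(s, 3))           # 2.138

# The standard library gives the same results
print(isclose(s_squared, variance(sample)))  # True
print(isclose(s, stdev(sample)))             # True
```

With the same 8 values, dividing by n - 1 = 7 gives a slightly larger result than dividing by 8 would.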
Why do you divide by (n - 1) in the sample variance and not by n?
Because the observed values fall, on average, closer to the sample mean than to the population mean, the standard deviation which is calculated using deviations from the sample mean underestimates the desired standard deviation of the population
Using “n-1” instead of “n” as the divisor corrects for that by making the result a little bit bigger
Note that the correction has a larger proportional effect when “n” is small than when it is large, which is what we want because when “n” is larger the sample mean is likely to be a good estimator of the population mean
Note- we divide by (n - 1) rather than by the number of values (as in the population variance, which divides by N) because the sample variance should be an unbiased estimator of the population variance, i.e. the average of the sample variances over all possible samples should equal the population variance
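Example- a small sketch that checks the "average over all possible samples" idea directly. It assumes samples of size 2 are drawn with replacement from a tiny made-up population, so that every ordered pair is an equally likely sample; the helper name `variance_with` is illustrative.

```python
from itertools import product
from math import isclose

def variance_with(values, divisor):
    """Sum of squared deviations from the mean of `values`, over `divisor`."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / divisor

population = [1, 2, 3, 4]                       # hypothetical tiny population
n = 2                                           # sample size
samples = list(product(population, repeat=n))   # all 16 samples, with replacement

population_var = variance_with(population, len(population))

# Average the sample variances over every possible sample
avg_with_n_minus_1 = sum(variance_with(s, n - 1) for s in samples) / len(samples)
avg_with_n = sum(variance_with(s, n) for s in samples) / len(samples)

print(population_var)        # 1.25
print(avg_with_n_minus_1)    # 1.25  (matches the population variance)
print(avg_with_n)            # 0.625 (dividing by n underestimates it)
```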
What are the units of the population or sample standard deviation?
The units are the same as the original data e.g. if in litres then the standard deviation for the sample and the population is also in litres