L03 Descriptive Stats Flashcards
Descriptive statistics
Describe data through tables and graphs
Summarize through measures of central tendency and measures of spread
Two types of data
Discrete - set of fixed values (ordinal)
Continuous - any fractional value within a given range
Interval and ratio: either type
Represent frequencies of occurrence - nominal data
Frequency table or graph (bar graph; y axis n or %)
Represent frequencies of occurrence - discrete data
n or %; cumulative n or cumulative %
Frequency/Cumulative frequency table
Graph - bar graph
Frequency ranges for too many values
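As a rough illustration of the card above, the sketch below builds a frequency and cumulative frequency table for a small set of discrete scores. The data and the variable names are made up for the example; only the Python standard library is assumed.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical discrete scores (made up for illustration).
scores = [2, 3, 3, 1, 2, 3, 5, 2, 3, 1]

counts = Counter(scores)
values = sorted(counts)                  # distinct values in ascending order
freqs = [counts[v] for v in values]      # n for each value
cum_freqs = list(accumulate(freqs))      # cumulative n
total = len(scores)

# Print one row per value: n, %, cumulative n, cumulative %.
print("value  n   %      cum n  cum %")
for v, n, cn in zip(values, freqs, cum_freqs):
    print(f"{v:<6} {n:<3} {100 * n / total:<6.1f} {cn:<6} {100 * cn / total:.1f}")
```

The last cumulative n always equals the number of scores, which is a quick check that no value was dropped.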
Represent frequencies of occurrence - continuous data
Frequency table/ Cumulative frequency table
Graph - histogram
- frequency diagram/line chart/frequency polygon
Frequency ranges
Frequency ranges
Used when listing the frequency of every possible score is not feasible
Ranges or intervals depending on number of samples
More ranges: better visualisation
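A minimal sketch of grouping continuous values into ranges, assuming equal-width ranges that include their lower edge and exclude their upper edge. The measurements and the helper name `bin_counts` are hypothetical; the same data is grouped with two different widths to show how the number of ranges changes the picture.

```python
from collections import Counter

# Hypothetical continuous measurements (made up for illustration).
heights = [150.2, 153.7, 158.1, 161.4, 162.0, 165.5, 167.3, 171.8, 174.9, 179.6]

def bin_counts(data, start, width):
    """Count values per range [start + k*width, start + (k+1)*width).

    Ranges that contain no values are omitted from the result.
    """
    counts = Counter(int((x - start) // width) for x in data)
    return {(start + k * width, start + (k + 1) * width): counts[k]
            for k in sorted(counts)}

wide = bin_counts(heights, start=150, width=10)    # fewer, wider ranges
narrow = bin_counts(heights, start=150, width=5)   # more, narrower ranges

for (low, high), n in narrow.items():
    print(f"{low}-{high}: {n}")
```

With these values, the wide grouping gives 3 ranges and the narrow grouping gives 6, each covering all 10 measurements.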
Central tendency
Summary of the data through a single value that reflects the centre of its distribution
3 measures: mean, median, mode
Important in comparing two populations
Mode
Most common category or score - that occurs most frequently
Generally used only for nominal data
Median
Middle score/value/category when all values are placed in ascending order
Best for ordinal data
Also used for skewed interval/ratio data (insensitive to outliers)
Mean
Sum of all scores divided by the number of scores
Influenced heavily by outliers/extreme scores
Best for normally distributed data
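The three measures can be computed with Python's standard `statistics` module. This is only a sketch on made-up scores, chosen so that the three measures differ.

```python
import statistics

# Hypothetical scores (made up for illustration).
scores = [1, 2, 2, 3, 4, 7, 9]

mode = statistics.mode(scores)       # most frequent score: 2
median = statistics.median(scores)   # middle of the 7 sorted scores: 3
mean = statistics.mean(scores)       # 28 / 7 = 4

print(mode, median, mean)
```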
Mode pros and cons
+ can be used for categorical data
+ Always gives a real data value
+ Not affected by extremes
- can be more than one value (bimodal, multimodal)
- varies depending on bin size
- can be affected by a small number of cases
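A short sketch of the "more than one value" drawback, using `statistics.multimode` (available in Python 3.8 and later) on hypothetical categorical responses. It returns every value tied for most frequent, in the order first seen.

```python
import statistics

# Hypothetical categorical responses (made up for illustration).
colours = ["red", "blue", "blue", "green", "red"]

# "red" and "blue" both occur twice, so the data is bimodal.
modes = statistics.multimode(colours)

print(modes)
```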
Median pros and cons
+ Insensitive to outliers
+ Less affected by skew than the mean
+ Often gives real data value
- Ignores a lot of data
- Not easy to calculate without a computer
- Cannot be used in further calculations
- More affected by sampling fluctuations than the mean
Mean pros and cons
+ Uses all the data
+ tends to be stable in different samples
- Very sensitive to outliers and skew
- Doesn’t always give a meaningful value
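To illustrate the outlier point, the sketch below adds one extreme value to a made-up dataset and compares how far the mean and the median move. The numbers are hypothetical.

```python
import statistics

# Hypothetical values (made up for illustration).
values = [20, 22, 25, 27, 30]
with_outlier = values + [500]   # one extreme score added

mean_before = statistics.mean(values)            # 124 / 5 = 24.8
mean_after = statistics.mean(with_outlier)       # 624 / 6 = 104
median_before = statistics.median(values)        # 25
median_after = statistics.median(with_outlier)   # (25 + 27) / 2 = 26.0

print(mean_before, mean_after)
print(median_before, median_after)
```

Here the single outlier moves the mean from 24.8 to 104, while the median only moves from 25 to 26.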
Measures of spread or Dispersion
Variations in a dataset from the measure of central tendency
Measure of spread of Mode
None
Measure of spread of Median
Distance-based measures of spread
Range, Interquartile range
Measures of spread of Mean
Centre-based measures of spread
Variance, Standard deviation
Distance-based measures of spread
Report these with median
Range
Interquartile range
Range
Highest value - Lowest value
Very sensitive to outliers
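A tiny sketch of the range on made-up scores, with and without a single extreme value, to show why it is described as very sensitive to outliers.

```python
# Hypothetical scores (made up for illustration).
scores = [4, 7, 8, 10, 12]
with_outlier = scores + [95]

value_range = max(scores) - min(scores)                      # 12 - 4 = 8
range_with_outlier = max(with_outlier) - min(with_outlier)   # 95 - 4 = 91

print(value_range, range_with_outlier)
```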
Interquartile range
Range of middle 50% of scores
Q3 - Q1
Quartile
Lowest score needed to be included in a given quarter of the population
(Split the ordered set at the median Q2, then find the medians of the values to its left and right; the median itself is not included in either half)
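The split-at-the-median method can be sketched as below. The function name `quartiles` and the data are hypothetical, and other quartile conventions (for example those used by statistical software) can give slightly different values.

```python
import statistics

def quartiles(data):
    """Return (Q1, Q2, Q3) by splitting the sorted data at the median.

    With an odd number of values, the median is left out of both halves.
    Needs at least two values.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]                   # values to the left of the median
    upper = ordered[len(ordered) - half:]    # values to the right of the median
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

# Hypothetical scores (made up for illustration).
scores = [1, 3, 4, 6, 7, 8, 10, 12, 15]
q1, q2, q3 = quartiles(scores)   # 3.5, 7, 11.0
iqr = q3 - q1                    # 11.0 - 3.5 = 7.5

print(q1, q2, q3, iqr)
```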
Semi-quartile range
Mid-quartile range
Semi-QR = IQR/2 = (Q3 - Q1)/2
Mid-QR = (Q3 + Q1)/2
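A worked sketch of the two formulas, taking Q1 and Q3 as given. The function names and the quartile values are hypothetical.

```python
def semi_quartile_range(q1, q3):
    """Half the interquartile range: (Q3 - Q1) / 2."""
    return (q3 - q1) / 2

def mid_quartile_range(q1, q3):
    """Midpoint of the two quartiles: (Q3 + Q1) / 2."""
    return (q3 + q1) / 2

# Hypothetical quartiles (made up for illustration).
q1, q3 = 3.5, 11.0
semi_qr = semi_quartile_range(q1, q3)   # (11.0 - 3.5) / 2 = 3.75
mid_qr = mid_quartile_range(q1, q3)     # (11.0 + 3.5) / 2 = 7.25

print(semi_qr, mid_qr)
```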
IQR pros and cons
(Like median)
+ Less sensitive to outliers
+ Few assumptions
- Hard to calculate by hand for large datasets
- Doesn't use all the data
Centre-based measures of dispersion
Variance σ^2
Standard deviation σ
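A minimal sketch of both measures, assuming the data is treated as a whole population (which matches the σ notation): the variance is the mean squared distance from the mean, and the standard deviation is its square root. The scores are made up, and the results are cross-checked against `statistics.pvariance` and `statistics.pstdev`.

```python
import math
import statistics

# Hypothetical scores (made up for illustration).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mu = sum(scores) / len(scores)                                # 40 / 8 = 5.0
variance = sum((x - mu) ** 2 for x in scores) / len(scores)   # 32 / 8 = 4.0
std_dev = math.sqrt(variance)                                 # 2.0

# The standard library's population versions agree with the manual result.
print(variance, statistics.pvariance(scores))
print(std_dev, statistics.pstdev(scores))
```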