descriptive statistics (w4) Flashcards
what things can be used to describe data
histograms, central tendency, spread, shape, outliers, box plots
what are central tendencies
mode, median, mean
what are spreads of data
quantile/quartile/percentile
variance and standard deviation
z-score
what is the shape in terms of describing data
skewness, kurtosis
purpose of a histogram
to visualise how data are distributed
what is the mode, what types of variables can it be used for
most frequently occurring value (highest stack in a histogram), there can be multiple modes
all types of variables
what is the median, what types of variables can it be used for
the middle value of the ordered data, dividing it into 2 groups with the same number of values
only ordinal, interval, ratio (ordered variables)
what is the mean, what types of variables can it be used for
= ∑ coin values / number of coins
(sum of all values / total number of values)
only interval and ratio
which central tendency (mean, median, mode) would an outlier affect most
the mean, because it depends on the actual values (the median depends only on order, the mode only on frequency)
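A minimal sketch of the outlier effect, using Python's standard `statistics` module. The two data sets and the helper name `central_tendency` are made up for illustration; the second list is the first with its largest value replaced by an extreme one.

```python
from statistics import mean, median, multimode

def central_tendency(values):
    """Return (mean, median, list of modes) for a list of numbers."""
    return mean(values), median(values), multimode(values)

# Hypothetical data: identical except for the last value.
typical = [1, 2, 2, 3, 4]
with_outlier = [1, 2, 2, 3, 40]

# The mean moves a lot; the median and mode stay where they were.
print(central_tendency(typical))
print(central_tendency(with_outlier))
```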
why need spread of data
distributions can have same mean/median but one may be much more spread
how to calculate spread
divide the ordered data into sections containing the same number of data points
what are quantiles
cut-off points dividing the data into equal-sized sections, for N sections they are called N-quantiles (N-1 values)
what are quartiles
when there are 4 sections in total, they are called quartiles (1st-3rd), median is 2nd quartile
what are percentiles
when there are 100 sections in total, they are called percentiles (1st-99th), median is 50th percentile
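A small sketch of quartiles and percentiles with `statistics.quantiles` from the Python standard library. The data set (integers 1 to 11) is an assumption chosen so the cut-off points are easy to check by hand; software packages use slightly different interpolation rules, so values can differ a little between tools.

```python
from statistics import median, quantiles

# Hypothetical data set: the integers 1 to 11.
data = list(range(1, 12))

# n sections are separated by n - 1 cut-off points.
quartiles = quantiles(data, n=4)      # 3 values: 1st, 2nd, 3rd quartile
percentiles = quantiles(data, n=100)  # 99 values: 1st to 99th percentile

print(quartiles)
print(len(percentiles))

# The 2nd quartile and the 50th percentile are both the median.
print(quartiles[1], percentiles[49], median(data))
```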
what is the 2nd moment
how hard it is to spin the data around the mean
= ∑ [distance from mean to each data point]^2 / number of data points
= variance
what is standard deviation
square root of variance, the standard distance from the mean
what does mean +- SD show
where the centre is and how spread data points are around it
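A sketch of variance as the 2nd moment around the mean, dividing by the number of data points as on the card (the population form). The helper `central_moment` and the data are illustrative assumptions; `statistics.pvariance` is used only as a cross-check.

```python
import math
from statistics import fmean, pvariance

def central_moment(values, k):
    """k-th moment around the mean: average of (x - mean)^k."""
    m = fmean(values)
    return sum((x - m) ** k for x in values) / len(values)

# Hypothetical data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = central_moment(data, 2)  # 2nd moment around the mean
sd = math.sqrt(variance)            # standard deviation

print(f"mean +- SD: {fmean(data)} +- {sd}")
print("matches statistics.pvariance:", math.isclose(variance, pvariance(data)))
```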
what is the z-score and what’s it for
given the SD, a distance from the mean can be described as a ratio with respect to the SD, which is the z-score: z = (value - mean) / SD
enables fair comparisons of deviations
whose height deviates more from the mean:
female: 5.3 +- 0.3ft male: 5.8 +- 0.4ft
mary is 5.6, dave is 6.1
mary: (5.6-5.3)/0.3 = 1
dave: (6.1-5.8)/0.4 = 0.75
mary’s height deviates more from the mean than dave’s
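The height comparison can be checked with a few lines of Python. The function name `z_score` is just an illustrative choice; the numbers are the ones from the card.

```python
def z_score(value, mean, sd):
    """Distance from the mean, expressed in units of SD."""
    return (value - mean) / sd

mary = z_score(5.6, mean=5.3, sd=0.3)  # about 1
dave = z_score(6.1, mean=5.8, sd=0.4)  # about 0.75

# Round for display: floating-point arithmetic is not exact.
print(round(mary, 2), round(dave, 2))
print("mary deviates more" if abs(mary) > abs(dave) else "dave deviates more")
```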
what is skewness
measures degree of asymmetry
what is 3rd moment, how do you make it dimensionless
3rd moment = ∑ [distance from mean to each data point]^3 / number of data points
to make it dimensionless, divide it by SD^3
ie: skewness = 3rd moment/SD^3
what does: 0 and high skewness mean
0 = data symmetrically distributed
high = distribution highly asymmetrical
what does + and - skewness mean
+ long tail to the right, bulk of data on the left (right-skewed)
- long tail to the left, bulk of data on the right (left-skewed)
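A sketch of skewness as 3rd moment / SD^3, using the divide-by-N moments from the cards. The helper names and the three small data sets are assumptions for illustration: one symmetric, one with a long tail on the right (bulk of the data on the left), and its mirror image.

```python
import math
from statistics import fmean

def central_moment(values, k):
    """k-th moment around the mean: average of (x - mean)^k."""
    m = fmean(values)
    return sum((x - m) ** k for x in values) / len(values)

def skewness(values):
    """Dimensionless skewness: 3rd moment divided by SD^3."""
    sd = math.sqrt(central_moment(values, 2))
    return central_moment(values, 3) / sd ** 3

# Hypothetical data sets.
symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 2, 3, 10]          # one large value stretches the right tail
left_tail = [-x for x in right_tail]   # mirror image: tail on the left

print(skewness(symmetric))   # zero for symmetric data
print(skewness(right_tail))  # positive
print(skewness(left_tail))   # negative
```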
what is kurtosis
how sharp the peak of the distribution is (and how heavy its tails are)
what is the 4th moment, how do you make it dimensionless
4th moment = ∑ [distance from mean to each data point]^4 / number of data points
to make it dimensionless, divide it by SD^4
ie: kurtosis = 4th moment/SD^4
what is excess kurtosis
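kurtosis minus 3: kurtosis is always +ve and a normal distribution has kurtosis 3, so subtracting 3 makes the normal distribution the reference point (excess kurtosis 0)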
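A sketch of kurtosis as 4th moment / SD^4 and of excess kurtosis as kurtosis - 3, again with divide-by-N moments. The helper names and both data sets are illustrative assumptions: a flat, evenly spread set and a set with most values at the centre plus two far-out values.

```python
from statistics import fmean

def central_moment(values, k):
    """k-th moment around the mean: average of (x - mean)^k."""
    m = fmean(values)
    return sum((x - m) ** k for x in values) / len(values)

def kurtosis(values):
    """Dimensionless kurtosis: 4th moment / SD^4 (SD^4 = variance squared)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

def excess_kurtosis(values):
    """Kurtosis relative to a normal distribution, whose kurtosis is 3."""
    return kurtosis(values) - 3

# Hypothetical data sets.
flat = [1, 2, 3, 4, 5]              # evenly spread, no heavy tails
heavy_tails = [0] * 8 + [-10, 10]   # mostly at the centre, two far-out values

print(kurtosis(flat), excess_kurtosis(flat))                # excess below 0
print(kurtosis(heavy_tails), excess_kurtosis(heavy_tails))  # excess above 0
```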
what are the 1st, 2nd, 3rd and 4th moment around the mean
1st - mean (strictly the 1st moment around zero, the 1st moment around the mean is always 0)
2nd - variance (how spread)
3rd - skewness (how skewed/distorted)
4th - kurtosis (how sharp-peaked / heavy-tailed)
what are outliers
extreme values relative to bulk of values in a data set
what can outliers be due to
inaccuracies in data processing, problems with methodology (measures, instruments, participants not following instructions), an actual extreme value from an unusual participant
how to detect outliers (2 ways)
based on z-score
based on IQR (interquartile range: the width between the 1st and 3rd quartiles)
how to detect outlier based on z-score
outlier if z-score is more than 3 or less than -3
ie: distance from mean is more than 3x SD
how to detect outlier based on IQR
outlier if the value is more than 1.5 IQR above the 3rd quartile or more than 1.5 IQR below the 1st quartile
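A sketch of both detection rules using only the standard library. The function names, the default thresholds as parameters, and the data are assumptions for illustration. The lower IQR fence is measured from the 1st quartile, as is conventional, and the SD is the divide-by-N form.

```python
from statistics import fmean, pstdev, quantiles

def z_score_outliers(values, threshold=3.0):
    """Values whose distance from the mean exceeds threshold * SD.

    Assumes the data are not all identical (SD > 0).
    """
    m = fmean(values)
    sd = pstdev(values)
    return [x for x in values if abs(x - m) / sd > threshold]

def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below the 1st or above the 3rd quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

# Hypothetical data: 20 values between 48 and 52 plus one extreme value.
data = [48, 49, 50, 51, 52] * 4 + [120]

print(z_score_outliers(data))
print(iqr_outliers(data))
```

With very small samples the z-score rule can never fire, because a single value cannot sit more than 3 SD from a mean it strongly influences; the IQR rule is less affected by this.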
what is a box plot
a plot summarising the quartile-based statistics of a data set
what does a box plot include
location of quartiles
range of data excluding outliers
outliers detected by the IQR rule
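A sketch of the numbers a box plot draws, computed without any plotting library. The function name `box_plot_stats`, the dictionary keys, and the data are illustrative assumptions; plotting tools may place quartiles and whiskers slightly differently depending on their interpolation rule.

```python
from statistics import quantiles

def box_plot_stats(values, k=1.5):
    """Quartiles, whisker ends (range excluding outliers) and outliers."""
    q1, q2, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in values if x < lower_fence or x > upper_fence)
    return {
        "quartiles": (q1, q2, q3),               # the box and its median line
        "whiskers": (min(inside), max(inside)),  # range of data excluding outliers
        "outliers": outliers,                    # drawn as individual points
    }

# Hypothetical data: 20 values between 48 and 52 plus one extreme value.
data = [48, 49, 50, 51, 52] * 4 + [120]

print(box_plot_stats(data))
```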