Describing data Flashcards
Central tendency
- the middle of the collected data
- mean, median and mode are all measures of central tendency
Mean
- sum of scores divided by number of scores
- influenced by all available scores
- easily influenced by outliers
- the larger the sample, the closer the sample mean tends to be to the true population mean
Geometric mean
- found by log-transforming the individual observations, averaging the logs, then back-transforming the average using the antilog
- for positively skewed data this is closer to the median than the arithmetic mean is, because the log-transformed values have a more symmetrical distribution
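The log, average, antilog recipe can be sketched as below. This is a minimal illustration, not a reference implementation: the function name and the data are made up, and it assumes every value is greater than zero.

```python
import math

def geometric_mean(values):
    """Log-transform, average the logs, then back-transform with the antilog."""
    logs = [math.log(v) for v in values]   # assumes all values > 0
    mean_log = sum(logs) / len(logs)
    return math.exp(mean_log)

data = [1, 10, 100]                # made-up, positively skewed values
gm = geometric_mean(data)          # close to 10, which is also the median here
am = sum(data) / len(data)         # arithmetic mean, pulled up by the value 100
print(gm, am)
```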
Weighted mean
- used when some observations are more or less valuable than others when reaching a summary measure
- individual values are multiplied by weights (constants) attached to them before averaging
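A small sketch of a weighted mean, assuming the usual convention of dividing the weighted sum by the sum of the weights. The scores and weights are invented for illustration.

```python
def weighted_mean(values, weights):
    """Multiply each value by its weight, then divide by the total weight."""
    weighted_sum = sum(v * w for v, w in zip(values, weights))
    return weighted_sum / sum(weights)

scores = [70, 80, 90]    # hypothetical observations
weights = [1, 1, 2]      # the last observation counts double
wm = weighted_mean(scores, weights)   # (70 + 80 + 180) / 4
print(wm)
```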
Median
- the point value that divides a distribution into two equal sized groups
- half of the scores fall below it, half above
- aka 50th percentile
- not as influenced by extreme scores as the mean, but it ignores most of the available information
- it is preferable for ordinal data when treated as values (not as counts)
Mode
- the most commonly occurring value in a distribution
- crude measure, mostly used for nominal data (frequencies)
- also useful for ordinal data to understand the most common rating obtained on a Likert scale
- like the median, it ignores most of the available information
- in a bimodal distribution, two values occur equally frequently
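To compare the three measures of central tendency, the sketch below uses Python's standard `statistics` module on made-up scores, with and without one extreme value. It assumes Python 3.8 or later for `multimode`.

```python
import statistics

scores = [2, 3, 3, 4, 5]          # made-up scores
with_outlier = scores + [40]      # same scores plus one extreme value

mean_before = statistics.mean(scores)
mean_after = statistics.mean(with_outlier)      # pulled upward by the outlier
median_before = statistics.median(scores)
median_after = statistics.median(with_outlier)  # barely moves
mode_after = statistics.mode(with_outlier)      # most common value is unchanged

# A bimodal set: two values tie for the highest frequency.
modes = statistics.multimode([1, 1, 2, 2, 3])

print(mean_before, mean_after, median_before, median_after, mode_after, modes)
```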
Skew
- in a normal (symmetric) distribution, the mean, median and mode are equal
- positive skew: high-value outliers are present, making the mean higher than the median and the right tail longer than the left
- negative skew: low-value outliers make the mean lower than the median and the left tail longer than the right
Range
- difference between the highest and lowest scores in a distribution
- easily determined when the data is arranged in a rank order (ascending or descending)
- very distorted by extreme scores
Interquartile range
- the difference between the 75th and 25th percentile values
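A quick sketch of range and interquartile range on made-up data. Quartile conventions differ between textbooks and software, so the exact percentile values may vary slightly elsewhere; this uses the default method of `statistics.quantiles` (Python 3.8+).

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]   # made-up, already in rank order

data_range = max(data) - min(data)            # highest minus lowest score

# quantiles(..., n=4) returns the three cut points: 25th, 50th, 75th percentile
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```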
Variance
- variance = sum of squared differences of individual observations from the mean / (number of observations - 1)
- N - 1 is the degrees of freedom
- variance is high when scores are widely scattered
- low variance when scores cluster around mean
- expressed as squared units of the original measure
Standard deviation
- square root of variance
- measures dispersion
- estimates the variability of the sample and tells us the distribution of individual data points around the mean
Coefficient of variation
- obtained by dividing the standard deviation by the mean and expressing this as a percentage
- measure of relative spread of the data
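Because the coefficient of variation is a ratio, it does not depend on the unit of measurement. The sketch below illustrates this with invented heights expressed in centimetres and in metres; the helper name is hypothetical.

```python
import statistics

def coefficient_of_variation(sample):
    """Sample SD divided by the mean, expressed as a percentage."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100

heights_cm = [150, 160, 170, 180, 190]        # made-up data
heights_m = [h / 100 for h in heights_cm]     # same data in metres

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)    # same relative spread
print(round(cv_cm, 2), round(cv_m, 2))
```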
Standard error of the mean
- standard deviation divided by square root of sample size
- a larger sample gives a smaller SE
- describes the precision and uncertainty with which the sample represents the underlying population
- SE is always smaller than SD when the sample has more than one observation
- shows us how precise our estimate of the mean is
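A minimal sketch of the standard error, using made-up numbers. The larger sample is simply the small one repeated four times, which is artificial but shows the effect of sample size on SE.

```python
import math
import statistics

def standard_error(sample):
    """Sample SD divided by the square root of the sample size."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

small = [4, 6, 8, 10]     # made-up sample, n = 4
large = small * 4         # same values repeated, n = 16

se_small = standard_error(small)
se_large = standard_error(large)   # smaller, because n is larger
print(se_small, se_large)
```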
Box and whisker plot
- whiskers denote the range
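- black horizontal line is the median
- rectangle (box) spans the 25th to the 75th percentile, i.e. from the end of the 1st quartile to the beginning of the 4th quartile, so its length is the interquartile range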
Stem and leaf plot
- the first few digits of the numerical observations (stems) are plotted along a vertical axis, then single digits (leaves) are added beside each stem to represent individual values
e.g.
1: 1 2 3 4 5
2: 2 2 4 5
3: 3 6 6 9 9 9
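The example plot can be rebuilt from its raw values with a short sketch like the one below. It assumes non-negative whole numbers where the last digit is the leaf and the remaining digits are the stem; the function name is made up.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return one text line per stem, with leaves in ascending order."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 25 -> stem 2, leaf 5
        stems[stem].append(leaf)
    return [
        f"{stem}: {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    ]

values = [11, 12, 13, 14, 15, 22, 22, 25, 24, 36, 36, 33, 39, 39, 39]
lines = stem_and_leaf(values)
print("\n".join(lines))
```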
Normal distribution
- 68% of data will lie within 1 SD of the mean
- 95% will lie within 2 SD
- 99.7% will lie within 3 SD
- kurtosis (flatness of the curve) = 0 when measured as excess kurtosis
- tails of the curve approach the X axis but never touch it
- SD is denoted by sigma (σ)
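The proportions within 1, 2 and 3 SD can be checked numerically. This sketch uses `statistics.NormalDist` from the Python standard library (3.8+); the helper name is made up.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def proportion_within(k):
    """Proportion of a normal distribution lying within k SD of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    # roughly 68%, 95% and 99.7%
    print(k, round(proportion_within(k) * 100, 1))
```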
Standard normal distribution
- normal distribution whose mean is 0 and SD is 1 unit
Standard normal deviate
- expression denoted by z
- z = (random value x - mean) / SD
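A one-line sketch of the z formula, with illustrative numbers (a value of 130 from a distribution with mean 100 and SD 15). The function name is hypothetical.

```python
def z_score(x, mean, sd):
    """Number of SDs that x lies above (positive) or below (negative) the mean."""
    return (x - mean) / sd

z = z_score(130, mean=100, sd=15)   # 30 units above the mean, i.e. 2 SD
print(z)
```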
High mean
- shifts the curve to the right
Low mean
- shifts the curve to the left
Higher SD
- decreases the peakedness of the curve
Lesser SD
- increases the peakedness of the curve
Leptokurtic curve
- sharp peak
- high kurtosis means a high, sharp peak near the mean
- low kurtosis (a platykurtic curve) gives a flat top near the mean rather than a sharp peak
How to calculate SD
- first work out the variance:
- subtract the mean from each individual score
- square the differences
- add them together
- then divide by N - 1 (N = number of samples)
- then take the square root of the variance to get the SD
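The steps above can be followed line by line in code. This is a sketch on made-up scores, with the standard library's `statistics.variance` and `statistics.stdev` used only as a cross-check.

```python
import math
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up scores

n = len(scores)
mean = sum(scores) / n                           # 5.0
deviations = [x - mean for x in scores]          # subtract the mean from each score
squared = [d ** 2 for d in deviations]           # square the differences
sum_sq = sum(squared)                            # add them together
variance = sum_sq / (n - 1)                      # divide by N - 1
sd = math.sqrt(variance)                         # square root of the variance

print(variance, sd)
print(statistics.variance(scores), statistics.stdev(scores))  # should agree
```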