Lecture 2 - Describing & Summarising Data + Normal Distribution Flashcards
measures of central tendency:
mode (most frequent value), arithmetic mean (x̄) & median (middle value in a ranked dataset)
what measure of central tendency is affected the most by extreme values?
the mean is affected most by extreme values; the median is affected much less
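A minimal sketch of this effect, using Python's standard `statistics` module. The numbers are made up purely for illustration: one extreme value is appended to a small dataset and the shift in each measure is compared.

```python
from statistics import mean, median

# Hypothetical dataset; the values are invented for illustration only.
values = [2, 3, 4, 5, 6]
with_outlier = values + [100]  # add a single extreme value

# How far does each measure move once the extreme value is included?
mean_shift = mean(with_outlier) - mean(values)
median_shift = median(with_outlier) - median(values)

print("shift in mean:  ", mean_shift)
print("shift in median:", median_shift)
```

In this made-up example the mean moves by many units while the median barely moves, which is why the median is often preferred for data with extreme values.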
how do you round your mean values?
you always round your mean values to one decimal place (e.g. 4.988 → 5.0)
what do histograms primarily show?
frequency
what does a positive skew graph look like?
a bell shape with the peak shifted to the left and a long tail stretching to the right
what does a negative skew graph look like?
a bell shape with the peak shifted to the right and a long tail stretching to the left
the more variability in our data…
… the less certain we can be about the estimates from the data, such as the mean
sum of squares:
total sum of squares = sum, over all observations, of (value in the sample - mean value of the sample)^2
what is the problem with the sum of squares equation?
the more data points you have, the bigger the sum of squares value will be
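A small sketch of this problem, assuming made-up data: repeating the same values ten times leaves the spread unchanged but multiplies the sum of squares by ten. Dividing by the number of observations (which gives the population variance; samples conventionally divide by n - 1) removes that dependence. The helper name `sum_of_squares` is hypothetical.

```python
def sum_of_squares(values):
    """Sum of squared deviations of each value from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)

# Hypothetical data: same spread, ten times as many observations.
small = [2, 4, 6]
large = small * 10

ss_small = sum_of_squares(small)
ss_large = sum_of_squares(large)

# Dividing by the number of observations makes the two comparable.
var_small = ss_small / len(small)
var_large = ss_large / len(large)

print("sum of squares:", ss_small, "vs", ss_large)
print("variance:      ", var_small, "vs", var_large)
```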
unreliability is proportional to:
variance
standard deviation equation:
standard deviation = √( sum of (each value - mean)^2 / size of population ); for a sample, divide by (sample size - 1) instead
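The equation can be worked through step by step as follows. This is only a sketch on invented numbers; it computes the population form (dividing by the number of values) and, for comparison, the sample form (dividing by n - 1), then cross-checks both against the standard library's `pstdev` and `stdev`.

```python
import math
from statistics import pstdev, stdev

# Hypothetical dataset, invented for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)

m = sum(values) / n                       # mean
ss = sum((x - m) ** 2 for x in values)    # sum of squared deviations

sd_population = math.sqrt(ss / n)         # divide by the number of values
sd_sample = math.sqrt(ss / (n - 1))       # divide by n - 1 for a sample

print("population SD:", sd_population, "(library:", pstdev(values), ")")
print("sample SD:    ", sd_sample, "(library:", stdev(values), ")")
```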
what does standard error of the mean calculate and how does it differ from standard deviation?
standard error measures the scatter of sample mean values, whereas the standard deviation measures the scatter of the raw data values (observations)
Two Standard Error rules of thumb:
1) standard error is a measure of how confident we are that our sample mean is close to the population mean
2) in 95.5% of cases the population mean will fall within ca. 2 standard errors of the sample mean
Gaussian Distribution:
the same as the normal distribution; it is a common continuous probability distribution
it is bell-shaped, asymptotic at the extremes and symmetrical around the mean with no skew: mean = median = mode
area under the curve is directly proportional to the relative frequency of observations and their probability (p)
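To make the link between area and probability concrete, here is a rough sketch using `statistics.NormalDist` (available in Python 3.8+). It assumes a standard normal curve (mean 0, SD 1) and measures the area within 1 and 2 standard deviations of the mean as a difference of cumulative probabilities.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1 (an assumption
# made for illustration; any mean and SD give the same proportions).
z = NormalDist(mu=0, sigma=1)

# Area under the curve between two points = probability of falling there.
within_1 = z.cdf(1) - z.cdf(-1)
within_2 = z.cdf(2) - z.cdf(-2)

print(f"within 1 SD of the mean: {within_1:.4f}")
print(f"within 2 SD of the mean: {within_2:.4f}")
```

The second figure is where the "roughly 95% within about 2 standard errors" rule of thumb comes from.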
what is the Gaussian (Normal) Distribution important for?
statistical analysis
describe the features of a box-&-whisker plot:
the central line is the median, the top line of the box is the 3rd quartile, the bottom line of the box is the 1st quartile, the box itself spans the interquartile range, and the whiskers extend to the largest and smallest data values
IQR equation:
IQR = [3rd Quartile] - [1st Quartile]
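A short sketch of the IQR calculation with `statistics.quantiles`. The data are invented, and the choice of `method="inclusive"` is an assumption: several quartile conventions exist and they can give slightly different values on small datasets.

```python
from statistics import quantiles

# Hypothetical dataset, invented for illustration.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into quarters and returns the three cut points:
# 1st quartile, median, 3rd quartile.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

iqr = q3 - q1

print("1st quartile:", q1)
print("median:      ", q2)
print("3rd quartile:", q3)
print("IQR:         ", iqr)
```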
what does the location of the median within a box plot give information regarding?
the placement of a median within the box plot gives information regarding skewness in a dataset
what are the variabilities and uncertainties for the following central tendencies?
1) mean
2) median
mean = variance, SD, SE of the mean, 95% confidence interval
median = interquartile range
standard error of the mean calculation:
standard error = SD / √(number of observations in the sample)
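A minimal sketch of this calculation, together with the "about 2 standard errors" rule of thumb. The sample values are invented, and using the sample SD (n - 1 denominator, via `statistics.stdev`) is an assumption about which SD is intended.

```python
import math
from statistics import mean, stdev

# Hypothetical sample of measurements, invented for illustration.
sample = [4.1, 5.0, 4.8, 5.3, 4.6, 5.2, 4.9, 5.1]
n = len(sample)

sample_mean = mean(sample)
se = stdev(sample) / math.sqrt(n)   # SD divided by the square root of n

# Rule of thumb: the population mean lies within about 2 SE of the
# sample mean in roughly 95% of cases.
lower = sample_mean - 2 * se
upper = sample_mean + 2 * se

print(f"mean = {sample_mean:.2f}, SE = {se:.3f}")
print(f"approximate 95% interval: {lower:.2f} to {upper:.2f}")
```

Because n sits under a square root, quadrupling the number of observations only halves the standard error.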
continuous variable:
values within a range, can be measured (e.g. size: 130cm, 27cm etc)
discrete variable:
fixed values, integer, can be counted (e.g. no. of chromosomes)
ordinal variable:
n factor levels with implicit order (e.g. size class: small, medium & large)
nominal variable:
n factor levels without implicit order (e.g. eye colour: grey, blue, brown etc / treatment: sham vs. testosterone)
two types of numerical (quantitative) variable:
continuous (values within a range, measured) and discrete (fixed integer values, counted)
two types of categorical (qualitative) variable:
nominal (n factor levels without implicit order: eye colour, testosterone vs sham) and ordinal (n factor levels with implicit order: small, medium, large)