Lecture 1 - Basics Flashcards
What is a parameter?
an attribute of a population, or relationship between populations or variables
What is a population?
Complete set of items about which you want to draw inferences
What is a sample?
a random subset of the population
exploratory vs confirmatory
Exploratory is generating hypotheses, confirmatory is testing hypotheses
how to calc SE
SD divided by the square root of the sample size (n)
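A minimal sketch of the SE calculation, assuming the SD meant here is the sample SD (n - 1 denominator). The function name `standard_error` and the data values are hypothetical, chosen only for illustration.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    n = len(values)
    sd = statistics.stdev(values)  # sample SD, uses the n - 1 denominator
    return sd / math.sqrt(n)


# Hypothetical measurements, for illustration only
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
se = standard_error(data)
print(f"SE = {se:.3f}")
```

Because n is under the square root, quadrupling the sample size only halves the SE.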
type 1 error
false rejection of H0
type 2 error
false acceptance of H0 (failing to reject H0 when it is actually false)
internal validity
extent to which you can believe the results of the study
external validity
extent to which results apply to the real world
observational vs experimental
in an experiment, only one factor varies between conditions, so if there's a difference you can be confident it is causal
what is the mean? (for a normal distribution)
value that minimises the sum of SQUARED deviations of data values from it
The average (i.e. sum(x)/N)
The central value
The most likely value to get if you were to sample from the population
Centre of ‘mass’
what is the median?
value that minimises the sum of absolute (UNSQUARED) deviations of data values from it
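The two "minimises the deviations" definitions can be checked numerically. The sketch below is only an illustration: it tries a grid of candidate centre values and picks the one with the smallest sum of squared deviations and the one with the smallest sum of absolute deviations. The data and the grid are hypothetical, and the sample has an odd number of values so that the median is a single value.

```python
import statistics


def sum_squared_dev(values, c):
    """Sum of squared deviations of the data from candidate centre c."""
    return sum((x - c) ** 2 for x in values)


def sum_abs_dev(values, c):
    """Sum of absolute (unsquared) deviations of the data from c."""
    return sum(abs(x - c) for x in values)


# Hypothetical data with one large value, so mean and median differ
data = [1.0, 2.0, 3.0, 4.0, 10.0]

# Candidate centres 0.00, 0.01, ..., 10.00 (a coarse grid search)
candidates = [i / 100 for i in range(0, 1001)]

best_sq = min(candidates, key=lambda c: sum_squared_dev(data, c))
best_abs = min(candidates, key=lambda c: sum_abs_dev(data, c))

print(best_sq, statistics.mean(data))     # squared deviations -> mean
print(best_abs, statistics.median(data))  # absolute deviations -> median
```

The large value pulls the mean upwards but leaves the median where it is, which is why the median is the more robust of the two when outliers are present.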
which test if data is categorical?
chi squared
which test if data is ranked or ordinal?
non-parametric tests:
• Mann-Whitney U (or Wilcoxon-Mann-Whitney) – non-parametric counterpart of the t-test for comparing medians of 2 independent groups (groups must be completely independent of each other)
• Wilcoxon one-sample test
• Wilcoxon signed-rank test – non-parametric counterpart of the paired t-test; it compares two related samples, matched samples, or repeated measurements on a single sample
• Kruskal-Wallis (non-parametric counterpart of one-way ANOVA with independent measures) – extends Mann-Whitney U to more than 2 samples. It compares two or more independent samples of equal or different sample sizes
• Friedman test (non-parametric counterpart of one-way ANOVA with repeated measures)
• Spearman correlation (non-parametric version of Pearson correlation)
what test to use for interval or ratio data?
parametric: t-test, ANOVA, regression, etc.
what is an outlier? and who made this definition
an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism - Hawkins
real life examples of when we look for outliers
Fraud detection & credit card theft- Unusual spending patterns
Medical diagnosis - Problems suggested by test results that don’t fit normal pattern for age/sex/history
Detecting drug cheats in sport - Abnormally high/low blood steroids, etc.
Detecting measurement errors or unusual events
3 ways to detect outliers
graphs used to detect them
boxplot - what do points need to be to count as outliers?
Analyse sample - data points that are ‘far from’ the rest are outliers
Fit model to data – outliers are points that don’t fit that model
Either approach can be statistical or purely graphical (‘by eye’)
Start with graphical methods
can use histogram, scatterplot, normal probability plot (or QQ plot), boxplot. In a boxplot, points more than one-and-a-half inter-quartile ranges above the upper quartile, or more than one-and-a-half inter-quartile ranges below the lower quartile, are plotted as circles and so identified as possible outliers.
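The boxplot rule can be sketched as a small function. This is an illustration only: the name `boxplot_outliers` and the data are hypothetical, and quartiles are computed with the "inclusive" method of Python's `statistics.quantiles`. Plotting software may use a slightly different quartile convention, so borderline points can be classified differently.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return points more than k IQRs beyond the quartiles."""
    # Quartile convention is an assumption; other tools may differ slightly
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr  # lower fence
    upper = q3 + k * iqr  # upper fence
    return [x for x in values if x < lower or x > upper]


# Hypothetical measurements with one suspicious value
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(boxplot_outliers(data))
```

Points flagged this way are only possible outliers; whether to keep or remove them still depends on what is known about how they were generated.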
what is the R package for outliers? and what could be a drawback?
Dixon test (only works with small sample sizes, <30)
which two tests are robust to even quite large violations of normality and homogeneity of variance and when do they become less so?
ANOVA and t-test - One-way ANOVA only starts to give odd results if largest variance is >9x smallest
what is homogeneity of variance? what test is it used in and when can this not be violated?
The assumption of homogeneity of variance is that the variance within each of the populations is equal. This is an assumption of analysis of variance (ANOVA). ANOVA works well even when this assumption is violated except in the case where there are unequal numbers of subjects in the various groups.
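A rough way to check these two cards against data is to compare the largest and smallest group variances and to see whether the group sizes are equal. The sketch below is a simple descriptive check, not a formal test of homogeneity; the function name `variance_ratio`, the group labels and the data are all hypothetical, and the 9x threshold is the rule of thumb quoted above for one-way ANOVA.

```python
import statistics


def variance_ratio(groups):
    """Largest sample variance divided by smallest, across groups."""
    variances = [statistics.variance(g) for g in groups]
    if min(variances) == 0:
        raise ValueError("a group has zero variance; ratio is undefined")
    return max(variances) / min(variances)


# Hypothetical scores for three groups, for illustration only
groups = {
    "a": [4, 5, 6, 5, 4],
    "b": [2, 5, 8, 5, 1],
    "c": [5, 6, 5, 4, 5],
}

ratio = variance_ratio(list(groups.values()))
equal_sizes = len({len(g) for g in groups.values()}) == 1

print(f"largest/smallest variance = {ratio:.1f}")
print("exceeds 9x rule of thumb:", ratio > 9)
print("equal group sizes:", equal_sizes)
```

A large ratio combined with unequal group sizes is the situation these cards warn about.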