Lecture 1 - Basics Flashcards
What is a parameter?
an attribute of a population, or relationship between populations or variables
What is a population?
The complete set of items about which you want to draw inferences
What is a sample?
a random subset of the population
exploratory vs confirmatory
Exploratory is generating hypotheses, confirmatory is testing hypotheses
how to calc SE
SD divided by the square root of the sample size (SE = SD / √n)
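A minimal R sketch (x here is a made-up sample):
x <- c(4.2, 5.1, 6.3, 5.8, 4.9)    # example data
se <- sd(x) / sqrt(length(x))      # SE = SD / sqrt(n)
se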
type 1 error
false rejection of H0
type 2 error
false acceptance of H0
internal validity
extent to which you believe results of the study
external validity
extent to which results apply to the real world
observational vs experimental
In an experiment, only one factor varies between conditions, so if there's a difference you can be confident it is causal; in an observational study other factors can covary with the one of interest
what is the mean? for norm dist
value that minimises the sum of SQUARED deviations of data values from it
The average (i.e. sum(x)/N)
The central value
The most likely value to get if you were to sample from the population
Centre of ‘mass’
what is the median?
value that minimises the sum of absolute UNSQUARED deviations of data values from it
which test if data is categorical?
chi squared
which test if data is ranked or ordinal?
non-parametric:
• Mann-Whitney U (or Wilcoxon-Mann-Whitney) – the non-parametric equivalent of the t-test; compares the medians of 2 independent groups (groups must be completely independent of each other)
• Wilcoxon one-sample test
• Wilcoxon signed-rank test – same as a paired t-test; it compares two related samples, matched samples, or repeated measurements on a single sample
• Kruskal-Wallis (one-way ANOVA with independent measures) – extends Mann-Whitney U to >2 samples; it compares two or more independent samples of equal or different sample sizes
• Friedman test (one-way ANOVA with repeated measures)
• Spearman correlation (non-parametric version of Pearson correlation)
what test to use for interval or ratio data?
Parametric tests: t-test, ANOVA, regression, etc.
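An illustrative R sketch of the test-by-data-type cards above (d is a made-up data frame):
set.seed(1)
d <- data.frame(group = rep(c("A", "B"), each = 10),
                x = runif(20),
                y = c(rnorm(10, 5), rnorm(10, 6)))
chisq.test(table(d$group, d$y > 5.5))   # categorical counts: chi-squared test
t.test(y ~ group, data = d)             # interval/ratio: two-sample t-test
summary(aov(y ~ group, data = d))       # interval/ratio: one-way ANOVA
summary(lm(y ~ x, data = d))            # interval/ratio: regression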
what is an outlier? and who made this definition
an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism - Hawkins
real life examples of when we look for outliers
Fraud detection & credit card theft- Unusual spending patterns
Medical diagnosis - Problems suggested by test results that don’t fit normal pattern for age/sex/history
Detecting drug cheats in sport - Abnormally high/low blood steroids, etc.
Detecting measurement errors or unusual events
3 ways to detect outliers
graphs used to detect them
boxplot: how far do points need to be to show as outliers?
Analyse sample - data points that are ‘far from’ the rest are outliers
Fit model to data – outliers are points that don’t fit that model
Either approach can be statistical or purely graphical (‘by eye’)
Start with graphical methods
Can use a histogram, scatterplot, normal probability plot (or QQ plot), or boxplot. In a boxplot, points higher than one-and-a-half inter-quartile ranges above the upper quartile, or lower than one-and-a-half inter-quartile ranges below the lower quartile, are plotted as circles and so identified as possible outliers.
what is the R package for outliers? and what could be a drawback?
Dixon test; drawback: it only works with small sample sizes (< 30)
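A hedged R sketch of the graphical checks and the Dixon test; assuming the package meant is 'outliers' (which provides dixon.test), with made-up data:
set.seed(1)
x <- c(rnorm(20), 8)           # sample with one extreme value
hist(x)                        # histogram
boxplot(x)                     # circles = points beyond 1.5 x IQR from the quartiles
qqnorm(x); qqline(x)           # normal probability (QQ) plot
# install.packages("outliers")
library(outliers)
dixon.test(x)                  # only valid for small samples (n < 30)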
which two tests are robust to even quite large violations of normality and homogeneity of variance and when do they become less so?
ANOVA and t-test - One-way ANOVA only starts to give odd results if largest variance is >9x smallest
what is homogeneity of variance? what test is it used in and when can this not be violated?
The assumption of homogeneity of variance is that the variance within each of the populations is equal. This is an assumption of analysis of variance (ANOVA). ANOVA works well even when this assumption is violated except in the case where there are unequal numbers of subjects in the various groups.
what are ‘robust’ statistical methods? (there are 4 answers)
• Non-parametric tests – simple and accepted, but the range of tests is limited
• Parametric tests on ranked data
• Permutation tests – flexible, but require a decent sample size per group (see lecture on Monte Carlo methods)
• Parametric methods that work on the ‘middle’ of the data
non-parametric equivalents of the following: one-sample t-test, two-sample t-test, paired t-test, one-way ANOVA, one-way repeated-measures ANOVA, correlation
one-sample t-test = Wilcoxon one-sample
two-sample t-test = Mann-Whitney U
paired t-test = Wilcoxon matched-pairs signed-rank
one-way ANOVA = Kruskal-Wallis
one-way repeated-measures ANOVA = Friedman test
correlation = Spearman
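The same mapping as R calls (made-up data, just to show the functions):
set.seed(2)
y1 <- rnorm(12); y2 <- rnorm(12, 0.5)
wilcox.test(y1, mu = 0)                   # Wilcoxon one-sample
wilcox.test(y1, y2)                       # Mann-Whitney U
wilcox.test(y1, y2, paired = TRUE)        # Wilcoxon matched-pairs signed-rank
kruskal.test(list(y1, y2, rnorm(12)))     # Kruskal-Wallis
friedman.test(cbind(y1, y2, rnorm(12)))   # Friedman (rows = subjects, columns = conditions)
cor.test(y1, y2, method = "spearman")     # Spearman correlation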
when residuals are normally distributed (and so parametrics are OK) the non-parametric equivalents are … less powerful.
5%
When residuals are non-normal, what has more power? non para or para?
non parametric
what are false positives?
type 1 error
pros and cons of parametric tests on ranked data
Doing parametric stats on rank-transformed data seems to give you the same Type I error rate and power as the custom non-parametric test
… and has the bonus of more possible tests being available (i.e. any parametric test)
But, you can’t interpret the exact form of a relationship sensibly, so beware
limitation of reg on ranked data
Regression of rank(Y) on rank(X) tells you they are related, but not the shape of curve
robust regression reduces the influence of outliers
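A sketch of both points, using lm() on ranks and rlm() from MASS as one common robust-regression option (data made up):
set.seed(4)
x <- runif(30, 1, 10)
y <- x^3 + rnorm(30, sd = 20)          # curved but monotonic relationship
summary(lm(rank(y) ~ rank(x)))         # shows they are related, but not the cubic shape
library(MASS)                          # ships with R
summary(rlm(y ~ x))                    # M-estimation down-weights outlying points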
why are missing values bad?
Lead to unbalanced design and so…
Loss of power (a shame but not tragic)
types of missing values
Missing Completely At Random - good
Missing At Random = Unobserved data are not random but follow the same pattern as the observed values
Not Missing At Random = Unobserved values are different from the observed ones – a bigger problem.
when are missing values okay?
when they are missing completely at random, or they are not random but follow the same pattern as the observed values.
when are missing values bad?
if not missing at random and if the unobserved values are different from the observed ones
what to do if missing at random?
ignore NA’s, unless repeated measures..
Step 1: investigate missing values – where are they?
Find whether the ‘NA’s are randomly distributed with respect to the predictors and responses
can do Loglinear model on missing/non-missing to see if they are missing at random!
Or can replace with values that don’t bias subsequent analyses, such as the most common value for that variable if it is normally distributed
BUT if the data are skewed, use the median!
Or use other variables that are correlated, OR use the 10 nearest neighbours
what does the DMwR package use as the ‘most central value’ when imputing?
median for numeric variables
the mode for categorical variables
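A hedged sketch, assuming the function meant is centralImputation() (DMwR2 provides the same helpers; knnImputation() is the nearest-neighbour option); the data frame is made up:
# install.packages("DMwR2")
library(DMwR2)
d <- data.frame(x = c(1, 2, NA, 4, 5),
                g = factor(c("a", "a", "b", NA, "b")))
centralImputation(d)        # NA -> median for numeric columns, mode for categorical columns
# knnImputation(d, k = 10)  # alternative: fill NAs from the 10 nearest neighbours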
survival analysis examples
-Parametric – specify a distribution, e.g. exponential, Weibull
-Cox regression – ‘semi-parametric’; assumes survival curves have the same shape, but different rates
-Non-parametric – can only test one factor
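One possible R sketch using the survival package (ships with R) and its built-in lung data:
library(survival)
fit_par <- survreg(Surv(time, status) ~ sex, data = lung, dist = "weibull")  # parametric
fit_cox <- coxph(Surv(time, status) ~ sex, data = lung)                      # semi-parametric Cox
survdiff(Surv(time, status) ~ sex, data = lung)                              # non-parametric log-rank test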
what to do with outliers?
delete if known cause.
Use robust statistics to reduce their influence
Or replace with ‘non-influential’ values
what is censoring?
Data points where the actual value isn’t known but you can set boundaries on what it must have been
Censoring is a special case of ‘partial information’ (not missing) that can be dealt with by survival analysis
what is the basis of a statistical model?
Pattern observed = signal + noise
the mean is a fixed signal, as is a slope, as is the difference between two means
Random normal variation – the ‘noise’
what are we modelling?
Not the data, but the population from which our sample came
overfitting – how can it come about?
More parameters can mean a better fit, but there is a risk of overfitting: you can just end up redescribing the data
Can all population parameters be estimated accurately with a large sample size?
no, better to start with the mean - can estimate the population mean from your sample mean and the likely range of values around this from your sample standard deviation
why is SD biased?
Because the values in a sample are more likely to come from near the mean
So the sample is unlikely to have as many extreme values as the parent population
So the variance (and stdev) of the sample are lower than that of the population
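This is why the sample variance uses an n − 1 denominator (Bessel’s correction); a quick R check with made-up data:
set.seed(5)
x <- rnorm(50)
sum((x - mean(x))^2) / length(x)    # divide by n: biased low for a sample
var(x)                              # divide by n - 1: R's (corrected) default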
how does a t test calculate its value?
the sample mean measured in “standard errors”: divide the sample mean (minus the hypothesised mean) by the standard error
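A one-sample sketch in R (made-up data, testing against a hypothesised mean of 0):
set.seed(6)
x <- rnorm(15, mean = 0.4)
(mean(x) - 0) / (sd(x) / sqrt(length(x)))   # (sample mean - H0 mean) / SE
t.test(x, mu = 0)$statistic                 # t.test() gives the same value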
what is power?
what do you need to know to calc it?
The probability of rejecting the null hypothesis when it is false, i.e. a correct rejection of H0 (power = 1 - the probability of a Type II error)
To calculate it you need the effect size and the expected SD
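In R, power.t.test() does this calculation; fix the effect size (delta), SD, and alpha, then solve for n or for power:
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)   # n per group needed
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)        # power for n = 20 per group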
internal validity
Internal validity is the extent to which you believe the results of the study
Factors that increase internal validity:
• Homogeneous sample (e.g. single strain, single sex, single genotype)
• Homogeneous conditions (constant temperature, humidity, lighting regime, diet)
• Single experimenter administering treatments
Is it ever justified to raise the threshold p-value?
Yes, if a Type II error is far worse than a Type I error
problems of multiple testing
With multiple tests (e.g. 10 tests at p < 0.05), the chance of at least one test being ‘significant’ is around 40% (1 − 0.95^10 ≈ 0.40) EVEN WHEN THE NULL HYPOTHESIS IS TRUE!
solutions to avoid multiple testing
Fix primary and secondary dependent variables a priori
Control experiment-wise (family-wise) alpha and adjust test-specific alpha accordingly
Use multivariate methods
Data reduction
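One way to control family-wise alpha in R is p.adjust() on the raw p-values (values below are made up):
p <- c(0.003, 0.012, 0.04, 0.21, 0.60)
p.adjust(p, method = "bonferroni")   # simple family-wise control
p.adjust(p, method = "holm")         # family-wise control, less conservative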
pseudoreplication?
Improper inflation of sample size due to non-independence of data
objects in R include..
Single variables
Arrays and matrices
Structures (containing, e.g., different types of variables: text or numeric)
sep “”
sep ‘\t’
sep ‘\n’
sep “” = space delimited data
sep ‘\t’ = tab-delimited data
sep ‘\n’ = new line delimited data
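As read into R (file names are placeholders):
d1 <- read.table("data.txt", sep = "", header = TRUE)     # space/whitespace-delimited
d2 <- read.table("data.tsv", sep = "\t", header = TRUE)   # tab-delimited
lines <- scan("data.txt", what = "", sep = "\n")           # one entry per line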
correlation vs regression?
Correlation requires both variables to be normally distributed; for regression, the residuals around the line relating y to x must be normally distributed – x need not be normal (nor even y)
Ordinary Least Squares
we minimise the sum of the squared residuals (+ve and −ve differences have the same effect)
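A small lm() sketch tying the two cards together (made-up data); lm() fits by ordinary least squares, and the residuals are what must look normal:
set.seed(3)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
fit <- lm(y ~ x)
sum(residuals(fit)^2)                            # the quantity OLS minimises
qqnorm(residuals(fit)); qqline(residuals(fit))   # check residual normality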
MA vs RMA (diffs in variance)
Major Axis (MA) regression – if the variances are similar
Reduced Major Axis (RMA) regression – if the variances are unequal