Modelling Flashcards
multilevel modelling operates between 2 extremes
data are so highly correlated that n=1 per unit, and data are completely uncorrelated
works by estimating how much ‘pooling’ or shrinkage the data need, so the effective df lies between 1 for each unit and 1 for each observation
what are summary measures not good with?
unbalanced design
what is a binomial distribution?
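the number of ‘successes’ in a fixed number of independent trials, each with two possible outcomes and the same probability of success (the two outcomes need not be equally likely)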
what is a binary response?
Y/N, absent/present
what does a general linear model assume about dist?
assumes that unexplained variation is normally distributed
what does generalised linear model assume about dist?
assumes that unexplained variation can follow some other known distribution
types of distributions and what they are
log-normal = if effects are multiplicative not additive
exponential = e.g. latency/survival; the probability of the event remains constant over time, waiting for an event
Weibull = survival with non-constant mortality
Poisson = random, rare, discrete events
negative binomial = clustered discrete events
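A minimal sketch of why Poisson suits random counts and negative binomial suits clustered ones: for a Poisson distribution the variance equals the mean, so counts whose variance is clearly larger than their mean (clustering) are the usual sign that a negative binomial is needed. The rate `lam = 2.0` is an arbitrary illustrative value.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the mean count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 2.0  # illustrative mean count, not taken from the cards
probs = [poisson_pmf(k, lam) for k in range(50)]  # tail beyond 49 is negligible

mean = sum(k * p for k, p in enumerate(probs))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
# For a Poisson distribution, mean and variance are both equal to lam.
```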
when is bootstrapping used?
when you want to estimate a population parameter and get an estimate of its CI
when is bootstrapping used? and what sample size?
when you want to estimate a population parameter and get an estimate of its CI, for a sample size larger than 50
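A minimal sketch of a percentile bootstrap CI, assuming the statistic of interest is the mean. The function name `bootstrap_ci`, the made-up sample of 60 values and the 2000 resamples are illustrative choices, not part of the cards.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for stat(sample)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Resample WITH replacement, same size as the original sample, many times.
    estimates = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_boot))
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

sample = [float(x) for x in range(1, 61)]  # made-up data, n = 60
lower, upper = bootstrap_ci(sample)
```

The interval is read straight off the sorted resampled estimates, so no distributional assumption about the statistic is needed.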
what are contrasts?
allow for testing of pair-wise differences after ANOVA
forward multiple regression
start with no variables, then add the most sig, and then the next most sig
backwards multiple regression
start with all variables remove least sig, then so on
Stepwise multiple regression
start the same as forwards with no variables but at any time can remove non-sig terms
What measure do we use for the balance between the fit of a model and the number of parameters it estimates?
Akaike information criterion (AIC)
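A small worked sketch of the trade-off, using the standard definition AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L the maximised likelihood. The two log-likelihood values are hypothetical, chosen only to show a better fit being outweighed by extra parameters.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical models: the complex one fits slightly better but uses 3 more parameters.
simple_model = aic(log_likelihood=-120.0, n_params=2)
complex_model = aic(log_likelihood=-119.5, n_params=5)
preferred = "simple" if simple_model < complex_model else "complex"
```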
what is logistic regression? and what statistic does it use?
Characterised by a response distribution (binomial) and a link function (the logit) which transforms the mean value to make the relationship more linear
Uses the z statistic
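A minimal sketch of the logit link and its inverse. The linear predictor lives on the logit (log-odds) scale, and the inverse link maps it back to a probability between 0 and 1.

```python
import math

def logit(p):
    """Link function: probability -> log-odds."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Inverse link: log-odds -> probability."""
    return 1 / (1 + math.exp(-eta))

log_odds_even = logit(0.5)         # a 50:50 chance has log-odds of zero
round_trip = inv_logit(logit(0.8))  # applying link then inverse link recovers p
```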
what is deviance?
Deviance is a measure of goodness of fit of a generalized linear model.
Deviance in logistic regression should not be …
residual deviance should not be much more than 2x as large as the residual df
what is a GLM?
The general linear model incorporates a number of different statistical models: ANOVA, ANCOVA, MANOVA, MANCOVA, ordinary linear regression, t-test and F-test. The general linear model is a generalization of the multiple linear regression model to the case of more than one dependent (response) variable
GLM: error is Normal (mean = 0, sd = σ)
3 examples of a generalised linear model
linear regression, logistic regression and Poisson regression.
Generalized LM: error is… lots of possibilities
Skewed data because of too many zeros?
zero inflated data
what tests can you do when you have outliers?
parametric on ranked data, non parametric or permutation tests
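A minimal sketch of a permutation test for a difference in means between two groups. The function name `permutation_test`, the two made-up groups and the 2000 shuffles are illustrative assumptions.

```python
import random
import statistics

def permutation_test(a, b, n_perm=2000, seed=1):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = abs(statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            as_extreme += 1
    # Add 1 to numerator and denominator so the p-value is never exactly zero.
    return (as_extreme + 1) / (n_perm + 1)

group_a = [1, 2, 3, 4, 5]       # made-up data
group_b = [11, 12, 13, 14, 15]  # made-up data
p_value = permutation_test(group_a, group_b)
```

The null distribution is built from the data themselves, so no normality assumption is needed. The same scheme works with medians or ranks in place of means if outliers are a worry.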
what is survival analysis?
analysing the expected duration of time until one or more events happen, e.g. death
what is censoring
when the actual data point isn’t known but you can set boundaries based on what it must have been
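A sketch of how right-censored observations can still contribute to a survival estimate. The cards do not name a method; this uses the Kaplan-Meier product-limit estimator as one standard choice. A censored subject counts as ‘at risk’ up to its censoring time but never counts as a death. The data are made up.

```python
def kaplan_meier(times, observed):
    """Return [(time, survival probability)] at each time where a death occurred.

    observed[i] is True if the event was seen at times[i],
    False if the subject was censored at that time.
    """
    data = sorted(zip(times, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [event for time, event in data if time == t]
        deaths = sum(tied)  # True counts as 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= len(tied)  # deaths and censored subjects both leave the risk set
        i += len(tied)
    return curve

times = [2, 3, 3, 5, 8]                      # made-up follow-up times
observed = [True, True, False, True, False]  # False = censored
curve = kaplan_meier(times, observed)
```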
brackets in R: () {} [] <>
() = enclose the arguments when a function is called
{} = enclose a block of code, e.g. the body of a function
[] = subscript (index) an object
<> = < and > are the less-than and greater-than operators (not a bracket pair)
what is data mining?
lots of candidate predictors but no strong theory to predict which should be important
2 general classes of cluster-based analysis
supervised learning - know the true identity of some clusters and use these to make predictive models for when you don’t know group membership
unsupervised learning - don’t know what’s right or wrong, so try to find natural clustering patterns in the data
k-means clustering is an example of …
unsupervised learning
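A minimal one-dimensional sketch of k-means (Lloyd's algorithm), to show what ‘unsupervised’ means in practice: no group labels are supplied, only the number of clusters and starting centres. The function name `kmeans_1d`, the data and the starting centres are illustrative assumptions.

```python
def kmeans_1d(values, centres, n_iter=20):
    """Alternate between assigning points to the nearest centre and re-averaging."""
    centres = list(centres)
    clusters = [[] for _ in centres]
    for _ in range(n_iter):
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda j: abs(v - centres[j]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        centres = [sum(c) / len(c) if c else centres[j] for j, c in enumerate(clusters)]
    return centres, clusters

values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]  # made-up data with two obvious groups
centres, clusters = kmeans_1d(values, centres=[0.0, 6.0])
```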
types of supervised clustering
DFA, logistic regression, multinomial logistic regression, neural networks, genetic algorithms
minkowski distance
a generalisation of a distance measure, with Euclidean and Manhattan as special cases
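A short sketch of the Minkowski distance with exponent p: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance. The two points are made up.

```python
def minkowski(x, y, p):
    """Minkowski distance between two points of equal dimension."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

point_a = (0.0, 0.0)
point_b = (3.0, 4.0)
manhattan = minkowski(point_a, point_b, p=1)  # sum of absolute differences
euclidean = minkowski(point_a, point_b, p=2)  # straight-line distance
```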
measuring distances between data points is what type of clustering?
hierarchical or agglomerative clustering
what is loess?
a way to run a smoothed (moving) average, but one that is locally weighted
piecewise regressions and splines?
piecewise - IV is partitioned into intervals and a separate line segment is fitted to each interval
splines - fit polynomial sections, then join them up
name all 3 orthogonal contrasts
helmert, difference and polynomial
name 2 non-orthogonal contrasts
treatment and simple
how can we reduce influence of outliers?
use robust statistics
why is stepwise regression dangerous?
automatic, so can miss effects of outliers, non-linearity and non-normality. There is an unseen inflation of false positives, and it can’t use categorical predictors directly, so you have to code them as contrasts first
3 benefits of multilevel modelling
gain df and power, can deal with unbalanced designs, predictors can be of diff levels and all incorporated into one model
4 examples of multivariate approach
PCA, FA, LDA/DFA, MANOVA
What is shrinkage in multilevel modelling?
unit-level estimates are pulled towards the overall mean; it is strongest when data values aren’t any more correlated within levels than between
Two cautions of multilevel approaches?
At each level you are estimating variances, so if n is small at any one level, estimates may be wildly out. Methods were developed for BIG datasets
Different algorithms used in different packages
PCA and its assumptions
what is the PC1 and PC2
PCA can be used for data reduction. It is a simple linear transformation of the data, so it doesn’t depend upon assumptions about the data’s distribution.
It is a rotation of the original axes to create new axes such that:
– The First Principal Component (PC1) is, BY DEFINITION, the single axis that accounts for the most variation in the original variables. This is the axis along which there is the tightest covariation of the original variables.
– The Second Principal Component (PC2) is, BY DEFINITION, the single axis that accounts for the next greatest amount of variation in the original variables, subject to the constraint that it is at right angles to (‘orthogonal’ to, or ‘independent’ of) the first axis.
Components may be interpretable from their coefficients
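A minimal sketch of PCA as a rotation, restricted to two variables so the eigenvalues and axes of the 2x2 covariance matrix can be written in closed form. The function name `pca_2d` and the data are illustrative assumptions; real analyses would use a statistics package.

```python
import math
import statistics

def pca_2d(x, y):
    """Covariance-matrix PCA for two variables: returns (eigenvalues, PC1 axis, PC2 axis)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    n = len(x)
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    # Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = (half_trace + root, half_trace - root)
    # Angle of the axis with the largest variance; PC2 is PC1 rotated by 90 degrees.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))
    return eigenvalues, pc1, pc2

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up, strongly correlated variables
y = [2.1, 3.9, 6.2, 7.8, 10.1]
eigenvalues, pc1, pc2 = pca_2d(x, y)
share_pc1 = eigenvalues[0] / sum(eigenvalues)  # proportion of total variance on PC1
```

The eigenvalues are the variances along the new axes, so their sum equals the total variance of the original variables, and the coefficients in `pc1` and `pc2` are what the cards call the component coefficients.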
PCA on raw data is called
covariance PCA (based on the covariance matrix)
PCA on standardised ‘z scores’ is called
correlation PCA (based on the correlation matrix); z scores have a mean of 0 and an SD of 1
THIS IS THE NORM
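A short sketch of standardising a variable to z scores, which is the step that turns a covariance PCA into a correlation PCA. The data are made up.

```python
import statistics

def z_scores(values):
    """Subtract the mean and divide by the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

heights_cm = [150.0, 160.0, 165.0, 172.0, 181.0]  # made-up data
standardised = z_scores(heights_cm)
```

After standardising, every variable has the same spread, so variables measured in large units no longer dominate the components.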
what graph for a pca to help interpretation?
biplot
how many components is enough in PCA?
Eigenvalue > 1
use screeplot and look for ‘natural break’
Subjective balance between ‘lots of variation captured’ and ‘not too many components’ - harder to justify
Although PCA doesn’t rely on normality of variables, the answers you get will be more robust (and sensible) if …
(i) there are no outliers having a big influence on the correlations between variables, and (ii) the relationships are linear.
Factor analysis
Conceptually similar to PCA, but underpinning logic rather different
Assumes there is a ‘hidden variable’ that drives the observed variables and their relationships
PCA vs FA
5 points
PCA - total variance
FA - shared variance
PCA - unaffected by the no. of components you choose to work with
FA - no. of factors changes the coefficients
PCA - rotation of original variables
FA- creates new axes then rotates these
PCA - data reduction
FA - not data reduction, uncovering hidden variables
PCA - doesn’t rely on normality
FA - does rely on multivariate normality
what is varimax?
a rotation under which factors will have either large or small loadings of any particular variable
Aids interpretability. Most popular
what does it mean if you get sig p in FA?
There is a highly significant difference between the variance captured by the factor and the variation in the original variables. These factors are not enough.
DFA/LDA
Use to find the best (linear) separation between groups, based on multiple dependent (response) variables
Useful for generating predictions about group membership of new items (new data)
what is Wilks’ lambda used in?
MANOVA
what are canonical variates
The ‘best separating dimensions’ in MANOVA
what is the Reverse of LDA?
and what is an alternative of LDA?
MANOVA - reverse
Logistic reg- alternative
Assumptions of LDA/DFA
Sample size of smallest group > number of predictors
Best to have at least 4-5 times as many observations as predictors
Normality of predictors (outliers are fatal, some skew is OK)
Homogeneity of variances & covariances: important (can use z-scores)
if some predictors are highly correlated (multicollinearity) then the analysis may fail or give unreliable results
what is used to test for an association between a set of response variables (y1, y2, y3…) and a set of predictor variables (x1, x2, x3,…)?
Canonical correlation
quadratic discriminant analysis: +s and -s
benefit is better group discrimination, the cost is that there is often ‘over-fitting’ so that while the discrimination works well for the ‘training’ data, it works less well in predicting group membership for new data.
which stat method would you follow with null and saturated models to compare to?
logistic reg
when would you want to change subject to a factor in R (subject
If this was a repeated-measures analysis, with subject as a random effect
what is the order of rows and columns in R?
rows come first and columns second in R, e.g. x[row, column]
if the model fits well then this should be not much larger than two times the residual degrees of freedom … which stat does this refer to?
logistic regression
what does a poor fit in logistic regression imply?
First, the response may not be related to x.
Second, the relationship between the logit and x may be non-linear
Third, there could be other variables affecting the response
what is a null model?
what is saturated model?
x has no effect, so proportions same for all values of x
model fits the data perfectly. The proportions are allowed to vary independently for every value of x. Fit x as factor not covariate
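A minimal sketch of the two reference models for grouped binomial data, using the standard binomial deviance. The null model gives every value of x the same pooled proportion. The saturated model gives each value of x its own observed proportion, so its deviance is zero. The counts are made up and the function name `binomial_deviance` is illustrative.

```python
import math

def binomial_deviance(successes, totals, fitted_p):
    """Deviance of fitted proportions against observed counts (grouped binomial data)."""
    def term(observed, expected):
        # By convention 0 * log(0) is taken as 0.
        return observed * math.log(observed / expected) if observed > 0 else 0.0
    return 2 * sum(
        term(y, n * p) + term(n - y, n * (1 - p))
        for y, n, p in zip(successes, totals, fitted_p)
    )

successes = [2, 5, 9, 14]   # made-up counts at 4 values of x
totals = [20, 20, 20, 20]

null_p = sum(successes) / sum(totals)  # one pooled proportion for all x
null_deviance = binomial_deviance(successes, totals, [null_p] * len(totals))
saturated_deviance = binomial_deviance(
    successes, totals, [y / n for y, n in zip(successes, totals)]
)
# The null model has 4 - 1 = 3 residual df; the 5% chi-squared critical value at 3 df is 7.815.
```

Any fitted model sits between these two, and the drop in deviance from one model to the next is what gets compared against a chi-squared distribution.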
when would you use a chi-squared test over an F-test ANOVA?
when comparing deviances rather than mean squares
what is a loglinear model
a ‘loglinear model’ is fitted in the same way as a logistic regression, just with predictors that are factors, not continuous