Modelling Flashcards

1
Q

Multilevel modelling operates between which 2 extremes?

A

Between data so highly correlated within a unit that effectively n = 1 per unit, and data that are completely uncorrelated.
It works by estimating how much the data need 'pooling' or shrinkage; at the fully correlated extreme this reduces df to 1 for each unit.

2
Q

what are summary measures not good with?

A

unbalanced design

3
Q

what is a binomial distribution?

A

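The distribution of the number of 'successes' in n independent trials, where each trial has two possible outcomes and a constant probability p of success (the two outcomes are equally likely only when p = 0.5)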
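Illustration (not part of the original deck): a minimal Python sketch of the binomial probability mass function. The values n = 5 and p = 0.2 are arbitrary choices for the example.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Illustrative values only: 5 trials, success probability 0.2.
# Setting p = 0.5 would give the special case of equally likely outcomes.
probs = [binomial_pmf(k, 5, 0.2) for k in range(6)]
for k, prob in enumerate(probs):
    print(k, round(prob, 4))
```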

4
Q

what is a binary response?

A

Y/N, absent/present

5
Q

what does a general linear model assume about the distribution?

A

assumes that unexplained variation is normally distributed

6
Q

what does a generalised linear model assume about the distribution?

A

assumes that unexplained variation can follow some other known distribution

7
Q

types of distributions and what they are

A

log-normal = if effects are multiplicative, not additive
exponential = e.g. latency/survival; waiting for an event whose probability of occurring remains constant over time
Weibull = survival with non-constant mortality
Poisson = random, rare, discrete events
negative binomial = clustered discrete events

8
Q

when is bootstrapping used?

A

when you want to estimate a population parameter and get an estimate of its confidence interval (CI)

9
Q

when is bootstrapping used? and what sample size?

A

when you want to estimate a population parameter and get an estimate of its confidence interval (CI), for a sample size larger than 50
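Illustration (not part of the original deck): a minimal sketch of a percentile bootstrap CI for the mean, using only the Python standard library. The function name `bootstrap_ci`, the simulated sample of 60 values, and the choice of 2000 resamples are assumptions made for the example.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for `stat`."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement, same size as the original sample.
    estimates = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_boot))
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Simulated data purely for illustration (n = 60).
data_rng = random.Random(0)
data = [data_rng.gauss(10, 2) for _ in range(60)]
low, high = bootstrap_ci(data)
print(low, statistics.mean(data), high)
```

The percentile method is only one of several bootstrap interval types; it is used here because it is the simplest to write down.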

10
Q

what are contrasts?

A

they allow testing of specific (e.g. pair-wise) differences between group means after ANOVA

11
Q

forward multiple regression

A

start with no variables, then add the most significant, then the next most significant, and so on

12
Q

backwards multiple regression

A

start with all variables, remove the least significant, and so on

13
Q

Stepwise multiple regression

A

starts the same as forward selection, with no variables, but at any step non-significant terms can be removed

14
Q

What measure do we use for the balance between the fit of a model and the number of parameters it estimates?

A

Akaike information criterion (AIC)
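Illustration (not part of the original deck): the standard formula is AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L the maximised likelihood; lower is better. The log-likelihoods below are made-up numbers for two hypothetical models.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical models: the complex one fits slightly better (higher
# log-likelihood) but pays a penalty for its three extra parameters.
simple = aic(log_likelihood=-120.0, n_params=2)
complex_ = aic(log_likelihood=-119.5, n_params=5)
print(simple, complex_)
```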

15
Q

what is logistic regression? and what statistic does it use?

A

Characterised by a response distribution (binomial) and a link function (the logit), which transforms the mean value so that it is linear in the predictors

Uses the z statistic
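Illustration (not part of the original deck): a minimal sketch of the logit link and its inverse. The intercept and slope are hypothetical values, not fitted to any data.

```python
import math


def logit(p: float) -> float:
    """Link function: maps a probability in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))


def inv_logit(eta: float) -> float:
    """Inverse link: maps a linear predictor back to a probability."""
    return 1 / (1 + math.exp(-eta))


# Hypothetical coefficients: the model is linear on the logit scale,
# while the predicted probability stays between 0 and 1.
intercept, slope = -2.0, 0.5
for x in (0, 4, 8):
    eta = intercept + slope * x
    print(x, eta, round(inv_logit(eta), 3))
```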

16
Q

what is deviance?

A

Deviance is a measure of goodness of fit of a generalized linear model.

17
Q

Deviance in logistic regression should not be ...?

A

residual deviance should not be more than about 2x as large as the residual df
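Illustration (not part of the original deck): the rule of thumb above as a small check. The function names and the deviance/df numbers are hypothetical; the threshold of 2 is the deck's own rule of thumb, not a formal test.

```python
def dispersion_ratio(residual_deviance: float, residual_df: int) -> float:
    """Residual deviance divided by residual degrees of freedom."""
    return residual_deviance / residual_df


def looks_poorly_fitted(residual_deviance: float, residual_df: int,
                        threshold: float = 2.0) -> bool:
    """Flag a model whose residual deviance exceeds threshold * df."""
    return dispersion_ratio(residual_deviance, residual_df) > threshold


# Made-up values: 45.0 on 40 df is fine, 130.0 on 40 df is a warning sign.
print(looks_poorly_fitted(45.0, 40))
print(looks_poorly_fitted(130.0, 40))
```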

18
Q

what is a GLM?

A

The general linear model incorporates a number of different statistical models: ANOVA, ANCOVA, MANOVA, MANCOVA, ordinary linear regression, t-test and F-test. It is a generalization of the multiple linear regression model to the case of more than one dependent variable
GLM: error is Normal (mean = 0, sd = σ)

19
Q

3 examples of a generalised linear model

A

linear regression, logistic regression and Poisson regression.
Generalized LM: error is… lots of possibilities

20
Q

Skewed data because of too many zeros?

A

zero inflated data

21
Q

what tests can you do when you have outliers?

A

parametric tests on ranked data, non-parametric tests, or permutation tests
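Illustration (not part of the original deck): a minimal sketch of a two-sided permutation test for a difference in means. The function name, the tiny data set (with one deliberate outlier) and the 5000 random permutations are assumptions for the example.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(a, b, n_perm=5000, seed=1):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign group labels
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            hits += 1
    # Add 1 to numerator and denominator so the p-value is never exactly 0.
    return (hits + 1) / (n_perm + 1)


group_a = [1, 2, 3, 4, 5]
group_b = [11, 12, 13, 14, 100]  # 100 is a deliberate outlier
p_value = permutation_test(group_a, group_b)
print(p_value)
```

No distributional assumption is made: the reference distribution comes from reshuffling the observed values themselves.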

22
Q

what is survival analysis?

A

analysing the expected duration of time until one or more events happen, e.g. death

23
Q

what is censoring?

A

when the actual data point isn't known but you can set boundaries based on what it must have been

24
Q
brackets in R 
() 
{}
[]
<>
A
() = used to enclose the arguments when calling a function
{} = used to enclose the body of a function (or any other block of code)
[] = used to subscript an object
<> = not brackets in R; < and > are the less-than and greater-than comparison operators
25
what is data mining?
lots of candidate predictors but no strong theory to predict which should be important
26
2 general classes of cluster-based analysis
supervised learning - you know the true identity of some clusters and use these to build predictive models for cases where you don't know group membership
unsupervised learning - you don't know what's right or wrong, so you try to find natural clustering patterns in the data
27
k-means clustering is an example of ...
unsupervised learning
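Illustration (not part of the original deck): a minimal one-dimensional k-means sketch (assign each point to its nearest centroid, recompute centroids, repeat). The data points and starting centroids are made up; real implementations work in many dimensions and choose starting centroids more carefully.

```python
def kmeans_1d(points, centroids, n_iter=20):
    """Plain k-means on a list of numbers, starting from given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Update step: each centroid moves to the mean of its cluster.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids, clusters


# No group labels are supplied: the algorithm finds the grouping itself.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```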
28
types of supervised clustering
DFA, logistic regression, multinomial logistic regression, neural networks, genetic algorithms
29
minkowski distance
a generalisation of a distance measure, with Euclidean and Manhattan distance as special cases
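Illustration (not part of the original deck): a minimal sketch of the Minkowski distance with exponent p, where p = 1 gives Manhattan distance and p = 2 gives Euclidean distance. The two points are arbitrary.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


origin, point = (0, 0), (3, 4)
manhattan = minkowski(origin, point, 1)  # |3| + |4|
euclidean = minkowski(origin, point, 2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```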
30
measuring distances between data points is what type of clustering?
hierarchical or agglomerative clustering
31
what is loess?
a way to compute a smoothed running average, but weighted: points nearer the target x value get more weight (locally weighted regression)
32
piecewise regressions and splines?
piecewise - the IV is partitioned into intervals and a separate line segment is fitted to each interval
splines - fit polynomial sections, then join them up
33
name all 3 orthogonal contrasts
Helmert, difference and polynomial
34
name 2 non-orthogonal contrasts
treatment and simple
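Illustration (not part of the original deck): a minimal check of orthogonality for the two cards above, assuming a balanced design with 3 groups. A set of contrasts is treated as orthogonal here when every contrast's coefficients sum to zero and every pair has a zero dot product. The coefficient vectors are written in the style of Helmert and treatment coding for 3 levels.

```python
from itertools import combinations


def is_orthogonal_set(contrasts):
    """True if all contrasts sum to zero and are pairwise orthogonal.
    Assumes equal group sizes."""
    sums_to_zero = all(sum(c) == 0 for c in contrasts)
    pairwise_zero = all(
        sum(x * y for x, y in zip(a, b)) == 0
        for a, b in combinations(contrasts, 2)
    )
    return sums_to_zero and pairwise_zero


helmert = [(-1, 1, 0), (-1, -1, 2)]   # Helmert-style coding, 3 groups
treatment = [(0, 1, 0), (0, 0, 1)]    # treatment-style coding, 3 groups

print(is_orthogonal_set(helmert))
print(is_orthogonal_set(treatment))
```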
35
how can we reduce influence of outliers?
use robust statistics
36
why is stepwise regression dangerous?
it is automatic, so it can miss the effects of outliers, non-linearity and non-normality. There is an unseen inflation of false positives, and it can't use categorical predictors directly, so you have to code them as contrasts first
37
3 benefits of multilevel modelling
gain df and power; can deal with unbalanced designs; predictors can be at different levels and all incorporated into one model
38
4 examples of multivariate approach
PCA, FA, LDA/DFA, MANOVA
39
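What is shrinkage in multilevel modelling?
unit-level estimates are pulled ('shrunk') towards the overall mean; shrinkage is strongest when data values aren't any more correlated within levels than between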
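Illustration (not part of the original deck): a minimal sketch of how a group mean can be shrunk towards the grand mean in a simple random-intercept setting. The variance components are treated as known inputs here (in practice they are estimated), and all numbers are hypothetical.

```python
def shrunk_mean(group_mean, n, grand_mean, between_var, within_var):
    """Partially pooled estimate of a group mean.
    weight -> 0: complete pooling (estimate equals the grand mean)
    weight -> 1: no pooling (estimate equals the raw group mean)"""
    weight = between_var / (between_var + within_var / n)
    return grand_mean + weight * (group_mean - grand_mean)


# Hypothetical group: raw mean 14 from n = 2 observations, grand mean 10.
# With no between-group variance, the group is shrunk fully to 10.
print(shrunk_mean(14, 2, 10, between_var=0.0, within_var=4.0))
# With some between-group variance, it is shrunk only part of the way.
print(shrunk_mean(14, 2, 10, between_var=2.0, within_var=4.0))
```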
40
Two cautions of multilevel approaches?
At each level you are estimating variances, so if n is small at any one level, estimates may be wildly out (the methods were developed for BIG datasets). Different algorithms are used in different packages.
41
PCA and its assumptions | what are PC1 and PC2?
PCA can be used for data reduction. It is a simple linear transformation of the data, so it doesn’t depend upon assumptions about the data’s distribution. It is a rotation of the original axes to create new axes such that:
-- The First Principal Component (PC1) is, BY DEFINITION, the single axis that accounts for most of the variation in the original variables. This is the axis along which there is the tightest covariation of the original variables.
-- The Second Principal Component (PC2) is, BY DEFINITION, the NEXT single axis, accounting for the 2nd greatest amount of variation in the original variables, subject to the constraint that it is at right angles to (‘orthogonal to’ or ‘independent of’) the first axis.
Components may be interpretable from their coefficients.
42
PCA on raw data is called
covariance PCA (PCA on the covariance matrix)
43
PCA on standardised 'z scores' is called
correlation PCA; a z score has a mean of 0 and an SD of 1 | THIS IS THE NORM
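Illustration (not part of the original deck): a minimal sketch of standardising a variable to z scores, using the sample standard deviation. The values are arbitrary.

```python
import statistics


def z_scores(values):
    """Centre on the mean and scale by the sample standard deviation."""
    centre = statistics.mean(values)
    spread = statistics.stdev(values)
    return [(v - centre) / spread for v in values]


raw = [2, 4, 4, 4, 5, 5, 7, 9]
standardised = z_scores(raw)
# After standardising, the mean is 0 and the SD is 1, so every variable
# contributes on the same scale regardless of its original units.
print(statistics.mean(standardised), statistics.stdev(standardised))
```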
44
what graph for a pca to help interpretation?
biplot
45
how many components is enough in PCA?
Eigenvalue > 1; use a screeplot and look for a ‘natural break’. A subjective balance between ‘lots of variation captured’ and ‘not too many components’ is harder to justify
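Illustration (not part of the original deck): a minimal sketch of the 'eigenvalue > 1' rule. The eigenvalues are made up, chosen to sum to 5 as they would for a correlation-matrix PCA on 5 variables; the function name is hypothetical.

```python
def retain_by_eigenvalue(eigenvalues, cutoff=1.0):
    """Return how many components exceed the cutoff and the
    proportion of total variance those components capture."""
    total = sum(eigenvalues)
    kept = [ev for ev in sorted(eigenvalues, reverse=True) if ev > cutoff]
    return len(kept), sum(kept) / total


# Hypothetical eigenvalues from a PCA on 5 standardised variables.
eigenvalues = [2.5, 1.5, 0.5, 0.3, 0.2]
n_kept, proportion = retain_by_eigenvalue(eigenvalues)
print(n_kept, proportion)
```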
46
Although PCA doesn't rely on normality of variables, the answers you get will be more robust (and sensible) if ...
(i) there are no outliers having a big influence on the correlations between variables, and (ii) the relationships are linear.
47
Factor analysis
Conceptually similar to PCA, but the underpinning logic is rather different. Assumes there is a ‘hidden variable’ that drives the observed variables and their relationships
48
PCA vs FA | 5 points
PCA - total variance; FA - shared variance
PCA - unaffected by the number of components you choose to work with; FA - the number of factors changes the coefficients
PCA - rotation of the original variables; FA - creates new axes, then rotates these
PCA - data reduction; FA - not data reduction, but uncovering hidden variables
PCA - doesn't rely on normality; FA - does rely on multivariate normality
49
what is varimax?
a rotation under which factors have either large or small loadings on any particular variable. Aids interpretability. Most popular rotation
50
what does it mean if you get sig p in FA?
There is a highly significant difference between the variance captured by the factor and the variation in the original variables. These factors are not enough.
51
DFA/LDA
Used to find the best (linear) separation between groups, based on multiple dependent (response) variables. Useful for generating predictions about group membership of new items (new data)
52
what is Wilks' lambda used in?
MANOVA
53
what are canonical variates
The 'best separating dimensions' in MANOVA
54
what is the Reverse of LDA? | and what is an alternative of LDA?
MANOVA - reverse | Logistic regression - alternative
55
Assumptions of LDA/DFA
Sample size of smallest group > number of predictors
Best to have at least 4-5 times as many observations as predictors
Normality of predictors (outliers are fatal, some skew is OK)
Homogeneity of variances & covariances: important (can use z-scores)
If some predictors are highly correlated (multicollinearity) then the analysis may fail or give unreliable results
56
what is used to test for an association between a set of response variables (y1, y2, y3, ...) and a set of predictor variables (x1, x2, x3, ...)?
Canonical correlation
57
quadratic discriminant analysis: +'s and -'s
benefit is better group discrimination, the cost is that there is often 'over-fitting' so that while the discrimination works well for the 'training' data, it works less well in predicting group membership for new data.
58
which stat method would you follow with null and saturated models to compare to?
logistic regression
59
when would you want to change subject to a factor in R (subject
If this was a repeated-measures analysis, with subject as a random effect
60
what is the order of columns and rows in R?
rows come first and columns second in R
61
if the model fits well then this should be not much larger than two times the residual degrees of freedom ... which stat does this refer to?
logistic regression
62
what does a poor fit in logistic regression imply?
First, the response may not be related to x. Second, the relationship between the logit and x may be non-linear. Third, there could be other variables affecting the response.
63
what is a null model? | what is a saturated model?
Null model: x has no effect, so the proportions are the same for all values of x. | Saturated model: the model fits the data perfectly. The proportions are allowed to vary independently for every value of x (fit x as a factor, not a covariate).
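Illustration (not part of the original deck): a minimal sketch of the deviance of the null model relative to the saturated model for grouped binomial data. The saturated model reproduces each observed proportion exactly, so its own deviance is 0. The counts below are made up, and the function names are hypothetical.

```python
import math


def _term(observed, fitted):
    """observed * ln(observed / fitted), taken as 0 when observed is 0."""
    return observed * math.log(observed / fitted) if observed > 0 else 0.0


def null_deviance(successes, trials):
    """Binomial deviance of the null model (one common proportion for
    every value of x), measured against the saturated model."""
    p_null = sum(successes) / sum(trials)
    total = 0.0
    for y, n in zip(successes, trials):
        total += _term(y, n * p_null) + _term(n - y, n * (1 - p_null))
    return 2 * total


# Made-up data: successes out of 10 trials at three values of x.
successes, trials = [2, 5, 8], [10, 10, 10]
deviance = null_deviance(successes, trials)
print(round(deviance, 3))
```

A fitted model's residual deviance lies between these two reference points: 0 for the saturated model and the null deviance for the model with no effect of x.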
64
when would you use a chi-squared test rather than an F test in ANOVA?
when comparing deviances rather than mean squares
65
what is a loglinear model?
a ‘loglinear model’ is fitted in the same way as a logistic regression, just with predictors that are factors, not continuous