Modelling Flashcards

1
Q

multilevel modelling operates between which 2 extremes?

A

data are so highly correlated within units that effectively n=1 per unit, versus data are completely uncorrelated
works by estimating how much ‘pooling’ or shrinkage the data need; at the fully pooled extreme df reduces to 1 for each unit

2
Q

what are summary measures not good with?

A

unbalanced design

3
Q

what is a binomial distribution?

A

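counts of successes in a fixed number of independent trials, each with two possible outcomes and a constant probability of success (the two outcomes need not be equally likely)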
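A minimal sketch of the binomial probabilities, using only the Python standard library. The function name `binomial_pmf` and the two example probabilities are illustrative assumptions, not part of the card; note the outcomes do not have to be equally likely.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Four trials with a fair outcome (p = 0.5) and a biased one (p = 0.2).
fair = [binomial_pmf(k, 4, 0.5) for k in range(5)]
biased = [binomial_pmf(k, 4, 0.2) for k in range(5)]
```

Each list holds the probability of 0, 1, 2, 3 and 4 successes, so each sums to 1.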

4
Q

what is a binary response?

A

Y/N, absent/present

5
Q

what does a general linear model assume about the distribution?

A

assumes that unexplained variation is normally distributed

6
Q

what does a generalised linear model assume about the distribution?

A

assumes that unexplained variation can follow some other known distribution

7
Q

types of distributions and what they are

A

log-normal = if effects are multiplicative not additive
exponential = e.g. latency/survival, probability of the event remains constant while waiting for it
Weibull = survival with non-constant mortality
Poisson = random, rare, discrete events
negative binomial = clustered discrete events

8
Q

when is bootstrapping used?

A

when you want to estimate a population parameter and get an estimate of its CI
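A rough sketch of a percentile bootstrap CI for a mean, assuming simulated toy data (60 values drawn from a normal distribution). The function name `bootstrap_ci`, the number of resamples and the seeds are all illustrative choices.

```python
import random
from statistics import mean

def bootstrap_ci(sample, stat=mean, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement, recompute the statistic, then sort.
    stats = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_boot))
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Simulated toy data: 60 draws from a normal distribution (an assumption).
toy_rng = random.Random(0)
data = [toy_rng.gauss(5.0, 1.0) for _ in range(60)]
low, high = bootstrap_ci(data)
```

The interval comes from the middle 95% of the resampled means, so no formula for the standard error is needed.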

9
Q

when is bootstrapping used? and what sample size?

A

when you want to estimate a population parameter and get an estimate of its CI, for a sample size larger than 50

10
Q

what are contrasts?

A

allow for testing of pair-wise differences after ANOVA

11
Q

forward multiple regression

A

start with no variables, then add the most sig, then the next most sig

12
Q

backwards multiple regression

A

start with all variables, remove the least sig, and so on

13
Q

Stepwise multiple regression

A

start the same as forward, with no variables, but non-sig terms can be removed at any time

14
Q

What measure do we use for the balance between the fit of a model and the no. of parameters it estimates?

A

Akaike information criterion (AIC)
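As a quick illustration, the criterion is usually written as 2k minus twice the log-likelihood, where k is the number of estimated parameters; lower values are better. The log-likelihood values below are hypothetical, chosen only to show the trade-off.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fits: the bigger model fits slightly better but pays
# a penalty for its three extra parameters.
small_model = aic(log_likelihood=-120.0, n_params=2)
big_model = aic(log_likelihood=-119.5, n_params=5)
preferred = "small" if small_model < big_model else "big"
```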

15
Q

what is logistic regression? and what statistic does it use?

A

Characterised by a response distribution (binomial) and a link function (the logit) which transforms the mean value so that it is linear in the predictors

Uses z statistic
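A small sketch of the logit link and its inverse. The intercept and slope are made-up coefficients, not estimates from any real data.

```python
import math

def logit(p: float) -> float:
    """Link function: maps a probability in (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))

def inv_logit(eta: float) -> float:
    """Inverse link: maps a linear predictor back to a probability."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical coefficients on the logit scale.
intercept, slope = -2.0, 0.8
probs = [inv_logit(intercept + slope * x) for x in range(6)]
```

The model is a straight line on the logit scale, while the fitted probabilities follow an S-shaped curve that stays between 0 and 1.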

16
Q

what is deviance?

A

Deviance is a measure of goodness of fit of a generalized linear model.

17
Q

Deviance in logistic regression should not be

A

residual deviance should not be 2x as large as df
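A sketch of this rule of thumb for grouped binomial data, under stated assumptions: the counts, the fitted proportions and the number of parameters are all invented for illustration, and `grouped_binomial_deviance` is a hypothetical helper rather than a library function.

```python
import math

def grouped_binomial_deviance(successes, trials, fitted):
    """Residual deviance for grouped binomial data given fitted proportions."""
    def term(observed, expected):
        # By convention 0 * log(0) counts as 0.
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    total = 0.0
    for y, n, p in zip(successes, trials, fitted):
        mu = n * p  # expected number of successes in this group
        total += term(y, mu) + term(n - y, n - mu)
    return 2 * total

successes = [2, 5, 9, 14]          # invented counts
trials = [20, 20, 20, 20]
fitted = [0.12, 0.22, 0.48, 0.68]  # invented fitted proportions
n_params = 2                       # e.g. intercept and slope

deviance = grouped_binomial_deviance(successes, trials, fitted)
residual_df = len(successes) - n_params
fits_acceptably = deviance < 2 * residual_df
```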

18
Q

what is a GLM?

A

The general linear model incorporates a number of different statistical models: ANOVA, ANCOVA, MANOVA, MANCOVA, ordinary linear regression, t-test and F-test. The general linear model is a generalization of the multiple linear regression model to the case of more than one dependent variable
GLM: error is Normal (mean = 0, sd = σ)

19
Q

3 examples of a generalised linear model

A

linear regression, logistic regression and Poisson regression.
Generalized LM: error is… lots of possibilities

20
Q

Skewed data because of too many zeros?

A

zero inflated data

21
Q

what tests can you do when you have outliers?

A

parametric tests on ranked data, non-parametric tests or permutation tests

22
Q

what is survival analysis?

A

analysing expected duration of time until one or more events happen e.g. death

23
Q

what is censoring

A

when the actual data point isn’t known but you can set boundaries based on what it must have been

24
Q
brackets in R 
() 
{}
[]
<>
A
() = used to bound an object (the arguments) during execution of a function
{} = used to bound the body of a function when creating it
[] = used to subscript an object
<> = denote greater than or less than
25
Q

what is data mining?

A

lots of candidate predictors but no strong theory to predict which should be important

26
Q

2 general classes of cluster-based analysis

A

supervised learning - know the true identity of some clusters and use these to build predictive models for cases where you don’t know group membership

unsupervised learning - don’t know what’s right or wrong so try to find natural clustering patterns in data

27
Q

k-means clustering is an example of …

A

unsupervised learning
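A toy sketch of k-means on one-dimensional data, to show why no group labels are needed. The function name `kmeans_1d`, the data and the starting centres are illustrative assumptions; real analyses would normally use a library implementation.

```python
def kmeans_1d(values, centres, n_iter=10):
    """Very small k-means: assign each value to its nearest centre, then update."""
    clusters = [[] for _ in centres]
    for _ in range(n_iter):
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        centres = [
            sum(c) / len(c) if c else centres[i]
            for i, c in enumerate(clusters)
        ]
    return centres, clusters

values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]  # made-up data with two obvious groups
centres, clusters = kmeans_1d(values, centres=[0.0, 6.0])
```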

28
Q

types of supervised clustering

A

DFA, logistic regression, multinomial logistic regression, neural networks, genetic algorithms

29
Q

minkowski distance

A

a generalisation of a distance measure with Euclidean and Manhattan as special cases
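A short sketch of the distance, with the exponent written as `p` (a naming assumption): p = 1 gives the Manhattan distance and p = 2 the Euclidean distance.

```python
def minkowski(a, b, p: float) -> float:
    """Minkowski distance between two points of equal dimension."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(a, b, 1)  # sum of absolute differences
euclidean = minkowski(a, b, 2)  # straight-line distance
```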

30
Q

measuring between data points is what type of clustering?

A

hierarchical or agglomerative clustering

31
Q

what is loess?

A

a way to run a smooth average, but it is (locally) weighted

32
Q

piecewise regressions and splines?

A

piecewise - IV is partitioned into intervals and a separate line segment is fit to each interval

splines - fits polynomial sections then joins them up

33
Q

name all 3 orthogonal contrasts

A

helmert, difference and polynomial
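A sketch of what orthogonal means here, assuming three groups of equal size: each contrast's weights sum to zero, and the products of weights for any pair of contrasts also sum to zero. The weight sets below are common textbook codings for three groups; the third set (each group against the first) is included only as a counter-example.

```python
from itertools import combinations

def is_orthogonal_set(contrasts) -> bool:
    """True if every contrast sums to zero and every pair has a zero dot product."""
    sums_to_zero = all(sum(c) == 0 for c in contrasts)
    pairwise_zero = all(
        sum(w1 * w2 for w1, w2 in zip(c1, c2)) == 0
        for c1, c2 in combinations(contrasts, 2)
    )
    return sums_to_zero and pairwise_zero

helmert = [(-1, 1, 0), (-1, -1, 2)]
polynomial = [(-1, 0, 1), (1, -2, 1)]     # linear and quadratic trends
each_vs_first = [(-1, 1, 0), (-1, 0, 1)]  # overlapping comparisons
```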

34
Q

name 2 non-orthogonal contrasts

A

treatment and simple

35
Q

how can we reduce influence of outliers?

A

use robust statistics

36
Q

why is stepwise regression dangerous?

A

automatic so can miss effects of outliers, non-linearity and non-normality. there is an unseen inflation of false positives, and it can’t use categorical variables directly so you have to code them as contrasts first

37
Q

3 benefits of multilevel modelling

A

gain df and power, can deal with unbalanced designs, predictors can be at different levels and all incorporated into one model

38
Q

4 examples of multivariate approach

A

PCA, FA, LDA/DFA, MANOVA

39
Q

What is shrinkage in multilevel modelling?

A

when data values aren’t any more correlated within levels than between, so estimates for each unit are pulled towards the overall mean

40
Q

Two cautions of multilevel approaches?

A

At each level you are estimating variances, so if n is small at any one level, estimates may be wildly out. Methods were developed for BIG datasets

Different algorithms used in different packages

41
Q

PCA and its assumptions

what is the PC1 and PC2

A

PCA can be used for data reduction. It is a simple linear transformation of robust data, so doesn’t depend upon assumptions about the data’s distribution.

It is a rotation of the original axes to create new axes such that:
– The First Principal Component (PC1) is, BY DEFINITION, the single axis that accounts for most of the variation in the original variables. This is the axis along which there is the tightest covariation of the original variables.
– The Second Principal Component (PC2) is, BY DEFINITION, the NEXT single axis that accounts for the 2nd greatest amount of variation in the original variables, subject to the constraint that it is at right angles to (‘orthogonal’ to or ‘independent’ of) the first axis.

Components may be interpretable from their coefficients
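A sketch of PC1 and PC2 for just two variables, where the rotation can be worked out in closed form from the covariance matrix. The data are a made-up toy example and `pca_2d` is a hypothetical helper; with more variables you would use a linear algebra library instead.

```python
import math
from statistics import mean

def pca_2d(xs, ys):
    """PCA of two variables from the eigen-decomposition of their covariance matrix."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of [[sxx, sxy], [sxy, syy]] are the variances along PC1 and PC2.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    variances = (half_trace + root, half_trace - root)
    # PC1 is the original x axis rotated by this angle; PC2 is at right angles to it.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(angle), math.sin(angle))
    pc2 = (-math.sin(angle), math.cos(angle))
    return variances, pc1, pc2, sxx + syy

xs = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]  # toy data
ys = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]
variances, pc1, pc2, total_variance = pca_2d(xs, ys)
share_pc1 = variances[0] / total_variance
```

The two component variances add up to the total variance of the original variables, which is why PCA is described as a rotation rather than a model.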

42
Q

PCA on raw data is called

A

covariance

43
Q

PCA on standardised ‘z scores’

A

correlation, z score is mean of 0 and SD of 1

THIS IS THE NORM
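A brief sketch of standardising a variable to z scores before PCA, so that each variable has mean 0 and SD 1. The heights are made-up values.

```python
from statistics import mean, stdev

def z_scores(values):
    """Centre on the mean and scale by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # invented measurements
z = z_scores(heights_cm)
```

Standardising stops variables measured in large units from dominating the components.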

44
Q

what graph for a pca to help interpretation?

A

biplot

45
Q

how many components is enough in PCA?

A

Eigenvalue > 1,
use screeplot and look for ‘natural break’
Subjective balance between ‘lots of variation captured’ and ‘not too many components’ - harder to justify
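A sketch of the eigenvalue > 1 rule, assuming a PCA on the correlation matrix of five variables (so the eigenvalues sum to 5). The eigenvalues are hypothetical.

```python
def components_to_keep(eigenvalues):
    """Return the (1-based) component numbers whose eigenvalue exceeds 1."""
    return [i + 1 for i, ev in enumerate(eigenvalues) if ev > 1.0]

eigenvalues = [2.9, 1.3, 0.5, 0.2, 0.1]  # hypothetical, sorted largest first
kept = components_to_keep(eigenvalues)
proportion_captured = sum(eigenvalues[: len(kept)]) / sum(eigenvalues)
```

Here two components pass the rule, and the proportion of variation they capture can then be weighed against the screeplot.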

46
Q

Although PCA doesn’t rely on normality of variables, the answers you get will be more robust (and sensible) if …

A

(i) there are no outliers having a big influence on the correlations between variables, and (ii) the relationships are linear.

47
Q

Factor analysis

A

Conceptually similar to PCA, but underpinning logic rather different
Assumes there is a ‘hidden variable’ that drives the observed variables and their relationships

48
Q

PCA vs FA

5 points

A

PCA - total variance
FA - shared variance

PCA - unaffected by the no. of components you choose to work with
FA - no. of factors changes coefficients

PCA - rotation of original variables
FA - creates new axes then rotates these

PCA - data reduction
FA - not data reduction, uncovering hidden variables

PCA - doesn’t rely on normality
FA - does rely on multivariate normality

49
Q

what is varimax?

A

factors will have either large or small loadings of any particular variable
Aids interpretability. Most popular

50
Q

what does it mean if you get sig p in FA?

A

There is a highly significant difference between the variance captured by the factor and the variation in the original variables. These factors are not enough.

51
Q

DFA/LDA

A

Use to find the best (linear) separation between groups, based on multiple dependent (response) variables

Useful for generating predictions about group membership of new items (new data)

52
Q

what is Wilks’ lambda used in?

A

MANOVA

53
Q

what are canonical variates

A

The ‘best separating dimensions’ in MANOVA

54
Q

what is the Reverse of LDA?

and what is an alternative of LDA?

A

MANOVA - reverse

Logistic reg- alternative

55
Q

Assumptions of LDA/DFA

A

Sample size of smallest group > number of predictors
Best to have at least 4-5 times as many observations as predictors
Normality of predictors (outliers are fatal, some skew is OK)
Homogeneity of variances & covariances: important (can use z-scores)
if some predictors highly correlated (multicolinearity) then analysis may fail or give unreliable results

56
Q

what is used to test for an association between a set of response variables (y1, y2, y3…) and a set of predictor variables (x1, x2, x3,…)

A

Canonical correlation

57
Q

quadratic discriminant analysis +’s and -’s

A

benefit is better group discrimination, the cost is that there is often ‘over-fitting’ so that while the discrimination works well for the ‘training’ data, it works less well in predicting group membership for new data.

58
Q

which stat method would you follow with null and saturated models to compare to?

A

logistic reg

59
Q

when would you want to change subject to a factor in R (subject

A

If this was a repeated-measures analysis, with subject as a random effect

60
Q

what is the order of columns and rows in R?

A

rows come first and columns second in R

61
Q

if the model fits well then this should be not much larger than two times the residual degrees of freedom … which stat does this refer to?

A

logistic regression

62
Q

what does a poor fit in logistic regression imply?

A

First, the response may not be related to x.
Second, the relationship between the logit and x may be non-linear
Third, there could be other variables affecting the response

63
Q

what is a null model?

what is saturated model?

A

x has no effect, so proportions same for all values of x

model fits the data perfectly. The proportions are allowed to vary independently for every value of x. Fit x as factor not covariate

64
Q

when would you use a chi-sq test over an F test in ANOVA?

A

when comparing deviances rather than mean squares
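A sketch of comparing two nested models by their change in deviance, assuming the models differ by one parameter so the change is referred to a chi-squared distribution with 1 df. The deviances are invented, and `chi2_sf_df1` is a hypothetical helper that only handles the 1 df case.

```python
import math

def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-squared distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2))

# Invented deviances for a null model and a model with one extra term.
null_deviance, model_deviance = 27.7, 21.4
change_in_deviance = null_deviance - model_deviance
p_value = chi2_sf_df1(change_in_deviance)
```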

65
Q

what is a loglinear model

A

a ‘loglinear model’ is fitted in the same way as a logistic regression, just with predictors that are factors not continuous