Modelling Flashcards

Question

What is data mining?

Answer 1

lots of candidate predictors but no strong theory to predict which should be important

Answer 2

supervised learning - know the true identity of some clusters and use these to build predictive models for cases where you don't know group membership
unsupervised learning - don't know what's right or wrong, so try to find natural clustering patterns in the data

Answer 3

unsupervised learning

Answer 4

DFA, logistic regression, multinomial logistic regression, neural networks, genetic algorithms

Answer 5

a generalisation of a distance measure, with Euclidean and Manhattan as special cases
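The question for this card is not shown; assuming it refers to the Minkowski distance, the two special cases can be sketched as below. The function name `minkowski` and the example points are illustrative assumptions, not part of the original card.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

# Illustrative points (assumed, not from the card)
a, b = (0.0, 0.0), (3.0, 4.0)

manhattan = minkowski(a, b, 1)   # p = 1: sum of absolute differences
euclidean = minkowski(a, b, 2)   # p = 2: straight-line distance
print(manhattan, euclidean)
```

Changing only `p` moves between the two familiar distances, which is the sense in which one measure generalises both.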

Answer 6

hierarchical or agglomerative clustering

Answer 7

a way to run a smoothed average, but weighted
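A minimal sketch of a weighted running average, assuming a centred window in which nearer points get larger weights. The weights `[1, 2, 1]` and the data are made up for illustration; the smoother the card actually refers to may use a different weighting scheme.

```python
def weighted_moving_average(y, weights):
    """Smooth y with a centred window; weights must have odd length."""
    if len(weights) % 2 == 0:
        raise ValueError("use an odd number of weights so the window is centred")
    half = len(weights) // 2
    total = sum(weights)
    smoothed = []
    # Only positions with a full window on both sides are smoothed
    for i in range(half, len(y) - half):
        window = y[i - half:i + half + 1]
        smoothed.append(sum(w * v for w, v in zip(weights, window)) / total)
    return smoothed

y = [1.0, 2.0, 6.0, 2.0, 1.0]          # made-up data with a spike in the middle
result = weighted_moving_average(y, [1, 2, 1])
print(result)
```

The central value in each window counts twice as much as its neighbours, so the spike is damped but not flattened.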

Answer 8

piecewise - IV is partitioned into intervals and a separate line segment is fit to each interval
splines - fits polynomial sections then joins them up
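A rough sketch of the piecewise idea, assuming a single known breakpoint and ordinary least squares within each interval. The helper names, the breakpoint and the data are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one segment."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def piecewise_fit(xs, ys, breakpoint):
    """Fit separate lines to the points at or below, and above, the breakpoint."""
    left = [(x, y) for x, y in zip(xs, ys) if x <= breakpoint]
    right = [(x, y) for x, y in zip(xs, ys) if x > breakpoint]
    return fit_line(*zip(*left)), fit_line(*zip(*right))

# Made-up data: rises with slope 1 up to x = 3, then falls with slope -2
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 1, 2, 3, 3, 1, -1, -3]

(left_slope, left_icpt), (right_slope, right_icpt) = piecewise_fit(xs, ys, 3)
print(left_slope, left_icpt, right_slope, right_icpt)
```

Nothing here forces the two segments to meet at the breakpoint; a spline adds that joining constraint.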

Answer 9

Helmert, difference and polynomial

Answer 10

treatment and simple

Answer 11

use robust statistics

Answer 12

automatic, so can miss effects of outliers, non-linearity and non-normality. There is an unseen inflation of false positives, and it can't use categorical predictors, so you have to code them as contrasts first

Answer 13

gain df and power, can deal with unbalanced designs, predictors can be of different levels and are all incorporated into one model

Answer 14

PCA, FA, LDA/DFA, MANOVA

Answer 15

when data values aren't any more correlated within levels than between them

Answer 16

At each level you are estimating variances, so if n is small at any one level, estimates may be wildly out. Methods were developed for BIG datasets. Different algorithms are used in different packages.

Answer 17

PCA can be used for data reduction. It is a simple linear transformation of the data, so it is robust and doesn’t depend upon assumptions about the data’s distribution. It is a rotation of the original axes to create new axes such that:
-- The First Principal Component (PC1) is, BY DEFINITION, the single axis that accounts for most of the variation in the original variables. This is the axis along which there is the tightest covariation of the original variables.
-- The Second Principal Component (PC2) is, BY DEFINITION, the single axis that accounts for the next greatest amount of variation in the original variables, subject to the constraint that it is at right angles to (‘orthogonal’ to, or ‘independent’ of) the first axis.
Components may be interpretable from their coefficients.
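The rotation idea can be sketched for the two-variable case, where the angle of PC1 has a closed form. This is an illustration under that assumption, not a general PCA routine; the function names and data are made up.

```python
import math

def mean(v):
    return sum(v) / len(v)

def covariance(u, v):
    """Sample covariance; covariance(v, v) is the sample variance."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def pca_2d(xs, ys):
    """Rotate two centred variables onto their principal axes."""
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    # Angle of PC1 relative to the original x axis (closed form for 2 variables)
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(angle), math.sin(angle)
    mx, my = mean(xs), mean(ys)
    pc1 = [(x - mx) * c + (y - my) * s for x, y in zip(xs, ys)]
    pc2 = [-(x - mx) * s + (y - my) * c for x, y in zip(xs, ys)]
    return pc1, pc2

xs = [2.0, 4.0, 6.0, 8.0, 10.0]   # made-up illustrative data
ys = [1.0, 3.0, 2.0, 5.0, 4.0]
pc1, pc2 = pca_2d(xs, ys)

total_before = covariance(xs, xs) + covariance(ys, ys)
total_after = covariance(pc1, pc1) + covariance(pc2, pc2)
print(total_before, total_after)   # total variance is preserved by the rotation
print(covariance(pc1, pc2))        # close to zero: the new axes are uncorrelated
```

PC1 carries at least as much variance as either original variable, and PC2 picks up whatever is left at right angles to it.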

Answer 18

covariance

Answer 19

correlation; z-scores have a mean of 0 and an SD of 1 | THIS IS THE NORM
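A small sketch of why the covariance/correlation choice matters, assuming these cards concern which matrix a PCA is run on. Covariance changes with the units of measurement, while the covariance of z-scores (the correlation) does not. The data and variable names are made up.

```python
import math

def mean(v):
    return sum(v) / len(v)

def sd(v):
    """Sample standard deviation."""
    m = mean(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))

def z_scores(v):
    """Rescale to mean 0 and SD 1."""
    m, s = mean(v), sd(v)
    return [(a - m) / s for a in v]

def covariance(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]   # made-up data
weight_kg = [52.0, 58.0, 69.0, 75.0, 91.0]
height_m = [h / 100 for h in height_cm]            # same heights, different units

cov_cm = covariance(height_cm, weight_kg)
cov_m = covariance(height_m, weight_kg)
corr_cm = covariance(z_scores(height_cm), z_scores(weight_kg))
corr_m = covariance(z_scores(height_m), z_scores(weight_kg))
print(cov_cm, cov_m)     # differ by a factor of 100
print(corr_cm, corr_m)   # the same up to rounding
```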

Answer 20

Eigenvalue > 1; use a scree plot and look for a 'natural break'. Subjective balance between ‘lots of variation captured’ and ‘not too many components’ - harder to justify
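The eigenvalue > 1 rule can be sketched as follows, assuming the eigenvalues come from a correlation matrix (so they sum to the number of variables). The eigenvalues below are hypothetical numbers chosen for illustration, not results from any dataset.

```python
def retain_eigenvalue_gt_1(eigenvalues):
    """1-based indices of components whose eigenvalue exceeds 1."""
    return [i for i, ev in enumerate(eigenvalues, start=1) if ev > 1.0]

def proportion_explained(eigenvalues):
    """Share of total variance captured by each component."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

# Hypothetical eigenvalues for 5 standardised variables (they sum to 5)
eigenvalues = [2.5, 1.2, 0.8, 0.3, 0.2]

kept = retain_eigenvalue_gt_1(eigenvalues)
props = proportion_explained(eigenvalues)
print(kept)
print(props)
```

Plotting `eigenvalues` against component number gives the scree plot; the 'natural break' is then judged by eye, which is why that route is harder to justify.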

Answer 21

(i) there are no outliers having a big influence on the correlations between variables, and (ii) the relationships are linear.

Answer 22

Conceptually similar to PCA, but the underpinning logic is rather different. Assumes there is a ‘hidden variable’ that drives the observed variables and their relationships

Answer 23

PCA - total variance; FA - shared variance
PCA - unaffected by the number of components you choose to work with; FA - the number of factors changes the coefficients
PCA - rotation of the original variables; FA - creates new axes, then rotates these
PCA - data reduction; FA - not data reduction, but uncovering hidden variables
PCA - doesn't rely on normality; FA - does rely on multivariate normality

Answer 24

factors will have either large or small loadings of any particular variable. Aids interpretability. Most popular

Answer 25

There is a highly significant difference between the variance captured by the factors and the variation in the original variables. These factors are not enough.

Answer 26

Use to find the best (linear) separation between groups, based on multiple dependent (response) variables. Useful for generating predictions about group membership of new items (new data)

Answer 27

The 'best separating dimensions' in MANOVA

Answer 28

MANOVA - reverse | Logistic regression - alternative

Answer 29

Sample size of smallest group > number of predictors
Best to have at least 4-5 times as many observations as predictors
Normality of predictors (outliers are fatal, some skew is OK)
Homogeneity of variances & covariances: important (can use z-scores)
If some predictors are highly correlated (multicollinearity), then the analysis may fail or give unreliable results

Answer 30

Canonical correlation

Answer 31

The benefit is better group discrimination; the cost is that there is often 'over-fitting', so that while the discrimination works well for the 'training' data, it works less well in predicting group membership for new data.

Answer 32

logistic regression

Answer 33

If this was a repeated-measures analysis, with subject as a random effect

Answer 34

rows come first and columns second in R

Answer 35

logistic regression

Answer 36

First, the response may not be related to x. Second, the relationship between the logit and x may be non-linear. Third, there could be other variables affecting the response.

Answer 37

x has no effect, so the proportions are the same for all values of x
model fits the data perfectly: the proportions are allowed to vary independently for every value of x. Fit x as a factor, not a covariate
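Reading this card as a contrast between the null model (one pooled proportion) and the saturated model (one proportion per value of x), which is an assumption since the question is not shown, the two extremes can be sketched with the binomial deviance. The counts are made up and `binomial_deviance` is an illustrative helper, not a library function.

```python
import math

def binomial_deviance(successes, trials, fitted_props):
    """Binomial deviance of fitted proportions against observed counts."""
    def term(observed, expected):
        # A zero observed count contributes nothing (0 * log 0 is taken as 0)
        return 0.0 if observed == 0 else observed * math.log(observed / expected)
    dev = 0.0
    for y, n, p in zip(successes, trials, fitted_props):
        dev += term(y, n * p) + term(n - y, n * (1 - p))
    return 2 * dev

# Made-up counts at three values of x
successes = [2, 5, 9]
trials = [10, 10, 10]

# Null model: x has no effect, so every group gets the pooled proportion
pooled = sum(successes) / sum(trials)
null_deviance = binomial_deviance(successes, trials, [pooled] * len(trials))

# Saturated model: x as a factor, so each group gets its own observed proportion
observed = [y / n for y, n in zip(successes, trials)]
saturated_deviance = binomial_deviance(successes, trials, observed)

print(null_deviance, saturated_deviance)
```

A logistic regression with x as a covariate falls between these two extremes, and its fit is judged by where its deviance sits relative to them.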

Answer 38

when comparing deviances rather than mean squares

Answer 39

‘loglinear model’, fitted in the same way as a logistic regression, just with predictors that are factors, not continuous

(65 cards)