Lec 19 and 20 Flashcards

Question

When pruning a tree in CART, do errors go up or down with the number of predictors?

Answer 1

Error goes down as the number of predictors increases.

Answer 2

For any classification method, your ‘best model’ will always perform well on the sample you developed the model for, so there is a risk of over-fitting. So build the model using one set of data and test it on another (= cross-validation): partition the original sample into a training set to train the model and a test set to evaluate it.
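
As a rough illustration of the training/test idea, here is a minimal sketch in Python. The data are made up, and the 1-nearest-neighbour classifier is only a stand-in chosen because it over-fits badly; `train_test_split`, `predict_1nn` and `accuracy` are hypothetical helper names, not functions from the lectures.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Randomly partition rows into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = rows[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(train, x):
    """Predict the label of x from its nearest training point (one predictor)."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

def accuracy(train, rows):
    """Fraction of rows whose label is predicted correctly from the training set."""
    hits = sum(predict_1nn(train, x) == label for x, label in rows)
    return hits / len(rows)

# Made-up data: the label is "high" when x > 5, with about 20% of labels flipped as noise.
rng = random.Random(1)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    label = "high" if x > 5 else "low"
    if rng.random() < 0.2:
        label = "low" if label == "high" else "high"
    data.append((x, label))

train, test = train_test_split(data)
train_acc = accuracy(train, train)  # evaluated on the sample the model was built from
test_acc = accuracy(train, test)    # evaluated on data the model has not seen
print(train_acc, test_acc)
```

The model looks perfect on its own training sample because every point is its own nearest neighbour; the test set gives the more honest estimate.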

Answer 3

It’s crude: it doesn’t make use of continuous information; it simply splits such variables as ‘high’ or ‘low’. But it’s robust: variables can have any distribution. And powerful: it examines all interactions. And it gives results in an easy-to-use form: a decision tree.
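
To make the ‘high’ or ‘low’ splitting concrete, here is a minimal sketch of choosing one split point for one continuous predictor. It assumes Gini impurity as the split criterion, which is a common choice but is not stated in these notes; `gini` and `best_split` are hypothetical helper names and the data are made up.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, labels):
    """Return (threshold, weighted_gini) for the best 'low'/'high' split of xs."""
    pairs = sorted(zip(xs, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        low = [lab for _, lab in pairs[:i]]
        high = [lab for _, lab in pairs[i:]]
        # Impurity of the two groups, weighted by group size.
        score = (len(low) * gini(low) + len(high) * gini(high)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, score = best_split(xs, labels)
print(threshold, score)  # 5.0 0.0
```

Only the ordering of the values matters for where a split can fall, which is one way to see why the predictor can have any distribution.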

Answer 4

‘Supervised’ learning: you know the true identity of some clusters and use these to develop a predictive model for data where you don’t know group membership. ‘Unsupervised’ learning: you don’t know what’s ‘right’ or ‘wrong’, so you try to find natural clustering patterns in the data (which can then be used in future prediction).

Answer 5

Supervised: discriminant function analysis; logistic regression (two groups); multinomial logistic regression (>2 groups and no order); support vector machines; neural networks; genetic algorithms.
Unsupervised: k-means clustering (specify the number of clusters to find; allocates data to clusters to minimise within-cluster sums-of-squares).

Answer 6

Use robust methods like CART to explore predictive relationships. Use clustering to expose unexpected ‘structure’ (groupings) in your data.

Answer 7

Multinomial logistic regression (>2 groups and no order)

Answer 8

stripchart or dotplot

Answer 9

partially summarised data

Answer 10

stripchart, scatterplot, histogram (to explore whole distributions)

Answer 11

If the notches of 2 boxes don’t overlap, medians are likely to differ

Answer 12

The box in a boxplot starts at the 25th percentile and ends at the 75th percentile, so the box contains 50% of the data. The interquartile range is the difference between the 25th and 75th percentiles, which are also referred to as the lower and upper quartiles.
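
A small worked example of the quartiles and interquartile range, using Python's standard `statistics` module on made-up values. Quartile definitions differ slightly between packages, so the exact numbers here assume the `"inclusive"` method and may not match the hinges drawn by a particular boxplot function.

```python
import statistics

values = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 18]

# The three cut points that divide the data into quarters.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range = height of the box

# Values that fall inside the box (between the lower and upper quartiles).
inside_box = [v for v in values if q1 <= v <= q3]

print(q1, median, q3, iqr)  # 4.5 8.0 11.5 7.0
print(inside_box)           # [5, 7, 8, 9, 11]
```

With only 11 values the box holds 5 of them, roughly half; the 50% figure becomes closer to exact as the sample grows.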

Answer 13

Simplify correlations to ellipses with hue & intensity indicating direction & strength

Answer 14

ANCOVA: can be represented in separate panels, or in the same plot with separate regression lines. Use shingling to help visualisation.

Answer 15

Testing hypotheses: making inferences about populations using data drawn from those populations.

Answer 16

For investigating the relationship between multiple variables: a plot of y against x conditioned upon (or broken down by) a third variable (or more if you have them). If the third variable is categorical (a factor with discrete levels/groups), you get a plot of y against x for each level of the third variable, each in a separate panel. If the third variable is continuous, it is broken into similarly sized, somewhat overlapping groups, and you get a plot of y against x for each of those.
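
The ‘similarly sized, somewhat overlapping groups’ for a continuous conditioning variable can be sketched as follows. This is a simplified scheme written for illustration, not the exact rule used by any plotting package; the function name `shingles` and its `number` and `overlap` parameters are assumptions.

```python
def shingles(values, number=3, overlap=0.5):
    """Split values into `number` overlapping groups of similar size.

    `overlap` is the fraction of each group shared with the next one.
    """
    xs = sorted(values)
    n = len(xs)
    # Group size r chosen so that `number` groups, each shifted by
    # (1 - overlap) * r from the previous one, span all n points.
    r = n / (number * (1 - overlap) + overlap)
    groups = []
    for k in range(number):
        start = int(round(k * (1 - overlap) * r))
        end = min(n, int(round(k * (1 - overlap) * r + r)))
        groups.append(xs[start:end])
    return groups

groups = shingles(list(range(1, 13)))
for g in groups:
    print(g)
# [1, 2, 3, 4, 5, 6]
# [4, 5, 6, 7, 8, 9]
# [7, 8, 9, 10, 11, 12]
```

Each group would then get its own panel showing y against x for just the rows whose third-variable value falls in that group.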

Answer 17

allocates data to clusters to minimise within-cluster sum of squares
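
A minimal sketch of how k-means might do this, using the standard alternating scheme (assign each point to its nearest centre, then move each centre to the mean of its points). The starting centres and data are made up, and `kmeans`, `squared_distance` and `mean_point` are hypothetical names.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, centres, max_iter=100):
    """Alternate assignment and centre-update steps until the centres stop moving."""
    centres = list(centres)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda i: squared_distance(p, centres[i]))
            clusters[nearest].append(p)
        # Update step: each centre moves to the mean of its cluster.
        new_centres = [mean_point(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:
            break
        centres = new_centres
    # Within-cluster sum of squares: the quantity k-means tries to minimise.
    wss = sum(squared_distance(p, centres[i])
              for i, c in enumerate(clusters) for p in c)
    return centres, clusters, wss

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centres, clusters, wss = kmeans(points, centres=[points[0], points[3]])
print(clusters)
print(wss)
```

The number of clusters is fixed by how many starting centres are supplied, matching the need to specify it in advance.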

Answer 18

Need a measure of distance between points. Join the closest two points; when two or more are joined, they are treated as a single item with a single 'location' ... Once all are joined up, create a linkage tree showing the pattern of joins.
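
The joining procedure can be sketched in a few lines for points on a single axis. This sketch assumes a joined item's 'location' is the mean of its members (centroid linkage); the notes do not say which linkage rule is meant, and `agglomerate` is a hypothetical name.

```python
def agglomerate(points):
    """Repeatedly join the two closest items; return the merge history."""
    # Each item is (location, members). A joined item's location is taken
    # to be the mean of its members (an assumption, see above).
    items = [(float(p), [p]) for p in points]
    merges = []
    while len(items) > 1:
        # Find the pair of current items that are closest together.
        i, j = min(
            ((a, b) for a in range(len(items)) for b in range(a + 1, len(items))),
            key=lambda ab: abs(items[ab[0]][0] - items[ab[1]][0]),
        )
        distance = abs(items[i][0] - items[j][0])
        merges.append((items[i][1], items[j][1], distance))
        members = items[i][1] + items[j][1]
        merged = (sum(members) / len(members), members)
        # Replace the two joined items with the single merged item.
        items = [it for k, it in enumerate(items) if k not in (i, j)] + [merged]
    return merges

merges = agglomerate([1, 2, 6, 7, 15])
for left, right, distance in merges:
    print(left, right, distance)
# [1] [2] 1.0
# [6] [7] 1.0
# [1, 2] [6, 7] 5.0
# [15] [1, 2, 6, 7] 11.0
```

Reading the merge history from top to bottom gives the linkage tree: early joins at small distances sit at the bottom, later joins at larger distances sit higher up.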

Answer 19

CART for exploring predictive relationships; agglomerative/hierarchical clustering to expose unexpected structures.