models Flashcards

1
Q

Provost’s 9 main model categories

A

(s = supervised, u = unsupervised)

clustering / segmentation (u)
classification (s)
regression (s)
similarity matching (s, u)
co-occurrence grouping (u)
profiling (u)
link prediction (s, u)
data reduction (s, u)
causal modeling

2
Q

linear discriminant

A

a hyperplanar discriminant for a binary target variable will split the attribute phase space into 2 regions

fitting:
* we can apply an entropy measure to the two resulting segments to check for information gain (weighting each side by the number of instances in it)
* we can check the means of each of the classes along the hyperplane normal, and seek maximum inter-mean separation

3
Q

probability estimation tree

A

a classification tree that may be considered a hybrid between classification and regression models

leaves are annotated with a category value, and a probability

4
Q

decision tree (general)

A

for regression or classification

tunable via

  • minimum leaf size
  • number of terminal leaves allowed
  • number of nodes allowed
  • tree depth
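A minimal sketch (not part of the original card) of how these tuning knobs map onto scikit-learn's DecisionTreeClassifier; the iris data and the specific values are placeholders, and the "number of nodes allowed" knob has no single direct scikit-learn parameter.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    min_samples_leaf=5,    # minimum leaf size
    max_leaf_nodes=10,     # number of terminal leaves allowed
    max_depth=4,           # tree depth
    random_state=0,
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```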
5
Q

support vector machines (linear)

A

simplest case involves a hyperplanar fitting surface, in combination with L2 regularization, and possibly a hinge loss function

via the kernel trick, more sophisticated fitting surfaces can be used

support vectors consist of a subset of the training instances used to fit the model

6
Q

logistic regression

A

aka logit model; typically used for modeling binary classification probabilities

in simplest form:

  • a simple linear regression model in a sigmoid wrapper: p = 1/(1+exp(-M)), where M is the linear regression model (ie a linear hyperplane scalar field over the attribute phase space)
  • this is a generalized linear model, under the transform log(p/(1-p)) = multiple_regression_model

the logistic (log) loss is convex in the coefficients, so steepest descent reaches the global minimum

can be regularized on the coefficients of the linear kernel, via L1 and/or L2

offers a linear model’s interpretability, along with a linear model’s drawbacks (eg sensitivity to collinearities)
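A minimal sketch (assuming scikit-learn and its breast-cancer demo data) showing the sigmoid wrapper around the linear score M and the L2 penalty; swapping penalty to "l1" (with a compatible solver such as "saga") gives the L1-regularized variant.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# L2-regularized logistic regression on standardized predictors
model = make_pipeline(StandardScaler(), LogisticRegression(penalty="l2", C=1.0))
model.fit(X, y)

M = model.decision_function(X[:3])    # the underlying linear score M
p = model.predict_proba(X[:3])[:, 1]  # the modeled probability
print(np.allclose(p, 1 / (1 + np.exp(-M))))  # True: p = 1/(1+exp(-M))
```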

7
Q

hierarchical clustering

A

under some (cluster) metric, find the two closest clusters, and merge them; iterate

the cluster metric is called the linkage function
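A minimal SciPy sketch (random toy data; the "ward" choice is just one example of a linkage function).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # toy data

# "method" is the linkage function (the cluster metric): "single", "complete", "average", "ward", ...
Z = linkage(X, method="ward")         # iteratively merges the two closest clusters
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the merge tree into 3 clusters
print(labels)
```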

8
Q

centroid clustering

A

each cluster is represented by its cluster center, or centroid

k-means method

choose starting centers for k clusters in the predictor phase space, then iterate until the assignments (or centroids) stop changing (can be tuned over different k):
* assign each instance to the cluster whose center it is closest to
* recalculate the centroid of each of the resulting clusters
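A minimal NumPy sketch of the loop described above (toy data, k = 3, random starting centers; in practice a library routine such as scikit-learn's KMeans would be used).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) + rng.choice([-4.0, 0.0, 4.0], size=(300, 1))  # toy data

k = 3
centers = X[rng.choice(len(X), size=k, replace=False)]   # choose starting centers

for _ in range(100):
    # assign each instance to the cluster whose center is closest
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # recalculate the centroid of each resulting cluster (keep the old center if a cluster is empty)
    new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    if np.allclose(new_centers, centers):                # stop once the centroids settle
        break
    centers = new_centers

print(centers)
```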

9
Q

naive Bayes

A

for classification

generative; features are considered for giving evidence for or against target variable values; each instance gets its own pdf

allows instant updating, with new data (Bayesian property)

relies on Bayes’ rule, with the class prior p(C=c) and the instance E as the evidence: p(C=c|E) = p(E|C=c) p(C=c) / p(E)

probability of class C=c, given instance E, where e_i are individual instance-predictor values or ranges:

  • p(C=c|E) = p(e_1|c)…p(e_k|c)p(C=c) / p(E)
  • this assumes strong independence of effect of individual predictors on class values
  • without the independence assumption, p(C=c|E) is very hard to compute (“sparseness” of individual instances)

p(E)

  • can be difficult to compute accurately, so naive Bayes may leave it out, yielding a ranking classifier (ie relative class confidence vs true probabilities)
  • however, a full formula does exist, which includes p(E)

further simplified (with p(E) decomposed), to put in terms of predictor lift: p(c|E) = p(e_1|c)…p(e_k|c)p(c) / p(e_1)…p(e_k)

remove near-zero- and zero-variance predictors; be careful with predictors that have few unique values (they produce odd probability estimates)
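A minimal sketch of the formula above, assuming Gaussian per-predictor likelihoods p(e_i|c) for continuous predictors (the iris data is just a placeholder); p(E) is recovered by normalizing rather than computed directly.

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# per-class, per-predictor Gaussian parameters and class priors p(C=c)
mu = np.array([X[y == c].mean(axis=0) for c in classes])
sd = np.array([X[y == c].std(axis=0) for c in classes])
prior = np.array([np.mean(y == c) for c in classes])

def posterior(e):
    # p(C=c|E) is proportional to p(e_1|c)...p(e_k|c) p(C=c); normalize instead of computing p(E)
    likelihood = np.array([norm.pdf(e, mu[c], sd[c]).prod() for c in classes])
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()

print(posterior(X[0]))   # class-probability estimates for the first instance
```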

10
Q

non-parametric regression models

A
  • no parametric form is assumed for the relationship between predictors and dependent variable
  • the predictor does not take a predetermined form but is constructed according to information derived from the data; eg KNN, MARS
11
Q

generalized linear models

A
  • a family of models in which a function (the link) of the outcome variable’s expected value follows a basic linear regression model
  • eg log(p/(1-p)) follows a linear regression model in logistic regression
12
Q

parsimonious model

A

a model that accomplishes the desired level of explanation or prediction with as few predictor variables as possible

13
Q

linear regression (lr)

A
  • aka OLS; fit a hyperplane to the outcome variable, using a least squares condition
  • solves normal equations, and gives fit statistics (p-value on coefficients, overall R^2, etc.)
  • does not handle collinearities well (ill-conditioned or non-invertible matrices)
  • note
    • for multiple predictors it’s called multiple linear regression
    • for multiple outcome variables, it’s multivariate linear regression
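A minimal NumPy sketch of the normal equations on synthetic data (the coefficients and noise level are arbitrary); a nearly singular X'X is exactly where collinearity causes trouble.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# add an intercept column, then solve the normal equations (X'X) b = X'y
Xd = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)

resid = y - Xd @ beta
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
print(beta, r2)
```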
14
Q

PCR (lr)

A
  • PCA can be conducted prior to fitting a linear regression model, hence PCR
  • some cutoff (eg in the scree plot) is set, to retain only the most “important” (orthogonal) PCA components, and the model is trained on them
  • PCR “in the limit” (ie with enough components retained) tends to perform about as well as partial least squares (PLS)
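A minimal scikit-learn sketch of PCR (the diabetes demo data and the 5-component cutoff are placeholders for a scree-plot-driven choice).

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# scale, keep only the most "important" orthogonal components, then regress on them
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
print(cross_val_score(pcr, X, y, cv=5).mean())
```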
15
Q

PLS (lr)

A
  • partial least squares, a supervised dimension reduction procedure; takes into account both (a) collinearities, and (b) component effect on outcome variable
  • assumes a linear fit under the hood; the presence / abundance of nonlinear relationships can produce problems with PLS
  • PLS may have trouble with non-informative predictors
16
Q

penalized / regularized linear regression models (lr)

A

three main flavors
* lasso: applies an L1 norm penalty on OLS regression coefficients; has the potential to fully remove predictors
* ridge: applies an L2 norm penalty on OLS regression coefficients
* elastic net: combines the L1 and L2 penalties
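A minimal scikit-learn sketch of the three flavors (the alpha values are arbitrary; in practice they are tuned, eg by cross-validation).

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
Xs = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(Xs, y)                    # L1: can zero out coefficients entirely
ridge = Ridge(alpha=1.0).fit(Xs, y)                    # L2: shrinks but keeps all coefficients
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(Xs, y)  # mix of L1 and L2

print((lasso.coef_ == 0).sum(), (ridge.coef_ == 0).sum())  # lasso can remove predictors; ridge does not
```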

17
Q

neural networks (nrc)

A
  • single- and multi-layer perceptrons (SLP / MLP)–basic neural networks with, respectively, no hidden layers, or one or more hidden layers
  • deep learning / deep neural networks (DNNs) build on the SLP/MLP but get more sophisticated, eg by applying a convolution kernel in the hidden layers (rather than just a weighted sum), as in CNNs
  • types
    • for regression, can use raw outputs
    • for classification, can use sigmoid and/or softmax outputs (which may be interpreted as class probabilities)
  • pros/cons:
    • can model highly non-linear data
    • tend to overfit
    • can be adversely affected by predictor collinearities
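A minimal scikit-learn MLP sketch (one hidden layer on the breast-cancer demo data; the L2 penalty alpha is one way to counter the overfitting tendency noted above).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10,), alpha=1e-3, max_iter=2000, random_state=0),
)
mlp.fit(X, y)
print(mlp.predict_proba(X[:2]))   # softmax/sigmoid outputs read as class probabilities
```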
18
Q

MARS (nr)

A
  • multivariate adaptive regression splines
  • breaks each predictor’s range into 2 groups and models a linear relationship between the predictor and the outcome in each group
  • the changeover point between the 2 groups is called the cut point, aka knot
  • also does feature pruning (ie removing one or the other side of a cut point)
  • an nth-degree MARS model allows product terms of up to n hinge features (eg a second-degree model can include products of two predictors’ hinge functions)
19
Q

SVM (nr)

A
  • support vector machines
  • epsilon-insensitive regression version
    • eg with a linear kernel, there is an epsilon “buffer” above and below the fitting hyperplane, where sample points within this buffer do not contribute to the fit / cost function
    • sample points outside the epsilon buffer are eligible to be the “support vectors” that are used in the fitted model
    • the cost parameter is the “main tool” for adjusting the complexity of these SVM models
  • cost tuning:
    • a large cost will amplify effects of errors, making the model “very flexible” (with risk of overfitting)
    • a small cost will make the model more rigid (and less likely to overfit)
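A minimal scikit-learn sketch of epsilon-insensitive regression with a linear kernel (the epsilon and C values are placeholders to be tuned).

```python
from sklearn.datasets import load_diabetes
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

# epsilon sets the width of the buffer around the fit; C is the cost parameter
# (large C -> more flexible / risk of overfitting, small C -> more rigid)
svr = make_pipeline(StandardScaler(), SVR(kernel="linear", epsilon=10.0, C=1.0))
svr.fit(X, y)
print(svr[-1].support_vectors_.shape)   # the support vectors retained by the fit
```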
20
Q

KNN (nrc)

A
  • consider k nearest neighbors to a test point
    • regression–average their values
    • classification
      • quorum sense (take eg most frequent)
      • take proportions (for probabilities)
  • the metric that determines “nearest” can be varied
  • reqs:
    • center and scale predictors prior to fitting
    • susceptible to noisy or irrelevant predictors
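A minimal scikit-learn sketch, including the centering/scaling step the card calls for (k = 5 and the Euclidean metric are placeholders).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

knn = make_pipeline(
    StandardScaler(),                                         # center and scale first
    KNeighborsClassifier(n_neighbors=5, metric="euclidean"),  # the "nearest" metric can be varied
)
knn.fit(X, y)
print(knn.predict(X[:3]))        # majority vote among the 5 neighbors
print(knn.predict_proba(X[:3]))  # neighbor class proportions as probabilities
```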
21
Q

decision trees (general / CART) (nrc)

A
  • depth of tree–the length of the longest path from root to leaf
  • leaves / terminal nodes can have their own prediction equation:
    • for regression, average the samples
    • for classification, take majority class, or class proportions for probability
  • are developed by considering many different splits at each node
    • search over every predictor, and every value of that predictor
    • pick best error-reducing split
      • for regression, SSE
      • for classification, node purity via entropy or Gini
  • pruning may occur post-initial-fit, using a complexity penalty, a constant multiplied by the number of nodes, which gets added to the tree error metric
  • may be susceptible to
    • large-degree collinearities
    • predictor-granularity selection bias
    • high variability in tree structure between model fits (w/ associated variance in accuracy)
22
Q

conditional inference trees (nr)

A
  • each proposed split point is assessed statistically for a significant difference between the two post-split groups, eg via a t-test on the difference between group means
  • may help address selection bias problem between predictors of varying granularity
23
Q

regression model trees (nr)

A
  • instead of averaging outcomes in a leaf, use a (regression) model at every node
  • pruning occurs after initial fit, to remove inadequate subtrees
  • once the tree is fit, a prediction at a leaf / terminal node is made by blending (averaging) the predictions of the node models along the sample’s path from root to leaf
24
Q

rule based models (nrc)

A
  • generally start with creating a tree, then the total set of rules associated with the tree are refined
  • rule-based models allow further customization beyond tree-based models, by tweaking tree rules, eg removing entire rules, or some conditions defining a particular rule
25
Q

bagging (and trees) (nrc)

A
  • bootstrap aggregation, an ensemble model method, often useful for high variance models (like trees)
  • basic algorithm (for trees)
    • generate a bootstrap sample of the data
    • train an unpruned tree model on the sample
  • allows out-of-bag error estimates, since bootstrapping usually leaves some samples “out” of each tree’s growth phase
  • predictions are made by, eg for regression, averaging the output of all trees in the ensemble
  • can suffer from inter-tree correlation, and the benefit tends to plateau beyond a modest ensemble size (eg ~50 trees)
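A minimal NumPy/scikit-learn sketch of the basic algorithm (25 trees on the diabetes demo data, chosen arbitrarily).

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))                 # bootstrap sample (with replacement)
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))  # unpruned tree on that sample

# regression prediction: average the output of all trees in the ensemble
pred = np.mean([t.predict(X[:5]) for t in trees], axis=0)
print(pred)
```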
26
Q

random forests (nrc)

A
  • an ensemble tree method that reduces the inter-tree correlation seen with eg bagging
  • for each tree,
    • select a bootstrap sample of the data
    • train, where each split only considers a subset of the total set of predictors
    • do not prune
  • aggregate the trees’ outputs (eg averaging for regression, majority vote for classification) to generate predictions from the forest
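A minimal scikit-learn sketch; max_features is the per-split predictor subset mentioned above (the "sqrt" choice and the ensemble size are placeholders).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=200,      # number of bootstrap-trained, unpruned trees
    max_features="sqrt",   # each split considers only a subset of the predictors
    random_state=0,
)
rf.fit(X, y)
print(rf.predict_proba(X[:2]))   # class proportions aggregated over the forest's trees
```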
27
Q

gradient boosting (plain and stochastic) (nr)

A
  • a “stacked” model ensemble, where later models “refine” the results of earlier models in the stack
  • procedure:
    • each level’s residuals are used to train the next level’s tree (in the simplest version this corresponds to squared-error loss, for which the residual is the negative gradient)
    • to make predictions from the fitted model, all (stacked) trees’ outputs are added
  • for stochastic gradient boosting, add bagging; ie at each level of the stack, a bootstrap sample is taken for that tree’s training
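A minimal NumPy/scikit-learn sketch of the plain (squared-error) version; the shrinkage/learning rate, tree depth, and number of stages are extra tuning knobs not on the card. Subsampling each stage's training data would give the stochastic variant.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

learning_rate = 0.1                      # shrinkage (an extra knob, not on the card)
pred = np.full(len(y), y.mean())
trees = []
for _ in range(100):
    resid = y - pred                     # residual = negative gradient of squared-error loss
    tree = DecisionTreeRegressor(max_depth=3).fit(X, resid)   # train this level on the residuals
    trees.append(tree)
    pred += learning_rate * tree.predict(X)                   # stacked trees' outputs are added

print(np.mean((y - pred) ** 2))          # training MSE after boosting
```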
28
Q

LDA (lc)

A
  • linear discriminant analysis, a linear model
  • offers dimensionality reduction
    • the maximum number of discriminant directions, for K classes, is min(number of predictors, K-1)
  • can produce class probabilities (at least Welch form)
  • (Welch) via Bayes
    • class-conditional, equal-covariance Gaussian probability distributions P(X=x|class j) are fit to each of the K classes in the predictor phase space
    • combined with class priors, class predictions for a test point arise from finding highest probability class (considering all K Gaussians)
  • (Fisher)
    • variance-based, a la signal-to-noise
    • maximize the distance between the “centers” of different groups, while minimizing the variance of the data within groups
  • overall the model is quite fussy: sensitive to collinearities and low/zero-variance predictors, and it favors centered/scaled predictors
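A minimal scikit-learn sketch on the iris data (4 predictors, K = 3 classes, so at most min(4, K-1) = 2 discriminant directions); centering/scaling is included because of the fussiness noted above.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)          # 4 predictors, K = 3 classes

lda = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(n_components=2))
lda.fit(X, y)
print(lda.predict_proba(X[:2]))            # class probabilities
print(lda.transform(X[:2]).shape)          # (2, 2): at most min(n_predictors, K-1) directions
```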
29
Q

PLSDA (lc)

A
  • partial least squares discriminant analysis
  • the algorithm applies PLS (from regression) to an “enhanced” outcome data type, a one-hot encoded “outcome matrix” (vs vector)
  • the model produces a K-valued (or (K-1)-valued) tuple per sample, eg (-1.2, 9.85, 1.3); take the max as the class prediction
  • there are some techniques to obtain class probabilities
30
Q

nearest shrunken centroids (lc)

A
  • a linear model; aka predictive analysis for microarrays (PAM)
  • for K classes, locate the training set centroids for the K classes in predictor phase space
  • for a test instance, the predicted class is the closest centroid
  • shrinkage subtracts a fixed amount from each centroid coordinate (per predictor), moving the class centroids toward the overall centroid / origin
  • if a predictor dimension “collapses” (ie all class centroids reach zero on that axis), it is removed (feature selection)
  • the model is tuned over the shrink level of the centroids
  • can produce probabilities and variable importance
  • favors centering and scaling first
31
Q

QDA / RDA (nc)

A
  • quadratic discriminant analysis (with or without regularization)
  • relax Welch LDA requirements of equal covariance matrices for Gaussians
  • the inter-class boundaries become quadratic
  • RDA
    • a kind of hybrid between LDA and QDA
    • a parameter linearly mixes between single covariance matrix for all classes (Welch LDA), and separate covariance matrices (QDA)
  • considerations:
    • need sufficient number of samples per class
    • avoid collinearities
    • too many one-hot/binary predictors can make QDA no better than LDA
32
Q

MDA (nc)

A
  • mixture discriminant analysis
  • assuming a fixed covariance matrix, model each class-conditional distribution with one *or more* Gaussian distributions
  • can be regularized (L1 and L2)
  • may not work well with more complex class boundaries, or a lot of binary / one-hot predictors
33
Q

FDA (nc)

A
  • flexible discriminant analysis, a kind of regularized LDA; non-linear (in general)
  • instead of Welch’s fitting class conditionals with same-covariance-matrix Gaussians, fit with a more flexible (non-parametric) model, like MARS
  • the inner model (like MARS) acts, among other things, like a basis expansion, which then has LDA applied to it
  • more flexible than LDA, with more complex boundaries, but may overfit
34
Q

SVM (nc)

A
  • at the core, a discriminant function D(u) = B0 + B·u, with B normal to the separating hyperplane
  • as the model is trained, B “becomes” a weighted sum of training set instances, called the support vectors (in the separable case, these amount to the instances on the boundary-margin hyperplanes)
  • the kernel trick
    • allows replacing B·u with a sum over the (support vector) training set instances, each “wrapped” in a symmetric positive definite kernel and then weighted
    • effectively amounts to a basis expansion
  • reqs:
    • center and scale
    • can be negatively affected by non-informative predictors
    • class probabilities are not “native”
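A minimal scikit-learn sketch with an RBF kernel as the kernel trick; probability=True bolts on (non-native) class probabilities via Platt scaling, and the C value is a placeholder.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

svm = make_pipeline(
    StandardScaler(),                                             # center and scale
    SVC(kernel="rbf", C=1.0, probability=True, random_state=0),   # kernel trick + Platt-scaled probabilities
)
svm.fit(X, y)
print(svm[-1].n_support_)          # number of support vectors per class
print(svm.predict_proba(X[:2]))
```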
35
Q

boosting (trees) (nc)

A
  • for non-gradient, eg for trees / AdaBoost:
    • each sample gets a sample weight, and each stack-level tree gets a stage weight
    • a stage weight reflects that stack-level tree’s overall classification error
    • sample weights are updated according to how well/poorly the tree at that level classifies each sample
    • to make a prediction, sum over all tree levels, with each level’s class prediction (coded ±1) multiplied by its stage weight, and take the sign of the total
  • for gradient-based methods, similar to AdaBoost but would:
    • include a loss function, such as multinomial deviance
    • train the next tree in the stack on targets derived from gradient of loss function