models Flashcards

1
Q

Provost’s 9 main model categories

A

(s = supervised, u = unsupervised)

clustering / segmentation (u)
classification (s)
regression (s)
similarity matching (s, u)
co-occurrence grouping (u)
profiling (u)
link prediction (s, u)
data reduction (s, u)
causal modeling

2
Q

linear discriminant

A

a hyperplanar discriminant for a binary target variable will split the attribute phase space into 2 regions

fitting:
* we can apply an entropy measure to the two resulting segments to check for information gain (weighting each side by the number of instances in it)
* we can check the means of each of the classes along the hyperplane normal, and seek maximum inter-mean separation

3
Q

probability estimation tree

A

a classification tree that may be considered a hybrid between classification and regression models

leaves are annotated with a category value, and a probability

4
Q

decision tree (general)

A

for regression or classification

tunable via

  • minimum leaf size
  • number of terminal leaves allowed
  • number of nodes allowed
  • tree depth
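A minimal sketch (not part of the original card) of how these tuning knobs map onto scikit-learn's DecisionTreeClassifier; the iris data and the specific values are placeholders, and the "number of nodes allowed" knob has no single direct scikit-learn parameter.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    min_samples_leaf=5,    # minimum leaf size
    max_leaf_nodes=10,     # number of terminal leaves allowed
    max_depth=4,           # tree depth
    random_state=0,
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```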
5
Q

support vector machines (linear)

A

simplest case involves a hyperplanar fitting surface, in combination with L2 regularization, and possibly a hinge loss function

via the kernel trick, more sophisticated fitting surfaces can be used

support vectors consist of a subset of the training instances used to fit the model

6
Q

logistic regression

A

aka logit model; typically used for modeling binary classification probabilities

in simplest form:

  • a simple linear regression model in a sigmoid wrapper: p = 1/(1+exp(-M)), where M is the linear regression model (ie a linear hyperplane scalar field over the attribute phase space)
  • this is a generalized linear model, under the transform log(p/(1-p)) = multiple_regression_model

the logistic (log) loss is convex in the coefficients, so steepest descent reaches the global minimum

can be regularized on the coefficients of the linear kernel, via L1 and/or L2

offers a linear model’s interpretability, along with a linear model’s drawbacks (eg sensitivity to collinearities)
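A minimal sketch (assuming scikit-learn and its breast-cancer demo data) showing the sigmoid wrapper around the linear score M and the L2 penalty; swapping penalty to "l1" (with a compatible solver such as "saga") gives the L1-regularized variant.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# L2-regularized logistic regression on standardized predictors
model = make_pipeline(StandardScaler(), LogisticRegression(penalty="l2", C=1.0))
model.fit(X, y)

M = model.decision_function(X[:3])    # the underlying linear score M
p = model.predict_proba(X[:3])[:, 1]  # the modeled probability
print(np.allclose(p, 1 / (1 + np.exp(-M))))  # True: p = 1/(1+exp(-M))
```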

7
Q

hierarchical clustering

A

under some (cluster) metric, find the two closest clusters, and merge them; iterate

the cluster metric is called the linkage function
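A minimal SciPy sketch (random toy data; the "ward" choice is just one example of a linkage function).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # toy data

# "method" is the linkage function (the cluster metric): "single", "complete", "average", "ward", ...
Z = linkage(X, method="ward")         # iteratively merges the two closest clusters
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the merge tree into 3 clusters
print(labels)
```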

8
Q

centroid clustering

A

each cluster is represented by its cluster center, or centroid

k-means method

choose starting centers for k clusters in the predictor phase space, then iterate until the assignments (or centroids) stop changing (can be tuned over different k):
* assign each instance to the cluster whose center it is closest to
* recalculate the centroid of each of the resulting clusters
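A minimal NumPy sketch of the loop described above (toy data, k = 3, random starting centers; in practice a library routine such as scikit-learn's KMeans would be used).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) + rng.choice([-4.0, 0.0, 4.0], size=(300, 1))  # toy data

k = 3
centers = X[rng.choice(len(X), size=k, replace=False)]   # choose starting centers

for _ in range(100):
    # assign each instance to the cluster whose center is closest
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # recalculate the centroid of each resulting cluster (keep the old center if a cluster is empty)
    new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    if np.allclose(new_centers, centers):                # stop once the centroids settle
        break
    centers = new_centers

print(centers)
```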

9
Q

naive Bayes

A

for classification

generative; features are considered for giving evidence for or against target variable values; each instance gets its own pdf

allows instant updating, with new data (Bayesian property)

relies on Bayes’ rule, with the class prior p(C=c) and the instance E as the evidence: p(C=c|E) = p(E|C=c) p(C=c) / p(E)

probability of class C=c, given instance E, where e_i are individual instance-predictor values or ranges:

  • p(C=c|E) = p(e_1|c)…p(e_k|c)p(C=c) / p(E)
  • this assumes strong independence of effect of individual predictors on class values
  • without the independence assumption, p(C=c|E) is very hard to compute (“sparseness” of individual instances)

p(E)

  • can be difficult to compute accurately, so naive Bayes may leave it out, yielding a ranking classifier (ie relative class confidence vs true probabilities)
  • however, a full formula does exist, which includes p(E)

further simplified (with p(E) decomposed), to put in terms of predictor lift: p(c|E) = p(e_1|c)…p(e_k|c)p(c) / p(e_1)…p(e_k)

remove near-zero- and zero-variance predictors; be careful with predictors that have few unique values (they produce odd probability estimates)
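A minimal sketch of the formula above, assuming Gaussian per-predictor likelihoods p(e_i|c) for continuous predictors (the iris data is just a placeholder); p(E) is recovered by normalizing rather than computed directly.

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# per-class, per-predictor Gaussian parameters and class priors p(C=c)
mu = np.array([X[y == c].mean(axis=0) for c in classes])
sd = np.array([X[y == c].std(axis=0) for c in classes])
prior = np.array([np.mean(y == c) for c in classes])

def posterior(e):
    # p(C=c|E) is proportional to p(e_1|c)...p(e_k|c) p(C=c); normalize instead of computing p(E)
    likelihood = np.array([norm.pdf(e, mu[c], sd[c]).prod() for c in classes])
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()

print(posterior(X[0]))   # class-probability estimates for the first instance
```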

10
Q

non-parametric regression models

A
  • no parametric form is assumed for the relationship between predictors and dependent variable
  • the predictor does not take a predetermined form but is constructed according to information derived from the data; eg KNN, MARS
11
Q

generalized linear models

A
  • a family of models in which a function (the link) of the outcome variable’s expected value follows a basic linear regression model
  • eg log(p/(1-p)) follows a linear regression model in logistic regression
12
Q

parsimonious model

A

a model that accomplishes the desired level of explanation or prediction with as few predictor variables as possible

13
Q

linear regression (lr)

A
  • aka OLS; fit a hyperplane to the outcome variable, using a least squares condition
  • solves normal equations, and gives fit statistics (p-value on coefficients, overall R^2, etc.)
  • does not handle collinearities well (ill-conditioned or non-invertible matrices)
  • note
    • for multiple predictors it’s called multiple linear regression
    • for multiple outcome variables, it’s multivariate linear regression
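A minimal NumPy sketch of the normal equations on synthetic data (the coefficients and noise level are arbitrary); a nearly singular X'X is exactly where collinearity causes trouble.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# add an intercept column, then solve the normal equations (X'X) b = X'y
Xd = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)

resid = y - Xd @ beta
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
print(beta, r2)
```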
14
Q

PCR (lr)

A
  • PCA can be conducted prior to fitting a linear regression model, hence PCR
  • some cutoff (eg in the scree plot) is set, to retain only the most “important” (orthogonal) PCA components, and the model is trained on them
  • PCR “in the limit” (ie with enough components retained) tends to perform about as well as partial least squares (PLS)
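A minimal scikit-learn sketch of PCR (the diabetes demo data and the 5-component cutoff are placeholders for a scree-plot-driven choice).

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# scale, keep only the most "important" orthogonal components, then regress on them
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
print(cross_val_score(pcr, X, y, cv=5).mean())
```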
15
Q

PLS (lr)

A
  • partial least squares, a supervised dimension reduction procedure; takes into account both (a) collinearities, and (b) component effect on outcome variable
  • assumes a linear fit under the hood; the presence / abundance of nonlinear relationships can produce problems with PLS
  • PLS may have trouble with non-informative predictors
16
Q

penalized / regularized linear regression models (lr)

A

three main flavors
* lasso: applies an L1 norm penalty on OLS regression coefficients; has the potential to fully remove predictors
* ridge: applies an L2 norm penalty on OLS regression coefficients
* elastic net: combines the L1 and L2 penalties
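A minimal scikit-learn sketch of the three flavors (the alpha values are arbitrary; in practice they are tuned, eg by cross-validation).

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
Xs = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(Xs, y)                    # L1: can zero out coefficients entirely
ridge = Ridge(alpha=1.0).fit(Xs, y)                    # L2: shrinks but keeps all coefficients
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(Xs, y)  # mix of L1 and L2

print((lasso.coef_ == 0).sum(), (ridge.coef_ == 0).sum())  # lasso can remove predictors; ridge does not
```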

17
Q

neural networks (nrc)

A
  • single- and multi-layer perceptrons (SLP / MLP)–basic neural networks with, respectively, no hidden layers, or one or more hidden layers
  • deep learning / deep neural networks (DNNs) build on the SLP/MLP but get more sophisticated, eg by applying a convolution kernel in the hidden layers (rather than just a weighted sum), as in CNNs
  • types
    • for regression, can use raw outputs
    • for classification, can use sigmoid and/or softmax outputs (which may be interpreted as class probabilities)
  • pros/cons:
    • can model highly non-linear data
    • tend to overfit
    • can be adversely affected by predictor collinearities
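A minimal scikit-learn MLP sketch (one hidden layer on the breast-cancer demo data; the L2 penalty alpha is one way to counter the overfitting tendency noted above).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10,), alpha=1e-3, max_iter=2000, random_state=0),
)
mlp.fit(X, y)
print(mlp.predict_proba(X[:2]))   # softmax/sigmoid outputs read as class probabilities
```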
18
Q

MARS (nr)

A
  • multivariate adaptive regression splines
  • breaks each predictor’s range into 2 groups and models a linear relationship between the predictor and the outcome in each group
  • the changeover point between the 2 groups is called the cut point, aka knot
  • also does feature pruning (ie removing one or the other side of a cut point)
  • an nth-degree MARS model allows product terms of up to n hinge features (eg a second-degree model can include products of two predictors’ hinge functions)
19
Q

SVM (nr)

A
  • support vector machines
  • epsilon-insensitive regression version
    • eg with a linear kernel, there is an epsilon “buffer” above and below the fitting hyperplane, where sample points within this buffer do not contribute to the fit / cost function
    • sample points outside the epsilon buffer are eligible to be the “support vectors” that are used in the fitted model
    • the cost parameter is the “main tool” for adjusting the complexity of these SVM models
  • cost tuning:
    • a large cost will amplify effects of errors, making the model “very flexible” (with risk of overfitting)
    • a small cost will make the model more rigid (and less likely to overfit)
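A minimal scikit-learn sketch of epsilon-insensitive regression with a linear kernel (the epsilon and C values are placeholders to be tuned).

```python
from sklearn.datasets import load_diabetes
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

# epsilon sets the width of the buffer around the fit; C is the cost parameter
# (large C -> more flexible / risk of overfitting, small C -> more rigid)
svr = make_pipeline(StandardScaler(), SVR(kernel="linear", epsilon=10.0, C=1.0))
svr.fit(X, y)
print(svr[-1].support_vectors_.shape)   # the support vectors retained by the fit
```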
20
Q

KNN (nrc)

A
  • consider k nearest neighbors to a test point
    • regression–average their values
    • classification
      • quorum sense (take eg most frequent)
      • take proportions (for probabilities)
  • the metric that determines “nearest” can be varied
  • reqs:
    • center and scale predictors prior to fitting
    • susceptible to noisy or irrelevant predictors
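A minimal scikit-learn sketch, including the centering/scaling step the card calls for (k = 5 and the Euclidean metric are placeholders).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

knn = make_pipeline(
    StandardScaler(),                                         # center and scale first
    KNeighborsClassifier(n_neighbors=5, metric="euclidean"),  # the "nearest" metric can be varied
)
knn.fit(X, y)
print(knn.predict(X[:3]))        # majority vote among the 5 neighbors
print(knn.predict_proba(X[:3]))  # neighbor class proportions as probabilities
```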
21
Q

decision trees (general / CART) (nrc)

A
  • depth of tree–the length of the longest path from root to leaf
  • leaves / terminal nodes can have their own prediction equation:
    • for regression, average the samples
    • for classification, take majority class, or class proportions for probability
  • are developed by considering many different splits at each node
    • search over every predictor, and every value of that predictor
    • pick best error-reducing split
      • for regression, SSE
      • for classification, node purity via entropy or Gini
  • pruning may occur post-initial-fit, using a complexity penalty, a constant multiplied by the number of nodes, which gets added to the tree error metric
  • may be susceptible to
    • large-degree collinearities
    • predictor-granularity selection bias
    • high variability in tree structure between model fits (w/ associated variance in accuracy)
22
Q

conditional inference trees (nr)

A
  • each proposed split point is assessed statistically for a significant difference between the two post-split groups, eg via a t-test on the difference between group means
  • may help address selection bias problem between predictors of varying granularity
23
Q

regression model trees (nr)

A
  • instead of averaging outcomes in a leaf, use a (regression) model at every node
  • pruning occurs after initial fit, to remove inadequate subtrees
  • once the tree is fit, a prediction at a leaf / terminal node is made by blending (averaging) the predictions of the node models along the sample’s path from root to leaf
24
Q

rule based models (nrc)

A
  • generally start with creating a tree, then the total set of rules associated with the tree are refined
  • rule-based models allow further customization beyond tree-based models, by tweaking tree rules, eg removing entire rules, or some conditions defining a particular rule
25
Q

bagging (and trees) (nrc)

A
  • bootstrap aggregation, an ensemble model method, often useful for high variance models (like trees)
  • basic algorithm (for trees)
    • generate a bootstrap sample of the data
    • train an unpruned tree model on the sample
  • allows out-of-bag error estimates, since bootstrapping usually leaves some samples “out” of each tree’s growth phase
  • predictions are made by, eg for regression, averaging the output of all trees in the ensemble
  • can suffer from inter-tree correlation, and the benefit tends to plateau beyond a modest ensemble size (eg ~50 trees)
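A minimal NumPy/scikit-learn sketch of the basic algorithm (25 trees on the diabetes demo data, chosen arbitrarily).

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))                 # bootstrap sample (with replacement)
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))  # unpruned tree on that sample

# regression prediction: average the output of all trees in the ensemble
pred = np.mean([t.predict(X[:5]) for t in trees], axis=0)
print(pred)
```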
26
Q

random forests (nrc)

A
  • an ensemble tree method that reduces the inter-tree correlation seen with eg bagging
  • for each tree,
    • select a bootstrap sample of the data
    • train, where each split only considers a subset of the total set of predictors
    • do not prune
  • aggregate the trees’ outputs (eg averaging for regression, majority vote for classification) to generate predictions from the forest
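A minimal scikit-learn sketch; max_features is the per-split predictor subset mentioned above (the "sqrt" choice and the ensemble size are placeholders).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=200,      # number of bootstrap-trained, unpruned trees
    max_features="sqrt",   # each split considers only a subset of the predictors
    random_state=0,
)
rf.fit(X, y)
print(rf.predict_proba(X[:2]))   # class proportions aggregated over the forest's trees
```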
27
Q

gradient boosting (plain and stochastic) (nr)

A
  • a “stacked” model ensemble, where later models “refine” the results of earlier models in the stack
  • procedure:
    • each level’s residuals are used to train the next level’s tree (in the simplest version this corresponds to squared-error loss, for which the residual is the negative gradient)
    • to make predictions from the fitted model, all (stacked) trees’ outputs are added
  • for stochastic gradient boosting, add bagging; ie at each level of the stack, a bootstrap sample is taken for that tree’s training
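A minimal NumPy/scikit-learn sketch of the plain (squared-error) version; the shrinkage/learning rate, tree depth, and number of stages are extra tuning knobs not on the card. Subsampling each stage's training data would give the stochastic variant.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

learning_rate = 0.1                      # shrinkage (an extra knob, not on the card)
pred = np.full(len(y), y.mean())
trees = []
for _ in range(100):
    resid = y - pred                     # residual = negative gradient of squared-error loss
    tree = DecisionTreeRegressor(max_depth=3).fit(X, resid)   # train this level on the residuals
    trees.append(tree)
    pred += learning_rate * tree.predict(X)                   # stacked trees' outputs are added

print(np.mean((y - pred) ** 2))          # training MSE after boosting
```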
28
Q

LDA (lc)

A
  • linear discriminant analysis, a linear model
  • offers dimensionality reduction
    • the maximum number of discriminant directions, for K classes, is min(number of predictors, K-1)
  • can produce class probabilities (at least Welch form)
  • (Welch) via Bayes
    • class-conditional, equal-covariance Gaussian probability distributions P(X=x|class j) are fit to each of the K classes in the predictor phase space
    • combined with class priors, class predictions for a test point arise from finding highest probability class (considering all K Gaussians)
  • (Fisher)
    • variance-based, a la signal-to-noise
    • maximize the distance between the “centers” of different groups, while minimizing the variance of the data within groups
  • overall the model is quite fussy: sensitive to collinearities and low/zero-variance predictors, and it favors centered/scaled predictors
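A minimal scikit-learn sketch on the iris data (4 predictors, K = 3 classes, so at most min(4, K-1) = 2 discriminant directions); centering/scaling is included because of the fussiness noted above.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)          # 4 predictors, K = 3 classes

lda = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(n_components=2))
lda.fit(X, y)
print(lda.predict_proba(X[:2]))            # class probabilities
print(lda.transform(X[:2]).shape)          # (2, 2): at most min(n_predictors, K-1) directions
```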
29
Q

PLSDA (lc)

A
  • partial least squares discriminant analysis
  • the algorithm applies PLS (from regression) to an “enhanced” outcome data type, a one-hot encoded “outcome matrix” (vs vector)
  • the model produces a K-valued (or (K-1)-valued) tuple per sample, eg (-1.2, 9.85, 1.3); take the max as the class prediction
  • there are some techniques to obtain class probabilities
30
Q

nearest shrunken centroids (lc)

A
  • a linear model; aka predictive analysis for microarrays (PAM)
  • for K classes, locate the training set centroids for the K classes in predictor phase space
  • for a test instance, the predicted class is the closest centroid
  • shrinkage subtracts a fixed amount from each centroid coordinate (per predictor), moving the class centroids toward the overall centroid / origin
  • if a predictor dimension “collapses” (ie all class centroids reach zero on that axis), it is removed (feature selection)
  • the model is tuned over the shrink level of the centroids
  • can produce probabilities and variable importance
  • favors centering and scaling first
31
Q

QDA / RDA (nc)

A
  • quadratic discriminant analysis (with or without regularization)
  • relax Welch LDA requirements of equal covariance matrices for Gaussians
  • the inter-class boundaries become quadratic
  • RDA
    • a kind of hybrid between LDA and QDA
    • a parameter linearly mixes between single covariance matrix for all classes (Welch LDA), and separate covariance matrices (QDA)
  • considerations:
    • need sufficient number of samples per class
    • avoid collinearities
    • too many one-hot/binary predictors can make QDA no better than LDA
32
Q

MDA (nc)

A
  • mixture discriminant analysis
  • assuming a fixed covariance matrix, model each class-conditional distribution with one *or more* Gaussian distributions
  • can be regularized (L1 and L2)
  • may not work well with more complex class boundaries, or a lot of binary / one-hot predictors
33
Q

FDA (nc)

A
  • flexible discriminant analysis, a kind of regularized LDA; non-linear (in general)
  • instead of Welch’s fitting class conditionals with same-covariance-matrix Gaussians, fit with a more flexible (non-parametric) model, like MARS
  • the inner model (like MARS) acts, among other things, like a basis expansion, which then has LDA applied to it
  • more flexible than LDA, with more complex boundaries, but may overfit
34
Q

SVM (nc)

A
  • at the core, a discriminant function D(u) = B0 + B·u, with B normal to the separating hyperplane
  • as the model is trained, B “becomes” a weighted sum of training set instances, called the support vectors (in the separable case, these amount to the instances on the boundary-margin hyperplanes)
  • the kernel trick
    • allows replacing B·u with a sum over the (support vector) training set instances, each “wrapped” in a symmetric positive definite kernel and then weighted
    • effectively amounts to a basis expansion
  • reqs:
    • center and scale
    • can be negatively affected by non-informative predictors
    • class probabilities are not “native”
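A minimal scikit-learn sketch with an RBF kernel as the kernel trick; probability=True bolts on (non-native) class probabilities via Platt scaling, and the C value is a placeholder.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

svm = make_pipeline(
    StandardScaler(),                                             # center and scale
    SVC(kernel="rbf", C=1.0, probability=True, random_state=0),   # kernel trick + Platt-scaled probabilities
)
svm.fit(X, y)
print(svm[-1].n_support_)          # number of support vectors per class
print(svm.predict_proba(X[:2]))
```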
35
Q

boosting (trees) (nc)

A
  • for non-gradient, eg for trees / AdaBoost:
    • each sample gets a sample weight, and each stack-level tree gets a stage weight
    • a stage weight reflects that stack-level tree’s overall classification error
    • sample weights are updated according to how well/poorly the tree at that level classifies each sample
    • to make a prediction, sum over all tree levels, with each level’s class prediction (coded ±1) multiplied by its stage weight, and take the sign of the total
  • for gradient-based methods, similar to AdaBoost but would:
    • include a loss function, such as multinomial deviance
    • train the next tree in the stack on targets derived from gradient of loss function