models Flashcards
Provost’s 9 main model categories (s = supervised, u = unsupervised)
clustering / segmentation (u)
classification (s)
regression (s)
similarity matching (s, u)
co-occurrence grouping (u)
profiling (u)
link prediction (s, u)
data reduction (s, u)
causal modeling
linear discriminant
a hyperplanar discriminant for a binary target variable will split the attribute phase space into 2 regions
fitting:
* we can apply an entropy measure to the two resulting segments, to check for information gain (weighting each side by the number of instances in it)
* we can check the means of each of the classes along the hyperplane normal, and seek maximum inter-mean separation
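a minimal numpy sketch (not from the card) of the entropy check: w and b are an assumed candidate hyperplane, and the gain is parent entropy minus the instance-weighted entropy of the two segments

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of the class labels in one segment
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_information_gain(X, y, w, b):
    # split instances by the sign of the hyperplane score w.x + b, then
    # compare parent entropy to the instance-weighted average entropy
    # of the two resulting segments
    side = (X @ w + b) > 0
    n = len(y)
    children = sum((mask.sum() / n) * entropy(y[mask])
                   for mask in (side, ~side) if mask.any())
    return entropy(y) - children
```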
probability estimation tree
a classification tree that may be considered a hybrid between classification and regression models
leaves are annotated with a category value, and a probability
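a quick sklearn illustration (my example, not the card’s): a fitted classification tree can return both the leaf’s category value and its class probabilities

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# each leaf stores class counts, so the same tree yields a hard
# category (predict) and a class probability (predict_proba)
print(tree.predict(X[:3]))
print(tree.predict_proba(X[:3]))
```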
decision tree (general)
for regression or classification
tunable via
- minimum leaf size
- number of terminal leaves allowed
- number of nodes allowed
- tree depth
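roughly how these knobs map onto scikit-learn’s parameter names (a sketch; the node-count limit has no direct sklearn equivalent, max_leaf_nodes is the closest)

```python
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(
    min_samples_leaf=20,  # minimum leaf size
    max_leaf_nodes=50,    # number of terminal leaves allowed
    max_depth=6,          # tree depth
)
```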
support vector machines (linear)
simplest case involves a hyperplanar fitting surface, in combination with L2 regularization and a hinge loss function
via the kernel trick, more sophisticated fitting surfaces can be used
support vectors consist of a subset of the training instances used to fit the model
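an sklearn sketch of the above (toy data, default hyperparameters): a linear-kernel SVM vs an RBF-kernel one, plus the support vectors that define the fit

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# linear kernel: hyperplanar surface, hinge loss, L2 penalty (C tunes it)
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)

# kernel trick: swap in an RBF kernel for a more flexible fitting surface
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X, y)

# the support vectors are the subset of training instances
# that actually determine the fitted surface
print(linear_svm.support_vectors_.shape)
```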
logistic regression
aka logit model; typically used for modeling binary classification probabilities
in simplest form:
- a simple linear regression model in a sigmoid wrapper: 1/(1+exp(-M)), where M is the linear regression model (ie linear hyperplane scalar field over the attribute phase space)
- this is a generalized linear model, under the transform log(p/(1-p)) = multiple_regression_model
under the logistic (log) loss function, the loss surface is convex, so steepest descent reaches the global optimum
can be regularized on coefficients of linear kernel, via L1 and/or L2
offers a linear model’s interpretability, along with a linear model’s drawbacks (eg sensitivity to collinearities)
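a hedged sklearn sketch of the pieces above (toy data; parameter choices are mine): the sigmoid wrapper around the linear score, and the penalty argument for L1/L2 regularization

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(m):
    # wraps the linear model M (a hyperplane score over the attributes)
    # into a probability: p = 1 / (1 + exp(-M))
    return 1.0 / (1.0 + np.exp(-m))

X, y = make_classification(n_samples=300, random_state=0)

# L1 and/or L2 regularization on the linear-kernel coefficients
# ("liblinear" supports both penalties; "saga" adds elasticnet)
clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

# the fitted probability is the sigmoid of the linear score,
# ie log(p / (1 - p)) is linear in the predictors
p = sigmoid(X @ clf.coef_.ravel() + clf.intercept_)
```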
hierarchical clustering
under some (cluster) metric, find the two closest clusters, and merge them; iterate
the cluster metric is called the linkage function
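a scipy sketch (random toy data): the method argument is the linkage function, and the merge tree can be cut into a chosen number of flat clusters

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.rand(50, 2)

# 'single', 'complete', 'average', 'ward' are common linkage functions,
# ie the cluster-to-cluster metric used to pick the two closest clusters
Z = linkage(X, method="average")

# cut the resulting merge tree into, say, 4 flat clusters
labels = fcluster(Z, t=4, criterion="maxclust")
```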
centroid clustering
each cluster is represented by its cluster center, or centroid
k-means method
choose starting centers for k clusters in the predictor phase space, then iterate (can be tuned over different k):
* assign each instance to the cluster it’s closest to
* calculate the centroid of each of the resulting clusters
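a minimal numpy sketch of those two iterated steps (my implementation, not a library call; fixed iteration count, no convergence test)

```python
import numpy as np

def k_means(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # starting centers: k randomly chosen training instances
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 1: assign each instance to the cluster whose center is closest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2: recompute the centroid of each resulting cluster
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels
```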
naive Bayes
for classification
generative; features are treated as giving evidence for or against target variable values; each instance gets its own probability distribution over the classes
allows instant updating, with new data (Bayesian property)
relies on Bayes’ rule, with the class (c) prior and the class-conditional likelihood of the instance E: p(C=c|E) = p(E|C=c)p(C=c) / p(E)
probability of class C=c, given instance E, where e_i are individual instance-predictor values or ranges:
- p(C=c|E) = p(e_1|c)…p(e_k|c)p(C=c) / p(E)
- this assumes strong conditional independence of the individual predictors, given the class
- without the independence assumption, p(C=c|E) is very hard to estimate (individual full instances E are too “sparse” in the training data)
p(E)
- can be difficult to compute accurately, so naive Bayes may leave it out, yielding a ranking classifier (ie relative class confidence vs true probabilities)
- however, a full formula does exist, which includes p(E)
further simplified (with p(E) decomposed), to put in terms of predictor lift: p(c|E) = p(e_1|c)…p(e_k|c)p(c) / p(e_1)…p(e_k)
remove near-zero- and zero-variance predictors; be careful with few-unique-value predictors (they give odd pdfs)
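an sklearn sketch (toy data): partial_fit shows the incremental-updating property, and predict_proba returns the per-class p(C=c|E) estimates

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, random_state=0)
nb = GaussianNB()

# the Bayesian property from the card: the model can be updated
# incrementally as new batches of data arrive
nb.partial_fit(X[:200], y[:200], classes=np.unique(y))
nb.partial_fit(X[200:], y[200:])

# per-class probability estimates p(C=c | E) for new instances
print(nb.predict_proba(X[:3]))
```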
non-parametric regression models
- no parametric form is assumed for the relationship between predictors and dependent variable
- the predictor does not take a predetermined form but is constructed according to information derived from the data; eg KNN, MARS
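a short kNN regression sketch (toy data, k chosen arbitrarily) to make the “constructed from the data” point concrete

```python
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=300, n_features=4, random_state=0)

# no parametric form is assumed; a new point's prediction is built from
# the data itself (here, the mean outcome of its 7 nearest neighbours)
knn = KNeighborsRegressor(n_neighbors=7).fit(X, y)
print(knn.predict(X[:3]))
```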
generalized linear models
- a family of models where a (link) function of the outcome variable follows a (basic) linear regression model
- eg log(p/(1-p)) follows a linear regression model in logistic regression
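a statsmodels sketch (simulated data, made-up coefficients): the Binomial family’s default logit link is exactly the log(p/(1-p)) transform above

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
p = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 1.2 * X[:, 1])))
y = rng.binomial(1, p)

# Binomial family with its default logit link: log(p/(1-p)) is
# modeled as a linear function of the predictors
glm = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()
print(glm.params)
```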
parsimonious model
a model that accomplishes the desired level of explanation or prediction with as few predictor variables as possible
linear regression (lr)
- aka OLS; fit a hyperplane to the outcome variable, using a least squares condition
- solves normal equations, and gives fit statistics (p-value on coefficients, overall R^2, etc.)
- does not handle collinearities well (ill-conditioned or non-invertible matrices)
- note
- for multiple predictors it’s called multiple linear regression
- for multiple outcome variables, it’s multivariate linear regression
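a small sketch (simulated data) of both routes: statsmodels OLS for the fit statistics named above, and the normal equations solved directly

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

# OLS reports the fit statistics named on the card
# (coefficient p-values, overall R^2, ...)
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.rsquared, ols.pvalues)

# the same coefficients via the normal equations (X'X) b = X'y
Xc = sm.add_constant(X)
beta = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)
```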
PCR (lr)
- PCA can be conducted prior to fitting a linear regression model, hence PCR
- some cutoff (eg in the scree plot) is set, to retain only the most “important” (orthogonal) PCA components, and the model is trained on them
- PCR “in the limit” (with enough retained components) tends to perform about as well as partial least squares (PLS)
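a pipeline sketch of PCR in sklearn (toy data; the 5-component cutoff is an arbitrary stand-in for a scree-plot choice)

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, random_state=0)

# retain only the leading (orthogonal) components, then regress on them
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
pcr.fit(X, y)
```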
PLS (lr)
- partial least squares, a supervised dimension reduction procedure; takes into account both (a) collinearities, and (b) component effect on outcome variable
- assumes a linear fit under the hood; the presence / abundance of nonlinear relationships can produce problems with PLS
- PLS may have trouble with non-informative predictors
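an sklearn sketch for comparison with the PCR example above (same toy setup, same arbitrary component count)

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=20, random_state=0)

# supervised dimension reduction: components are chosen for both
# predictor variance and covariance with the outcome variable
pls = PLSRegression(n_components=5).fit(X, y)
print(pls.score(X, y))  # R^2 of the underlying linear fit
```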