Machine Learning Flashcards
logistic regression
Below is an example logistic regression equation:
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
Where y is the predicted output, b0 is the bias or intercept term, and b1 is the coefficient for the single input value (x). Each column in your input data has an associated b coefficient (a constant real value) that must be learned from your training data.
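A minimal sketch of this equation in Python (numpy assumed; the values of b0 and b1 are illustrative, not learned):

    import numpy as np

    def predict_proba(x, b0=-1.0, b1=0.8):
        # In practice b0 and b1 are learned from training data.
        z = b0 + b1 * x
        return np.exp(z) / (1.0 + np.exp(z))  # equivalently 1 / (1 + e^(-z))

    print(predict_proba(np.array([0.0, 1.0, 2.0])))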
random forest
an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model.
Randomizes along two axes: each tree is trained on a bootstrap sample of the rows (drawn with replacement), and each split considers only a random subset of the features
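A short sketch with scikit-learn (assuming it is installed; the dataset and hyperparameters are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Each tree sees a bootstrap sample of the rows; each split considers
    # a random subset of the features (here sqrt of the feature count).
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
    clf.fit(X, y)
    print(clf.predict(X[:5]))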
Gini impurity
a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.
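A minimal computation of this measure (numpy assumed; the label lists are toy inputs):

    import numpy as np

    def gini_impurity(labels):
        # 1 - sum over classes of p_k^2, where p_k is the class proportion.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    print(gini_impurity([0, 0, 1, 1]))  # 0.5, maximally impure for two classes
    print(gini_impurity([0, 0, 0, 0]))  # 0.0, a pure node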
LDA
Latent Dirichlet allocation: a generative statistical model that allows sets of observations to be explained by unobserved groups (topics) that explain why some parts of the data are similar.
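A short topic-modeling sketch with scikit-learn (assuming it is installed; the corpus and topic count are toy values):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["cats purr and nap", "dogs bark and run", "stocks rise and markets fall"]
    X = CountVectorizer().fit_transform(docs)            # document-term counts
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    print(lda.transform(X))                              # per-document topic mixtures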
bias-variance tradeoff
the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set
The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
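A small numerical illustration (numpy assumed; the synthetic data and polynomial degrees are arbitrary): a low-degree fit underfits (high bias) while a very high-degree fit chases the training noise (high variance), which typically shows up as larger test error.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(n):
        x = rng.uniform(0, 1, n)
        return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

    x_train, y_train = sample(30)
    x_test, y_test = sample(200)

    for degree in (1, 3, 15):
        coefs = np.polyfit(x_train, y_train, degree)
        test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
        print(degree, round(test_mse, 3))  # degree 3 typically wins; 1 underfits, 15 overfits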
box-cox
power transformation
a data transformation technique used to stabilize variance and make the data more nearly normally distributed (defined for strictly positive values)
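A short sketch with SciPy (assuming it is installed; the skewed sample is synthetic):

    import numpy as np
    from scipy import stats

    # Box-Cox requires strictly positive data; lambda is chosen by maximum likelihood.
    data = np.random.default_rng(0).lognormal(size=1000)   # right-skewed, positive
    transformed, fitted_lambda = stats.boxcox(data)
    print(fitted_lambda)                                    # near 0 here, i.e. roughly a log transform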
stochastic gradient descent
a stochastic approximation of gradient descent optimization: an iterative method for minimizing an objective function written as a sum of differentiable functions, where each update uses the gradient of a single randomly chosen term (or a small batch) instead of the full sum
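A minimal sketch for linear least squares (numpy assumed; the step size and iteration count are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=200)

    w, lr = np.zeros(2), 0.05
    for _ in range(2000):
        i = rng.integers(len(X))                 # pick one example at random
        grad = 2 * (X[i] @ w - y[i]) * X[i]      # gradient of that single squared error
        w -= lr * grad
    print(w)                                     # approximately [2, -1]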
AIC
Akaike information criterion (AIC)
k = number of estimated parameters
L = maximized value of the likelihood function
AIC = 2k - 2*ln(L)
Leave-one-out cross-validation is asymptotically equivalent to AIC, for ordinary linear regression models
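A one-line computation of the criterion (the log-likelihood and parameter count below are made-up numbers):

    def aic(log_likelihood, k):
        # k = number of estimated parameters, log_likelihood = maximized log-likelihood
        return 2 * k - 2 * log_likelihood

    print(aic(log_likelihood=-120.0, k=3))  # 246.0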
BIC
Bayesian information criterion
BIC = k*ln(n) - 2*ln(L)
k = number of estimated parameters
L = maximized value of the likelihood function
n = number of observations
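The analogous computation (numpy assumed; the inputs are made-up numbers); note the ln(n) factor penalizes parameters more heavily than AIC once n exceeds about 7:

    import numpy as np

    def bic(log_likelihood, k, n):
        # k = number of parameters, n = number of observations
        return k * np.log(n) - 2 * log_likelihood

    print(bic(log_likelihood=-120.0, k=3, n=100))  # about 253.8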
DIC
deviance information criterion
D(theta) = -2*log(p(y | theta)) + C
D_bar = posterior mean of D(theta)
p_D = D_bar - D(theta_bar), where theta_bar is the posterior mean of theta, or p_D = (1/2)*var(D(theta))
DIC = p_D + D_bar
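A toy computation from posterior samples (numpy and SciPy assumed; the model is a normal mean with known scale, and the "posterior" samples are drawn directly rather than by MCMC):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    y = rng.normal(loc=1.0, scale=1.0, size=50)
    theta_samples = rng.normal(loc=y.mean(), scale=1.0 / np.sqrt(len(y)), size=4000)

    def deviance(theta):
        return -2 * stats.norm.logpdf(y, loc=theta, scale=1.0).sum()

    D = np.array([deviance(t) for t in theta_samples])
    D_bar = D.mean()
    p_D = D_bar - deviance(theta_samples.mean())   # effective number of parameters, ~1 here
    print(D_bar + p_D)                              # DIC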
SVM
constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier
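A short sketch with scikit-learn (assuming it is installed; the blob data and C value are illustrative):

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    # A linear SVM on separable data: the fitted hyperplane maximizes the margin
    # to the nearest points of each class (the support vectors).
    X, y = make_blobs(n_samples=100, centers=2, random_state=0)
    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    print(clf.coef_, clf.intercept_)        # hyperplane w.x + b = 0
    print(len(clf.support_vectors_))        # the points that define the margin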
kernel trick
The kernel trick avoids the explicit mapping that is needed to get linear learning algorithms to learn a nonlinear function or decision boundary. For all x and x_prime in the input space chi, certain functions k(x,x_prime) can be expressed as an inner product in another space V. The function k: chi x chi -> R is often referred to as a kernel or a kernel function.
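A small numerical check of this idea (numpy assumed): for the polynomial kernel k(x, x_prime) = (x . x_prime)^2 on 2-D inputs, the explicit map is phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), and the kernel returns the same inner product without ever forming phi.

    import numpy as np

    def phi(x):
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    x, x_prime = np.array([1.0, 2.0]), np.array([3.0, 0.5])
    print((x @ x_prime) ** 2)          # kernel evaluation: 16.0
    print(phi(x) @ phi(x_prime))       # same value via the explicit feature map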
Odds ratio logistic regression
ln(p(X) / (1 - p(X))) = b0 + b1 * X
The left side is the log odds (logit); p(X) / (1 - p(X)) is the odds, and exp(b1) is the odds ratio for a one-unit increase in X
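A small numerical check (numpy assumed; b0 and b1 are illustrative, not fitted):

    import numpy as np

    b0, b1 = -1.0, 0.7

    def odds(x):
        return np.exp(b0 + b1 * x)   # p(x) / (1 - p(x))

    # exp(b1) is the multiplicative change in the odds per unit increase in X.
    print(odds(1.0) / odds(0.0), np.exp(b1))   # both ~2.01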
Logistic regression assumptions
Binary output variable
No error in output variable y (remove outliers first)
Linear relationship between the inputs and the log odds of the output; transform inputs with a non-linear relationship first (Box-Cox, log, root)
Remove highly correlated inputs (check pairwise correlations; see the sketch after this list)
Fails to converge if inputs are highly collinear or the data are sparse
Observations independent
Large sample
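A minimal collinearity check, as referenced above (numpy assumed; the synthetic data and 0.8 threshold are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    X[:, 3] = 0.95 * X[:, 0] + rng.normal(scale=0.1, size=200)   # near-duplicate column

    corr = np.corrcoef(X, rowvar=False)
    for i in range(corr.shape[0]):
        for j in range(i + 1, corr.shape[1]):
            if abs(corr[i, j]) > 0.8:
                print(f"columns {i} and {j} are highly correlated: {corr[i, j]:.2f}")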
Logit
The inverse of the logistic (sigmoid) function
Equals the log odds, logit(p) = ln(p / (1 - p)), when the sigmoid output p represents a probability
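A quick check that the logit inverts the sigmoid (numpy assumed; the input values are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logit(p):
        return np.log(p / (1.0 - p))   # log odds

    z = np.array([-2.0, 0.0, 3.0])
    print(logit(sigmoid(z)))           # recovers z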