Machine Learning Flashcards

1
Q

logistic regression

A

Below is an example logistic regression equation:

y = e^(b0 + b1x) / (1 + e^(b0 + b1x))

Where y is the predicted output, b0 is the bias or intercept term and b1 is the coefficient for the single input value (x). Each column in your input data has an associated b coefficient (a constant real value) that must be learned from your training data.
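
A minimal sketch of this equation in code (NumPy only; the values of b0 and b1 here are made-up for illustration, not fitted coefficients):

import numpy as np

def predict(x, b0=-1.0, b1=0.8):
    # logistic regression prediction for a single input x
    z = b0 + b1 * x
    return np.exp(z) / (1 + np.exp(z))  # equivalently 1 / (1 + np.exp(-z))

print(predict(2.5))  # predicted probability that y = 1 when x = 2.5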

2
Q

random forest

A

an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model.

Injects randomness in two ways: each tree is trained on a bootstrap sample of the rows (drawn with replacement), and each split considers only a random subset of the features.
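
A hedged scikit-learn sketch (assuming sklearn is available; the toy data are hypothetical):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 100 trees, each fit on a bootstrap sample of the rows,
# with a random subset of features considered at every split
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))  # class = mode of the individual trees' votes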

3
Q

Gini impurity

A

a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.
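
Equivalently, Gini = 1 - sum_i p_i^2, where p_i is the fraction of items with label i. A minimal sketch in code (label counts from a hypothetical node):

from collections import Counter

def gini_impurity(labels):
    # Gini impurity: 1 - sum_i p_i^2 over the label distribution
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["a", "a", "b", "b"]))  # 0.5 (maximally mixed for two classes)
print(gini_impurity(["a", "a", "a", "a"]))  # 0.0 (pure node)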

4
Q

LDA

A

Latent Dirichlet Allocation: a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar (in topic modeling, documents are explained by unobserved topics).
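
A minimal sketch under this reading of LDA as Latent Dirichlet Allocation, using scikit-learn on a hypothetical toy corpus:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats purr and sleep", "dogs bark and sleep", "stocks rise and fall"]  # toy corpus
counts = CountVectorizer().fit_transform(docs)

# two unobserved "topics" (groups) assumed to explain the word co-occurrences
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics)  # per-document topic mixture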

5
Q

bias variance tradeoff

A

the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set

The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).

The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
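
For squared-error loss, the standard decomposition of the expected test error makes the tradeoff explicit:

E[(y - f_hat(x))^2] = Bias[f_hat(x)]^2 + Var[f_hat(x)] + sigma^2

where sigma^2 is the irreducible error (noise in y itself).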

6
Q

box-cox

A

power transformation

a useful data transformation technique used to stabilize variance and make the data more normal distribution-like
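
The one-parameter transform is y(lambda) = (y^lambda - 1) / lambda for lambda != 0, and ln(y) for lambda = 0 (y must be positive). A short sketch using SciPy, which also estimates lambda by maximum likelihood (the toy data are hypothetical):

import numpy as np
from scipy import stats

y = np.random.default_rng(0).lognormal(size=1000)  # skewed, strictly positive toy data
y_transformed, fitted_lambda = stats.boxcox(y)     # variance-stabilized, more normal-like
print(fitted_lambda)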

7
Q

stochastic gradient descent

A

a stochastic approximation of gradient descent: an iterative method for minimizing an objective function written as a sum of differentiable functions, where each step uses the gradient of a single randomly chosen term (or a small mini-batch) instead of the full sum
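
A minimal sketch of the idea for least-squares linear regression (one example per update; the learning rate and toy data are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

w, lr = np.zeros(3), 0.01
for epoch in range(20):
    for i in rng.permutation(len(X)):      # visit examples in random order
        grad = (X[i] @ w - y[i]) * X[i]    # gradient of one term of the summed loss
        w -= lr * grad                     # noisy step instead of the full-batch gradient
print(w)  # should approach [1.0, -2.0, 0.5]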

8
Q

AIC

A

Akaike information criterion (AIC)

k = number of estimated parameters
L = maximized value of the likelihood function

AIC = 2k - 2 ln(L)

Leave-one-out cross-validation is asymptotically equivalent to AIC for ordinary linear regression models.
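
A tiny illustration of the formula in code (the log-likelihood value here is hypothetical, e.g. taken from a fitted model):

def aic(k, log_likelihood):
    # AIC = 2k - 2 ln(L), with log_likelihood = ln(L)
    return 2 * k - 2 * log_likelihood

print(aic(k=3, log_likelihood=-120.5))  # 247.0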

9
Q

BIC

A

Bayesian information criterion

BIC = ln(n) k - 2 ln(L)

k = number of estimated parameters
L = maximized value of the likelihood function
n = number of observations
10
Q

DIC

A

deviance information criterion

D(theta) = -2 log(p(y | theta)) + C   (C is a constant that cancels when comparing models)

p_D = D_bar - D(theta_bar)   (D_bar = posterior mean deviance, theta_bar = posterior mean of theta)
or p_D = (1/2) var(D(theta))

DIC = p_D + D_bar

11
Q

SVM

A

Constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.
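
A hedged scikit-learn sketch (toy data; a linear kernel keeps the hyperplane explicit):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1.0)  # maximizes the margin around the separating hyperplane
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # the hyperplane w.x + b = 0
print(clf.support_vectors_[:3])    # nearest training points, which define the margin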

12
Q

kernel trick

A

The kernel trick avoids the explicit mapping that is needed to get linear learning algorithms to learn a nonlinear function or decision boundary. For all x and x_prime in the input space chi, certain functions k(x,x_prime) can be expressed as an inner product in another space V. The function k: chi x chi -> R is often referred to as a kernel or a kernel function.
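
A concrete check of this (a sketch, not from the card): the degree-2 polynomial kernel k(x, z) = (x.z)^2 equals an ordinary inner product after the explicit feature map phi(x) = (x1^2, x2^2, sqrt(2) x1 x2), which the trick lets us avoid computing:

import numpy as np

def phi(v):  # explicit map into the higher-dimensional space V
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(x, z) ** 2)       # kernel evaluated directly in the input space: 16.0
print(np.dot(phi(x), phi(z)))  # same value via the explicit inner product in V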

13
Q

Odds ratio logistic regression

A

ln(p(X) / (1 - p(X))) = b0 + b1 * X

The left-hand side is the log-odds (the logit); p(X) / (1 - p(X)) is the odds. Exponentiating the coefficient, e^b1, gives the odds ratio for a one-unit increase in X.

14
Q

Logistic regression assumptions

A

Binary output variable

Little or no error in the output variable y (remove outliers first)

Linear relationship between the inputs and the log-odds of the output; transform non-linear inputs first (Box-Cox, log, root)

Remove highly correlated inputs (e.g., check pairwise correlations)

May fail to converge if inputs are highly collinear or the data are sparse

Observations are independent

Large sample size

15
Q

Logit

A

Inverse of the logistic (sigmoid) function: logit(p) = ln(p / (1 - p))

Equal to the log-odds when the logistic output p represents a probability

16
Q

Probit model

A

type of regression where the dependent variable can take only two values (for example, married or not married); the probability is modeled with the cumulative distribution function of the standard normal distribution rather than the logistic function

17
Q

k means

A

k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster
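
A short scikit-learn sketch (toy blobs; n_clusters is the k above):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the cluster means (prototypes)
print(km.labels_[:10])      # nearest-mean assignment for each observation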

18
Q

extra trees

A

Extremely randomized trees: the same as a random forest, except that each candidate split threshold is drawn at random from the feature's range (the best of these random splits is kept), and by default each tree is trained on the whole training set rather than a bootstrap sample.
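
In scikit-learn the difference is a one-line swap from the random forest example (hypothetical toy data):

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# like RandomForestClassifier, but split thresholds are drawn at random
# for each candidate feature instead of being optimized exactly
clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.score(X, y))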