Chapter 4 - Classification Flashcards
Logistic Regression, LDA, QDA, KNN
Overview (type of learning method, prediction error metric, how different from linear)
Supervised learning with a qualitative (categorical) response. The Bayes classifier minimizes the 0-1 loss: given x0 we predict the response yhat0 to be the most likely class. However, if Y is qualitative we cannot simply use a linear model: fitted values can fall outside [0, 1], and with more than two classes the numeric coding imposes an arbitrary ordering.
Classification Methods
Logistic Regression, LDA, QDA, KNN-1, KNN-CV
Logistic Regression
A way to model the conditional probability P(Y = 1 | X), where Y is binary. The model is log(P(Y=1|X) / P(Y=0|X)) = B0 + B1X1 + … + BpXp. The left side of the equation is called the log odds (logit). What do the coefficients mean? For every one-unit increase in Xi, the log odds increase by Bi (equivalently, the odds are multiplied by e^Bi).
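A minimal sketch in R of how the coefficients translate into probabilities (the numbers B0 = -10 and B1 = 0.005 are made up for illustration):

b0 <- -10; b1 <- 0.005            # assumed coefficients: log odds = b0 + b1 * x
x <- 2000
log_odds <- b0 + b1 * x           # equals 0 here, i.e. odds of 1
plogis(log_odds)                  # inverse logit exp(z)/(1+exp(z)) = 0.5
# a one-unit increase in x adds b1 to the log odds, multiplying the odds by exp(b1)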
Fitting Logistic Regression
We cannot observe the log odds, so we cannot fit by least squares.
Solution: we work with a quantity called the likelihood and choose the beta estimates that maximize it. This is solved with numerical methods (Newton's method). In R: glm(…, family = binomial).
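A minimal fitting sketch on simulated data (the variable names balance and default are made up to echo the book's example; this is not the book's data set):

set.seed(1)
n <- 500
balance <- runif(n, 0, 2500)
default <- rbinom(n, 1, plogis(-10 + 0.005 * balance))      # simulate a binary response
fit <- glm(default ~ balance, family = binomial)            # maximum likelihood via Newton-type steps
coef(fit)                                                   # estimates of B0 and B1
predict(fit, newdata = data.frame(balance = 1500), type = "response")  # Phat(Y = 1 | balance = 1500)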
Estimating Standard Error in Logistic Regression
There is a z-statistic, the analogue of the t-statistic in linear regression. The p-values test the null hypothesis that a coefficient is zero (Wald test). Other possible hypothesis tests: the likelihood ratio test (a chi-squared test).
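Continuing the simulated sketch above, how the two tests show up in R:

summary(fit)$coefficients                       # z value = estimate / standard error (Wald test), with p-values
fit0 <- glm(default ~ 1, family = binomial)     # intercept-only (null) model
anova(fit0, fit, test = "Chisq")                # likelihood ratio test: the drop in deviance compared to a chi-squared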
Multinomial Logistic Regression
Suppose Y takes values in {1, 2, …, K}. We then use a linear model for the log odds of each class against a baseline category. The fitted probabilities do not change with the choice of baseline, but the interpretation of the coefficients does.
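One way to fit this in R is nnet::multinom; a sketch on the built-in iris data (not the book's example):

library(nnet)
# Species has three levels; the first level, "setosa", is used as the baseline
fit_multi <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
summary(fit_multi)                        # one row of coefficients per non-baseline class (log odds vs. "setosa")
head(predict(fit_multi, type = "probs"))  # fitted probabilities do not depend on which baseline was chosen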
Issues with logistic regression (2)
1) With collinear predictors the coefficients become unidentifiable. 2) When the classes are well separated, the coefficient estimates become unstable (they blow up toward infinity); logistic regression behaves better when the classes overlap.
LDA
linear discriminant analysis. Instead of estimating P(Y|X), we will estimate:
1) Phat(X|Y) - given the response, what is the distribution of inputs
2) Phat(Y) - how likely are each of the categories
Now we use Bayes' rule to obtain the estimate:
Phat(Y = k | X = x) = Phat(X = x | Y = k) Phat(Y = k) / sum_j [ Phat(X = x | Y = j) Phat(Y = j) ]
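A tiny numeric sketch of the Bayes rule step in R (one predictor; the priors and normal densities are made up):

pri  <- c(0.3, 0.7)                       # Phat(Y = 1), Phat(Y = 2)
x0   <- 1.5
dens <- c(dnorm(x0, mean = 0, sd = 1),    # fhat_1(x0)
          dnorm(x0, mean = 2, sd = 1))    # fhat_2(x0)
post <- pri * dens / sum(pri * dens)      # Phat(Y = k | X = x0) for k = 1, 2
post                                      # predict the class with the largest posterior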
Estimating the two parts of LDA
1) We model Phat(X = x | Y = k) = fhat_k(x) with a multivariate normal density, so each class forms an oval-shaped cloud. The boundaries we draw between the classes are the LDA decision boundaries.
2) Phat(Y = k) = pi_hat_k, the fraction of the training responses equal to k.
Why does LDA have linear decision boundaries? (what are mu_k and bold Sigma?)
Suppose we know P(Y = k) = pi_k exactly, and we know that P(X = x | Y = k) is exactly multivariate normal, where mu_k is the mean of the inputs for category k and bold Sigma is the covariance matrix, assumed common to all categories (the shape and orientation of the ovals is the same for each k). The derivation that ensues shows that the objective (discriminant) function delta_k(x) is linear in x, so the decision boundary is linear.
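For reference, the discriminant function the derivation produces (the x^T Sigma^{-1} x term is shared by every class and cancels) is
delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log(pi_k),
which is linear in x; the boundary between classes k and j is the set where delta_k(x) = delta_j(x), i.e. a hyperplane.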
How do we estimate the decision boundaries in LDA? (estimate pi_k, mu_k, bold Sigma)
Estimate pi_hat_k = #{i : y_i = k} / n.
Estimate mu_hat_k, the center of class k, as the average input over all training points that belong to class k.
Estimate the common covariance matrix by computing the deviation vectors (x_i - mu_hat_k) within each class and pooling them into an unbiased estimate of the covariance matrix.
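A sketch of those estimates by hand next to MASS::lda, on simulated one-predictor data (all names are made up):

library(MASS)
set.seed(2)
n <- 200
y <- factor(sample(c("A", "B"), n, replace = TRUE, prob = c(0.4, 0.6)))
x <- ifelse(y == "A", rnorm(n, mean = 0), rnorm(n, mean = 2))
pi_hat <- table(y) / n                                        # pi_hat_k = #{i : y_i = k} / n
mu_hat <- tapply(x, y, mean)                                  # class centers mu_hat_k
sigma2_hat <- sum((x - mu_hat[as.character(y)])^2) / (n - 2)  # pooled (common) variance, unbiased with n - K
fit_lda <- lda(y ~ x)                                         # MASS::lda performs the same estimation
fit_lda$prior; fit_lda$means                                  # should match pi_hat and mu_hat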
QDA
Quadratic discriminant analysis: the classes are no longer assumed to share a covariance matrix, so we estimate mu_k and Sigma_k for each class. Given an input, the derivation yields an objective (discriminant) function slightly longer than LDA's; one term contains x^T Sigma_k^{-1} x, so the decision boundaries are quadratic in x.
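MASS::qda has the same interface; a sketch reusing the simulated x and y from the LDA card above:

fit_qda <- qda(y ~ x)                          # a separate mean and covariance estimated for each class
pred <- predict(fit_qda, data.frame(x = 1.0))
pred$class                                     # class with the largest quadratic discriminant
pred$posterior                                 # the corresponding posterior probabilities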
How do we evaluate classification methods?
1) 0 - 1 loss (misclassification rate)
2) However, a method can misclassify some classes much more often than others, so we also examine the confusion matrix (false positives and false negatives along with true positives and true negatives).
If the error rate for a particular class is too high, you can change the classification threshold, but that changes the error rates of the other classes. SOLUTION: the ROC curve.
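A sketch of the confusion matrix at two thresholds, reusing fit and default from the logistic regression sketch earlier:

p_hat <- predict(fit, type = "response")            # fitted Phat(Y = 1 | X) on the training data
table(predicted = p_hat > 0.5, actual = default)    # default 0.5 threshold
table(predicted = p_hat > 0.2, actual = default)    # lower threshold: fewer false negatives, more false positives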
ROC curve
Receiver operating characteristic: a plot of the true positive rate against the false positive rate, where each point corresponds to a different threshold. It displays the performance of the method across all thresholds. The AUC (area under the curve) measures classifier quality: 0.5 means a coin flip, and the closer the AUC is to 1 the better. It should never be worse than 0.5 (if it is, just flip the predictions, i.e. reflect the curve over y = x).
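A manual ROC/AUC sketch built from the same p_hat and default as above (no extra packages; the threshold grid is arbitrary):

thresholds <- seq(0, 1, by = 0.01)
tpr <- sapply(thresholds, function(t) mean(p_hat[default == 1] > t))   # true positive rate at each threshold
fpr <- sapply(thresholds, function(t) mean(p_hat[default == 0] > t))   # false positive rate at each threshold
plot(fpr, tpr, type = "l"); abline(0, 1, lty = 2)                      # dashed line = coin-flip classifier
# AUC by the trapezoid rule (reverse so fpr runs from 0 up to 1)
sum(diff(rev(fpr)) * (head(rev(tpr), -1) + tail(rev(tpr), -1)) / 2)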
Loss function considerations
Most regression problems try to minimize MSE and most classification problems try to minimize 0-1 loss. However, the Default example showed that we may care about certain types of error (e.g., false negatives) more than the overall error rate. We can tune our supervised learning methods toward the loss we actually care about (e.g., choose the threshold that brings the false negative rate down to an acceptable level, even at the cost of more false positives).
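A sketch of tuning the threshold against the error we actually care about, reusing p_hat, default, and thresholds from above (the 5% false negative target is an arbitrary choice):

fnr <- sapply(thresholds, function(t) mean(p_hat[default == 1] <= t))  # false negative rate at each threshold
best <- max(thresholds[fnr <= 0.05])                                   # largest threshold keeping the FNR at or below 5%
table(predicted = p_hat > best, actual = default)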