Classification Flashcards
What is classification?
Approaches for predicting qualitative (categorical) responses
Why isn’t linear regression appropriate for classification problems?
Linear regression implies a specific ordering of the outcomes and that the gaps between adjacent outcomes are equal. But qualitative outcomes have no natural ordering: you could order them however you want, and different orderings would produce different linear regression results.
Also, even with a binary outcome, linear regression can produce estimates outside the two possible values (below 0 or above 1), so they cannot be read as probabilities.
1) regression methods cannot accommodate a qualitative response with more than two classes
2) regression methods will not provide meaningful estimates of Pr(Y|X), even with just two classes.
What does logistic regression model?
The probability that Y belongs to a particular category
What is the logistic function?
p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))
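As a quick illustration, here is a minimal Python sketch of this function (the β values are made up for the example, not estimates from real data):

```python
import numpy as np

def logistic(x, beta0=-3.0, beta1=1.5):
    """p(X) = e^(β0 + β1·X) / (1 + e^(β0 + β1·X)); betas are illustrative."""
    z = beta0 + beta1 * x
    return np.exp(z) / (1 + np.exp(z))

# The output is always strictly between 0 and 1, whatever the input.
print(logistic(np.array([-10.0, 0.0, 2.0, 10.0])))
```

For very large inputs np.exp can overflow; the algebraically equivalent form 1 / (1 + e^(-z)) is the numerically safer choice in practice.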
What is used to fit a logistic regression model?
Maximum likelihood
What is maximum likelihood estimation?
MLE estimates the parameters of a model so that the predicted probabilities of the outcome are as close as possible to the actual outcomes. For a binary outcome variable, it tries to make each predicted probability as close as possible to that observation's actual 0 or 1.
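A sketch of what that means in code, using synthetic data and scipy to minimize the negative log-likelihood (all numbers here are invented for the demonstration):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(size=200)
p_true = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))  # invented "true" model
y = rng.binomial(1, p_true)                  # observed 0/1 outcomes

def neg_log_likelihood(beta):
    """Negative log-likelihood of a logistic model for binary y."""
    p = 1 / (1 + np.exp(-(beta[0] + beta[1] * x)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# MLE: choose the betas that make the observed outcomes most probable,
# i.e. push each predicted probability toward its observation's 0 or 1.
fit = minimize(neg_log_likelihood, x0=np.zeros(2))
print(fit.x)  # should land near the invented (0.5, 2.0)
```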
What is the difference between generative and discriminative classifiers?
Discriminative classifiers learn which features in the input are most useful for distinguishing between the various possible classes. For example, if given images of dogs and cats, and all dog images have a collar, a discriminative model will learn that having a collar means the image is of a dog. Mathematically, they directly calculate the posterior probability P(y | x). (Logistic regression is an example of a discriminative classifier.)
Generative classifiers model how a particular class would generate input data. When a new observation is given to these classifiers, they try to predict which class would have most likely generated the given observation. Mathematically, generative models try to learn the joint probability distribution, p(x,y), of the inputs x and label y, and make their prediction using Bayes rule to calculate the conditional probability, p(y|x), and then picking a most likely label. Thus, it tries to learn the actual distribution of the class.
Generative models try to recreate the process that generated the data by estimating its assumptions and distributions. Discriminative models are built based only on the observed data and make fewer assumptions about the distribution of the data.
Bayes' Theorem: P(A | B) = P(B | A) * P(A) / P(B) (generative models try to capture P(B | A) to get at P(A | B))
Both methods use conditional probability to classify but learn different types of probabilities to generate conditional probability. Discriminative classifiers try to directly solve the classification task, rather than trying to solve a general problem as an intermediate step as generative models do.
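To make the contrast concrete, a small scikit-learn sketch fitting one classifier of each kind to the same synthetic data; GaussianNB (generative) learns per-class distributions and applies Bayes' rule, while LogisticRegression (discriminative) learns P(y | x) directly:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB            # generative: models p(x | y) and p(y)
from sklearn.linear_model import LogisticRegression   # discriminative: models p(y | x) directly

rng = np.random.default_rng(1)
# Two classes with different means; purely synthetic data.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

gen = GaussianNB().fit(X, y)           # learns per-class Gaussians, applies Bayes' rule
disc = LogisticRegression().fit(X, y)  # learns the posterior boundary directly

# Both expose P(y | x), but they arrive at it differently.
x_new = np.array([[1.0, 1.0]])
print(gen.predict_proba(x_new), disc.predict_proba(x_new))
```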
What is Bayes’ Theorem?
It is an alternate way to calculate conditional probability. The standard formula for conditional probability is:
Conditional probability:
P(A | B) = P(A,B) / P(B)
P(B | A) = P(A,B) / P(A)
Joint probability:
P(A,B) = P(A | B) * P(B) or P(B | A) * P(A)
Thus, P(A | B) can be expressed via P(B | A). Specifically…
P(A | B) = P(B | A) * P(A) / P(B) = Bayes’ Theorem
This means that when we want to calculate the conditional probability and the joint probability is challenging to calculate, we can use the reverse conditional probability if it is available.
Bayes' Theorem: a principled way of calculating a conditional probability without the joint probability
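A tiny numeric check of that identity, with made-up probabilities:

```python
# Made-up numbers purely to verify the algebra of Bayes' theorem.
p_a = 0.3          # P(A)
p_b_given_a = 0.8  # P(B | A)
p_b = 0.5          # P(B)

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.48

# Same answer via the joint probability P(A, B) = P(B | A) * P(A)
p_joint = p_b_given_a * p_a
print(p_joint / p_b)  # 0.48
```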
Name the parts of Bayes’ Theorem.
P (A | B) = P(B | A) * P(A) / P(B)
Posterior probability
prior probability
likelihood
evidence
Posterior probability = P (A | B), How probable is our hypothesis given the observed evidence? (not directly computable)
Prior probability = P(A), How probable was our hypothesis before observing the evidence?
Likelihood = P(B | A), How probable is the evidence given that our hypothesis is true?
Evidence = P(B), How probable is the new evidence under all possible hypotheses?
Posterior = Likelihood * Prior / Evidence
For example, what is the probability that there is a fire given that there is smoke?
P(Fire | Smoke) = P(Smoke | Fire) * P(Fire) / P(Smoke)
Probability of fire given smoke = the chance of smoke occurring together with a fire, divided by the chance of smoke from any cause
Posterior = joint probability / marginal probability (the numerator P(Smoke | Fire) * P(Fire) equals the joint probability P(Smoke, Fire); the denominator P(Smoke) is the marginal probability)
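The smoke example as a sketch, with invented numbers for the three right-hand-side quantities:

```python
# Invented probabilities, for illustration only.
p_fire = 0.01             # prior: P(Fire)
p_smoke_given_fire = 0.9  # likelihood: P(Smoke | Fire)
p_smoke = 0.10            # evidence: P(Smoke), from any cause

# Bayes' theorem: posterior = likelihood * prior / evidence
p_fire_given_smoke = p_smoke_given_fire * p_fire / p_smoke
print(p_fire_given_smoke)  # 0.09: with these numbers, smoke alone is weak evidence of fire
```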
What is the posterior probability?
the probability that an observation belongs to the kth class, given the predictor value for that observation
Pr(Y = k | X = x)
What formula is referred to as the “odds”?
If the odds of defaulting are 1/4, how many people will default on average?
p(x) / (1 - p(x))
The ratio of something occurring vs. not occurring
1 out of 5 will default, because odds of 1/4 means p(x) / (1 - p(x)) = 1/4, which solves to p(x) = 0.2
Since p(x) = 0.2 is the probability of defaulting, on average 1 in 5 people default
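A quick sketch of converting between odds and probability, using the default example above:

```python
def odds_to_prob(odds):
    # p / (1 - p) = odds  =>  p = odds / (1 + odds)
    return odds / (1 + odds)

def prob_to_odds(p):
    return p / (1 - p)

print(odds_to_prob(0.25))  # 0.2: odds of 1/4 mean 1 in 5 default
print(prob_to_odds(0.2))   # 0.25: back to odds of 1/4
```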
Why do we use the logistic function in logistic regression?
Because we need to model the relationship between predictor and outcome with a function that gives outputs between 0 and 1.
Why do we need generative models? Why not just always use discriminative models for classification?
- When there is substantial separation between the two classes, the parameter estimates for logistic regression can be unstable. This is not so for generative models (see the sketch after this list).
- If the distribution of the predictors X is approximately normal in each of the classes and the sample size is small, then generative approaches may be more accurate than logistic regression.
- Generative methods can be naturally extended to the case of more than two response classes
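As a hedged illustration of the first point, the sketch below fits unpenalized logistic regression and LDA to a perfectly separated toy data set (it assumes scikit-learn 1.2+, where penalty=None is accepted):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# One predictor, perfectly separated classes (synthetic toy data).
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Unpenalized logistic regression: under perfect separation the MLE does
# not exist, so the optimizer drives the coefficient toward infinity.
logit = LogisticRegression(penalty=None, max_iter=10_000).fit(X, y)
lda = LinearDiscriminantAnalysis().fit(X, y)

print(logit.coef_)  # very large: unstable estimate
print(lda.coef_)    # moderate: based on class means and pooled variance
```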
What is linear discriminant analysis (LDA)?
A method that approximates the Bayes classifier by plugging in estimates for its parameters (the class priors, class means, and shared variance).
In the absence of additional information, LDA estimates the prior probability that an observation belongs to the kth class using the proportion of the training observations that belong to the kth class.
A discriminant rule tries to divide the data space into K disjoint regions that represent all the classes of an outcome variable. With these regions, classification by discriminant analysis simply means that we allocate x to class j if x is in region j.
LDA attempts to find a linear boundary that separates predictor values by the class of the outcome variable. It finds the direction that maximizes the distance between the class means while minimizing the variance (scatter) within each class.
As a byproduct, it reduces the dimensionality of the data: observations can be projected onto at most K - 1 discriminant directions.
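A minimal LDA sketch with scikit-learn on synthetic three-class data; note that the priors come out as the class proportions, as described above:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
# Three classes drawn from Gaussians with different means (synthetic).
X = np.vstack([rng.normal(m, 1.0, (50, 2)) for m in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 50)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.priors_)                # estimated priors = class proportions (1/3 each here)
print(lda.predict([[3.0, 3.0]]))  # allocated to the class whose region contains the point
print(lda.transform(X).shape)     # (150, 2): at most K - 1 = 2 discriminant directions
```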
What is a “decision boundary”?
A decision boundary is the set of predictor values where two discriminant functions have the same value. In other words, any observation that falls on the decision boundary is equally likely to have come from either of those two classes.
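For a feel of this, a sketch of the one-predictor, two-class, equal-variance, equal-prior case, where the LDA decision boundary works out to the midpoint of the two class means (a standard result; the numbers are made up):

```python
# Two classes with invented means and a shared variance.
mu0, mu1, sigma2 = 1.0, 5.0, 1.0

# With equal priors, the discriminant functions
#   delta_k(x) = x * mu_k / sigma2 - mu_k**2 / (2 * sigma2)
# take the same value exactly at the midpoint of the means.
boundary = (mu0 + mu1) / 2
print(boundary)  # 3.0: points below go to class 0, points above to class 1

def delta(x, mu):
    return x * mu / sigma2 - mu**2 / (2 * sigma2)

print(delta(boundary, mu0) == delta(boundary, mu1))  # True: equal on the boundary
```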