Classification Flashcards

1
Q

What is classification?

A

approaches for predicting qualitative responses

2
Q

Why isn’t linear regression appropriate for classification problems?

A

Linear regression implies a specific ordering of the outcomes and that the differences between adjacent outcomes are the same. Qualitative outcomes can be ordered arbitrarily, and different orderings would produce different results in linear regression.

Also, with a binary outcome, linear regression can produce estimates outside of the two possibilities (i.e., outside [0, 1] when the outcome is coded as 0/1).

1) regression methods cannot accommodate a qualitative response with more than two classes
2) regression methods will not provide meaningful estimates of Pr(Y|X), even with just two classes.

3
Q

What does logistic regression model?

A

The probability that Y belongs to a particular category

4
Q

What is the logistic function?

A

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))
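
In code, a minimal sketch (with made-up coefficients β0 and β1) shows that the output always falls between 0 and 1:

import numpy as np

def logistic(x, beta0, beta1):
    # Logistic function: maps the linear predictor beta0 + beta1*x into (0, 1)
    return np.exp(beta0 + beta1 * x) / (1 + np.exp(beta0 + beta1 * x))

# Hypothetical coefficients; outputs stay strictly between 0 and 1
print(logistic(np.array([-10.0, 0.0, 10.0]), beta0=0.5, beta1=0.8))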

5
Q

What is used to fit a logistic regression model?

A

Maximum likelihood

6
Q

What is maximum likelihood estimation?

A

MLE estimates the parameters of a model so that the predicted probabilities of the outcome are as close as possible to the actual outcomes. For a binary outcome variable, it tries to make each predicted probability as close to 1 or 0 as possible, matching the observed class.
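
A minimal sketch of the idea, using made-up data: the log-likelihood is the quantity MLE maximizes, and it is largest when each predicted probability sits close to its observed 0/1 outcome.

import numpy as np

def log_likelihood(beta0, beta1, x, y):
    # Predicted probabilities from a logistic regression with these coefficients
    p = 1 / (1 + np.exp(-(beta0 + beta1 * x)))
    # Sum of the log-probabilities assigned to the observed outcomes
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical data; MLE searches for the (beta0, beta1) that maximize this value
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
print(log_likelihood(-2.5, 1.0, x, y))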

7
Q

What is the difference between generative and discriminative classifiers?

A

Discriminative classifiers learn which features in the input are most useful for distinguishing between the possible classes. For example, if given images of dogs and cats, and all dog images have a collar, a discriminative model will learn that having a collar means the image is of a dog. Mathematically, they directly estimate the posterior probability P(y | x). (Logistic regression is an example of a discriminative classifier.)

Generative classifiers model how a particular class would generate input data. When a new observation is given to these classifiers, they try to predict which class would most likely have generated it. Mathematically, generative models try to learn the joint probability distribution p(x, y) of the inputs x and label y, use Bayes' rule to calculate the conditional probability p(y | x), and then pick the most likely label. Thus, they try to learn the actual distribution of each class.

Generative models try to recreate the process that generated the data by estimating its assumptions and distributions. Discriminative models are built only from the observed data and make fewer assumptions about the distribution of the data.

Bayes' Theorem: P(A | B) = P(B | A) * P(A) / P(B) (generative models try to capture P(B | A) to get at P(A | B))

Both approaches use conditional probability to classify, but they learn different kinds of probabilities to get there. Discriminative classifiers try to solve the classification task directly, rather than solving a more general problem as an intermediate step the way generative models do.
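
A rough side-by-side sketch, assuming scikit-learn is available and using synthetic data: logistic regression estimates P(y | x) directly, while Gaussian naive Bayes estimates P(x | y) and P(y) and combines them with Bayes' rule.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels for illustration

discriminative = LogisticRegression().fit(X, y)   # models P(y | x) directly
generative = GaussianNB().fit(X, y)               # models P(x | y) and P(y)

print(discriminative.predict_proba(X[:3]))
print(generative.predict_proba(X[:3]))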

8
Q

What is Bayes’ Theorem?

A

It is an alternate way to calculate conditional probability. The standard formula for conditional probability is:

Conditional probability
P(A | B) = P(A,B) / P(B)
P(B | A) = P(A,B) / P(A)

Joint probability:
P(A,B) = P(A | B) * P(B) or P(B | A) * P(A)

Thus, P(A | B) can be expressed via P(B | A). Specifically…

P(A | B) = P(B | A) * P(A) / P(B) = Bayes’ Theorem

This means that when we want to calculate the conditional probability and the joint probability is challenging to calculate, we can use the reverse conditional probability if it is available.

Bayes' Theorem: a principled way of calculating a conditional probability without needing the joint probability.

9
Q

Name the parts of Bayes’ Theorem.

P (A | B) = P(B | A) * P(A) / P(B)

Posterior probability
prior probability
likelihood
evidence

A

Posterior probability = P(A | B): How probable is our hypothesis given the observed evidence? (not directly computable)
Prior probability = P(A): How probable was our hypothesis before observing the evidence?
Likelihood = P(B | A): How probable is the evidence given that our hypothesis is true?
Evidence = P(B): How probable is the new evidence under all possible hypotheses?

Posterior = Likelihood * Prior / Evidence

For example, what is the probability that there is a fire given that there is smoke?

P(Fire | Smoke) = P(Smoke | Fire) * P(Fire) / P(Smoke)
Probability of fire given smoke = the chance that there is smoke together with a fire, divided by the chance that there is smoke at all
Posterior = joint probability (of smoke and fire) / marginal probability (of smoke)
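
A toy numeric illustration (these probabilities are invented purely for the arithmetic): if P(Fire) = 0.01, P(Smoke) = 0.10, and P(Smoke | Fire) = 0.90, then

P(Fire | Smoke) = P(Smoke | Fire) * P(Fire) / P(Smoke) = 0.90 * 0.01 / 0.10 = 0.09

so seeing smoke raises the probability of fire from 1% to 9%.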

10
Q

What is the posterior probability?

A

the probability that an observation belongs to the kth class, given the predictor value for that observation

Pr(Y = k | X = x)

11
Q

What formula is referred to as the “odds”?

If the odds of defaulting are 1/4, how many people will default on average?

A

p(x) / (1 - p(x))
The ratio of something occurring vs. not occurring

If the odds are 1/4, then p(x) / (1 - p(x)) = 0.25, which solves to p(x) = 0.2.
So on average 1 out of 5 people will default.

12
Q

Why do we use the logistic function in logistic regression?

A

Because we need to model the relationship between predictor and outcome with a function that gives outputs between 0 and 1.

13
Q

Why do we need generative models? Why not just always use discriminative models for classification?

A
  1. When there is substantial separation between the two classes, the parameter estimates for logistic regression can be unstable. This is not so for generative models.
  2. If the distribution of the predictors X is approximately normal in each of the classes and the sample size is small, then the approaches in this section may be more accurate than logistic regression.
  3. Generative methods can be naturally extended to the case of more than two response classes
14
Q

What is linear discriminant analysis (LDA)?

A

A method that approximates the Bayes classifier by plugging in estimates of the unknown parameters (the class means, the shared covariance, and the prior class probabilities).

In the absence of additional information, LDA estimates the prior probability that an observation belongs to the kth class using the proportion of the training observations that belong to the kth class.

A discriminant rule tries to divide the data space into K disjoint regions that represent all the classes of an outcome variable. With these regions, classification by discriminant analysis simply means that we allocate x to class j if x is in region j.

LDA attempts to find a linear boundary that separates the classes of the outcome variable in predictor space. It finds the direction that maximizes the distance between the class means while minimizing the variance (scatter) within each class.

It can also be used to reduce dimensionality.
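
A minimal sketch, assuming scikit-learn, with synthetic two-class data:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict(X[:5]))        # class assignments
print(lda.predict_proba(X[:5]))  # estimated posterior probabilities Pr(Y = k | X = x)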

15
Q

What is a “decision boundary”?

A

A decision boundary is the set of predictor values where the discriminant functions have the same value. In other words, any observation that falls on the decision boundary is equally likely to have come from the two classes on either side of it.

16
Q

What distribution does linear discriminant analysis assume for predictor variables?

A

Gaussian/normal distribution for 1 predictor

Multivariate Gaussian/normal distribution for predictors > 1

17
Q

What is the Bayes Classifier?

Why don’t we always use the Bayes Classifier?

A

A very simple classifier that assigns each observation to the most likely class, given its predictor values. Must know the conditional probabilities.

If P(Y =1 | X = x) > 0.5, then observation x would be assigned to class Y = 1

We don’t use this for real data because we do not know the conditional distributions of Y given X. We use methods that attempt to estimate the conditional distributions before classifying observations. Thus, the Bayes classifier is an unattainable gold standard against which to compare other methods.

18
Q

PERFORMANCE OF LDA

What is an ROC curve?

What does “sensitivity” refer to?

What does “specificity” refer to?

A

A popular graphic for simultaneously displaying the two types of errors (false positive rate vs. true positive rate).

The name stands for "receiver operating characteristics."

Sensitivity: the true positive rate, i.e. the percentage of actual positives correctly identified.

Specificity: the true negative rate, i.e. the percentage of actual negatives correctly identified.

The ROC curve plots sensitivity on the y-axis and 1 - specificity (the false positive rate) on the x-axis. An ideal curve hugs the top left corner.
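
A minimal sketch, assuming scikit-learn, with made-up labels and scores:

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]  # e.g. predicted probabilities from LDA

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # fpr = 1 - specificity, tpr = sensitivity
print(fpr, tpr)
print(roc_auc_score(y_true, y_score))  # area under the ROC curve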

19
Q

What is type I and type II error?

A

Type I error = false positive; its rate is 1 - specificity (the false positive rate).

Type II error = false negative; its rate is 1 - sensitivity (the false negative rate).

20
Q

What is quadratic discriminant analysis (QDA)?

A

A classifier similar to LDA but unlike LDA it does not assume that each class has the same covariance matrix.

21
Q

How does LDA compare to QDA in terms of the bias-variance tradeoff?

A

Since QDA calculates a separate covariance matrix for each class, it must estimate a very large number of parameters when the number of predictors is high. LDA avoids this because it assumes all classes share a single covariance matrix. This means QDA is much more flexible than LDA. Consequently, if LDA's assumption that each class shares a covariance matrix is badly off, LDA suffers from high bias. Roughly speaking, LDA tends to be better than QDA when there are relatively few training observations and reducing variance is crucial. In contrast, QDA is recommended when the training set is very large, so that the variance of the classifier is not a major concern, or when the assumption of a common covariance matrix for the K classes is clearly untenable.

LDA: less flexible, greater risk of bias but less variance (it obviously does better when the predictors really are Gaussian with a common covariance matrix across classes).
QDA: more flexible, greater risk of variance but lower risk of bias.

22
Q

How does naive bayes compare to LDA & QDA when it comes to calculating the density functions of predictors for each class?

A

Naive Bayes assumes that the predictors are independent within each class. This makes it much easier to calculate the density function (the likelihood of x given class k). If the predictors are associated, the density functions have to account for the joint distribution of the predictors, not just their marginal distributions. Assuming the p covariates are independent eliminates the need to model any association, because we have simply assumed there isn't one.
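
Written out in the same notation as the class densities, the naive Bayes assumption is that the within-class joint density factors into a product of one-dimensional marginal densities:

f_k(x) = f_k1(x1) * f_k2(x2) * … * f_kp(xp)

so only p one-dimensional densities need to be estimated per class, rather than one p-dimensional joint density.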

23
Q

When is naive bayes a good choice to use for classification?

A

Naive Bayes often leads to decent results when n is not large enough relative to p. This is because it is very hard to estimate the joint distribution of the predictors within each class without a large sample size. In fact, naive Bayes is a good choice in many situations, given the huge amount of data needed to estimate joint distributions. Naive Bayes introduces some bias but reduces variance, leading to a classifier that works quite well in practice as a result of the bias-variance trade-off.

24
Q

So, which classifier is better between KNN, LDA, QDA, logistic, and naive bayes?

A

It depends.

If true decision boundaries are linear, LDA and logistic tend to perform well.

If boundaries are moderately non-linear, QDA or naive bayes may be better.

For much more complicated decision boundaries, a non-parametric approach like KNN might be best.

25
Q

Can you use KNN with mixed data (categorical and numerical)?

A

https://www.quora.com/How-can-I-use-KNN-for-mixed-data-categorical-and-numerical

You can use KNN by converting the categorical values into numbers.

But it is not clear that you should. If the categories are binary, then coding them as 0-1 is probably okay. But as soon as you get more than two categories, things get problematic. If the values are "Low", "Intermediate", and "High" (or more generally, if they at least have a natural order), then you can again make sense of coding them numerically as 1, 2, 3. But if the values are "Red", "Green", "Blue" (or more generally, something that has no intrinsic order), then simply coding them as integers won't work. One possibility in that case is to place them equally spaced around a circle, since then the distance between any pair of them is the same. With N > 3 values, you may want to put them at the vertices of a regular simplex in N-dimensional space. In other words, code them as (1,0,…,0), (0,1,0,…,0), …, (0,0,…,0,1). Then the distance between any pair of values is the same.

The key thing to think about is exactly "what is the appropriate definition of distance for my data?" KNN is typically implemented with Euclidean distance. Depending on the structure of your combination of numerical and categorical data, this may or may not be reasonable.
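
A rough sketch, assuming scikit-learn and pandas; the column names and values are hypothetical:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

df = pd.DataFrame({
    "income": [30, 55, 80, 42],                 # numeric predictor
    "color": ["Red", "Green", "Blue", "Red"],   # unordered categorical predictor
    "label": [0, 1, 1, 0],
})

# One-hot (simplex-style) encoding makes the distance between any two colors equal;
# in practice the numeric columns should also be rescaled so they don't dominate the distance
X = pd.get_dummies(df[["income", "color"]], columns=["color"])
knn = KNeighborsClassifier(n_neighbors=3).fit(X, df["label"])
print(knn.predict(X))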

26
Q

What is “one hot encoding”?

Why use this?

A

One hot encoding is a process of converting categorical variables into a form that can be provided to machine learning algorithms to improve predictions. It is a crucial part of feature engineering for machine learning.

One hot encoding is useful for categorical values that have no ordinal relationship to one another. Machine learning algorithms treat the numeric order of encoded values as meaningful; in other words, they will read a higher number as better or more important than a lower number.

While this is helpful for some ordinal situations, some input data does not have any ranking for category values, and this can lead to issues with predictions and poor performance. That’s when one hot encoding saves the day.

One hot encoding makes our training data more useful and expressive, and it can be rescaled easily. By using numeric values, we more easily determine a probability for our values. In particular, one hot encoding is used for our output values, since it provides more nuanced predictions than single labels.

One hot encoding is an example of feature engineering.

https://www.educative.io/blog/one-hot-encoding#what
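
A minimal sketch, assuming pandas; the "color" column is hypothetical:

import pandas as pd

df = pd.DataFrame({"color": ["Red", "Green", "Blue", "Green"]})

# Each category becomes its own 0/1 column, so no artificial ordering is implied
print(pd.get_dummies(df, columns=["color"]))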