Logistic Regression Flashcards

1
Q

What parameters can be tuned in Logistic Regression models? Explain how they affect model learning.

A

Logistic regression models can be tuned through regularization: both the penalty norm (commonly the L2 norm, but other norms such as L1 as well) and the regularization strength, which controls how strongly the coefficients are shrunk toward zero and therefore how much the model can overfit the training data.
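A minimal sketch of tuning these knobs with a grid search (assuming the scikit-learn API; the data, solver choice, and grid values are illustrative assumptions):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=500, random_state=0)

    # penalty picks the norm; C is the INVERSE regularization strength,
    # so a smaller C shrinks the coefficients more aggressively
    grid = GridSearchCV(
        LogisticRegression(solver="liblinear"),  # liblinear supports l1 and l2
        param_grid={"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]},
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_)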

2
Q

Can gradient descent get stuck at local minima when training a logistic regression model? Why?

A

No. Gradient descent will not get stuck at a local minimum because the logistic regression cost function (the negative log-likelihood, or cross-entropy) is convex: any local minimum is also the global minimum.
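A toy gradient descent on that convex loss (a sketch; the simulated data, learning rate, and iteration count are arbitrary choices):

    import numpy as np

    # noisy two-feature data so a finite optimum exists
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    noise = rng.normal(scale=0.5, size=200)
    y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(float)

    w, b, lr = np.zeros(2), 0.0, 0.1
    for _ in range(1000):
        p = 1 / (1 + np.exp(-(X @ w + b)))   # sigmoid predictions
        grad_w = X.T @ (p - y) / len(y)      # gradient of the convex loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    print(w, b)   # heads toward the single global minimum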

3
Q

Can Logistic Regression produce a probability score along with its classification prediction?

A

Yes. The model outputs a probability score via the sigmoid function, and the classification prediction is simply that probability thresholded (typically at 0.5).
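For example (a sketch assuming the scikit-learn API):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(random_state=0)
    clf = LogisticRegression().fit(X, y)
    print(clf.predict(X[:3]))        # hard class labels
    print(clf.predict_proba(X[:3]))  # probability score for each class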

4
Q

Is Logistic Regression a regressor or a classifier?

A

Logistic regression is usually used as a classifier because it predicts discrete classes.

Having said that, it technically outputs a continuous value associated with each prediction.

So we see that Logistic Regression is actually a regression algorithm (hence the name) that can solve classification problems.

It is fair to say that it is a classifier because it is used for classification, although technically it is also a regressor.

5
Q

Explain how Logistic Regression works to a five-year-old

A

Logistic regression is similar to linear regression, except that it predicts whether something is True/False.

e.g. target = {obese, not obese}, X = [weight]

Instead of fitting a straight line through the feature data X, logistic regression fits an S-shaped logistic (sigmoid) curve bounded between 0 and 1.

The decision threshold: if the predicted probability that a mouse is obese is > 0.5, then predict Obese = True.

Logistic regression does NOT have the same concept of "residuals", so it can't use least squares and it can't compute R².

Instead, logistic regression uses maximum likelihood.

Max-likelihood in a nutshell:

For a candidate sigmoid curve, use each mouse's weight to read off the predicted probability that it is obese; an obese mouse contributes that probability p to the likelihood, and a non-obese mouse contributes 1 - p.

Do this for all xi observations; then, to obtain the likelihood of the data given that sigmoid curve, take the product of the individual likelihoods.

Then repeat many times: shift the sigmoid curve and compute a new likelihood of the data.

Finally, the curve with the maximum likelihood is selected, and that is the optimal sigmoid curve.
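A toy sketch of that procedure (the mouse weights, labels, and candidate curves below are invented for illustration):

    import numpy as np

    weight = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # feature: mouse weight
    obese = np.array([0, 0, 1, 1, 1])              # target class

    def data_likelihood(w, b):
        # likelihood of the data for one candidate sigmoid curve
        p = 1 / (1 + np.exp(-(w * weight + b)))    # predicted proba of obese
        # obese mice contribute p, non-obese mice contribute 1 - p
        return np.prod(np.where(obese == 1, p, 1 - p))

    # "shift the curve" a few times and keep the max-likelihood one
    candidates = [(1.0, -2.0), (2.0, -5.0), (3.0, -8.0)]
    best = max(candidates, key=lambda wb: data_likelihood(*wb))
    print(best, data_likelihood(*best))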

6
Q

Can we use the metric R² in logistic regression?

A

NO.

Logistic regression does NOT have the concept of a residual, so it CANNOT use least squares and therefore CANNOT calculate R².

Instead, logistic regression uses MAXIMUM LIKELIHOOD (see the sketch after the steps below).

In a nutshell:

  1. Start with an S-shaped SIGMOID curve, which assigns a probability of observing a class in y given xi.
  2. For every pair (xi, yi) in the data, use the sigmoid to calculate the LIKELIHOOD of observing class yi given xi.
  3. Multiply all the individual LIKELIHOODS together to obtain the likelihood of the data given that particular SIGMOID curve.
  4. Repeat with shifted curves and keep the one with the maximum likelihood.
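A sketch of that fit done numerically (assuming scipy.optimize; the toy x, y data are invented):

    import numpy as np
    from scipy.optimize import minimize

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([0, 1, 0, 1, 1])

    def neg_log_likelihood(params):
        # negative log of the product of the individual likelihoods
        b0, b1 = params
        p = 1 / (1 + np.exp(-(b0 + b1 * x)))
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    # maximizing the likelihood = minimizing its negative log
    result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
    print(result.x)   # fitted intercept b0 and slope b1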
7
Q

How are coefficients determined and interpreted in logistic regression?

A

In linear regression the target values are unconstrained, but in logistic regression the predicted values are probabilities bound to [0, 1].

To solve this problem, the y-axis in logistic regression is TRANSFORMED from "the probability of a class" to the "log(odds of a class)", so that, just like in linear regression, it can range over (-inf, +inf).

To transform a class PROBABILITY into LOG-ODDS, apply the LOGIT function to p:

log(odds of positive class in y) = log(p / (1 - p))

where p is the probability of the positive class, in [0, 1].

The logit function transforms the positive-class probability p to new y-axis values as follows:

p = 0    ==> log(0 / (1 - 0))       = -inf

p = 0.5  ==> log(0.5 / (1 - 0.5))   = log(1)     = 0
p = .731 ==> log(.731 / (1 - .731)) = log(2.717) ≈ 1
p = .88  ==> log(.88 / (1 - .88))   = log(7.33)  ≈ 2
p = .95  ==> log(.95 / (1 - .95))   = log(19)    ≈ 3
p = 1    ==> log(1 / 0)                          = +inf

The above maps the positive-class probability p to a new scale of log-odds in the range (-inf, +inf).

i.e., the y-axis is now log(odds of positive class) vs. X, and the best fit is a straight LINE, JUST LIKE LINEAR REGRESSION.

And just like in linear regression, the best-fit line has a y-axis intercept b0 and a slope b1, which are the coefficients of the log-odds line:

b0 is the log-odds y-axis INTERCEPT, i.e. the log-odds when the feature X is ZERO.

b1 is the log-odds SLOPE: for every one-unit increase in x, the log(odds of positive class) increases by b1.
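A minimal sketch of reading b0 and b1 off a fitted model (assuming the scikit-learn API; the simulated data and its true coefficients 0.5 and 2.0 are made up):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # simulate one feature whose true log-odds line is 0.5 + 2.0 * x
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 1))
    p = 1 / (1 + np.exp(-(0.5 + 2.0 * X[:, 0])))
    y = (rng.random(300) < p).astype(int)

    clf = LogisticRegression(C=1e6).fit(X, y)   # large C ~ no regularization
    b0, b1 = clf.intercept_[0], clf.coef_[0, 0]
    print(b0, b1)       # intercept and slope on the LOG-ODDS scale
    print(np.exp(b1))   # multiplicative change in ODDS per unit increase in x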

8
Q

What are odds?

A

Odds are NOT probabilities.

The odds are the RATIO of something HAPPENING over something NOT happening.

(probability is the ratio of something happening over EVERYTHING THAT COULD HAPPEN)

Odds of 1 to 4 mean that for every 5 games, a win will occur 1 time.

Alternatively, we can write this as a fraction:
1 win/ 4 losses
= 1/4

the odds are 0.25 that we win a game.

another example: "the odds in FAVOR of my team winning a game are 5:3".

then odds = 5/3 ≈ 1.67 (while P(win) = 5/8).

Note:
Odds = P(win) /P(lose)
Odds = p / (1-p)

e.g. P(win) = 5/8, then P(lose) = 1-P(win) = 3/8
odds = (5/8) /(3/8) = 5/3

If the odds are FOR my team winning, then the odds are in (1, +inf).

If the odds are AGAINST my team winning, then the odds are in (0, 1).

(Odds of exactly 1 mean winning and losing are equally likely.)
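Two tiny helpers capturing these conversions (hypothetical function names, purely for illustration):

    def prob_to_odds(p: float) -> float:
        # odds = P(win) / P(lose)
        return p / (1 - p)

    def odds_to_prob(odds: float) -> float:
        # invert the odds back to a probability
        return odds / (1 + odds)

    print(prob_to_odds(5 / 8))   # (5/8) / (3/8) = 5/3 ~ 1.67
    print(odds_to_prob(5 / 3))   # 0.625 = 5/8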

9
Q

What is LOG odds and why do we care?

A

Recall the ranges of values that odds can take for and against winning:

odds AGAINST are in (0, 1), from the smaller numerator

odds FOR are in (1, +inf), from the larger numerator

so odds have a large right skew.

Taking LOG(odds) solves the issue of skew and makes everything symmetrical.

e.g.
if odds are 1:6 AGAINST, log(1/6) = -1.79
if odds are 6:1 FOR, log(6/1) = +1.79
thus, the distance from the origin is now the SAME

What's significant about the LOG of odds?

e.g.
if we pick PAIRS of random numbers that add up to 100 and use them to calculate LOG(odds), then draw a histogram, that histogram has the shape of a NORMAL DISTRIBUTION!

This makes the LOG(odds) useful for solving certain stats problems - specifically ones which we need to determine probas about BINARY win/lose, yes/no, true/false situations.
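A quick simulation of that claim (a sketch; the 10,000 draws are an arbitrary choice):

    import numpy as np

    # pick pairs of numbers that add up to 100 and compute log(odds)
    rng = np.random.default_rng(0)
    wins = rng.integers(1, 100, size=10_000)     # losses are 100 - wins
    log_odds = np.log(wins / (100 - wins))
    print(log_odds.mean(), log_odds.std())       # mean ~ 0: symmetric
    # a histogram of log_odds comes out bell-shaped:
    # import matplotlib.pyplot as plt; plt.hist(log_odds, bins=50); plt.show()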

NOTE: the LOG of the RATIO of probas is called LOGIT FUNCTION, which forms the BASIS of LOGISTIC REGRESSION.

10
Q

What is odds ratio and why care?

A

When we say “odds ratio”, it means RATIO of ODDS.

Say we have cancer and cell mutation data:

                cancer    no cancer
mutated yes       23          117
mutated no         6          210

If someone has mutated gene, are ODDS higher they get cancer?

odds(cancer | mutated) = 23/117 ≈ 0.20
odds(cancer | not mutated) = 6/210 ≈ 0.029

odds ratio = (23/117) / (6/210) ≈ 6.88
log(odds ratio) = log(6.88) ≈ 1.93

The odds ratio reveals that someone with the mutated gene has 6.88 times higher ODDS of HAVING CANCER.

The odds ratio and LOG(odds ratio) are like R²: they indicate a RELATIONSHIP between the mutated gene and cancer.

Just like R², their values correspond to EFFECT size.

How do we know if odds ratio or log odds ratio is STATISTICALLY significant?

  1. Fisher’s Exact Test (p value)
  2. Chi Squared Test (p value)
  3. Wald test (p value, CI)

The Wald test computes the number of standard deviations the observed log(odds ratio) is from zero, i.e. log(odds ratio)/std = # of std from zero. If the estimate is more than about 2 standard deviations from zero, it is considered statistically significant (p < 0.05).
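A minimal sketch of the odds-ratio arithmetic plus Fisher's Exact Test (assuming the scipy.stats API; the counts come from the table above):

    import numpy as np
    from scipy.stats import fisher_exact

    #                 cancer   no cancer
    # mutated yes       23        117
    # mutated no         6        210
    table = np.array([[23, 117],
                      [6, 210]])

    odds_mutated = 23 / 117          # odds of cancer with the mutation
    odds_not_mutated = 6 / 210       # odds of cancer without the mutation
    odds_ratio = odds_mutated / odds_not_mutated
    print(odds_ratio, np.log(odds_ratio))   # ~6.88 and ~1.93

    # Fisher's Exact Test gives the same odds ratio plus a p-value
    odds_ratio_fisher, p_value = fisher_exact(table)
    print(odds_ratio_fisher, p_value)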
