Section 2 Logistic Regression Flashcards
What is w0 in linear regression?
Parameter w0 is the intercept, allowing for any fixed offset in the data. It is often called the bias.
How are the regression coefficients found in linear regression?
Maximisation of the log-likelihood is equivalent to minimisation of the squared loss function; both lead to the same estimates of the coefficients.
What is the motivation for logistic regression?
Logistic regression models a binary categorical target variable given a collection of predictors. The motivation is to map the real line R to the interval (0, 1) in order to model Pr(Y = 1 | X). Logistic regression provides an interpretable model, in which we can understand how the traits of individual observations contribute to their classification.
In linear regression what is the error assumption?
We assume the errors in our linear regression are normally distributed with zero mean and variance sigma^2.
Explain classification
Predicting the value of a categorical Y using inputs X is called classification, since we assign the observation to a category, or class.
What are linear regression models used for?
Linear regression models the relationship between a numerical (real-valued) response and multiple predictors using a linear model. It cannot be used directly for a binary/categorical response.
What is logistic regression?
Logistic regression is used to model a binary categorical variable given a collection of predictors. It maps the real numbers to the interval (0, 1). The logistic regression model defines the probability p as a function of the parameters and predictors via the logistic function.
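A minimal sketch of the logistic function and the implied probability model, in pure Python (function and variable names are illustrative, not from the notes):

```python
import math

def sigmoid(t):
    """Logistic function: maps the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def prob_y1(w0, w, x):
    """Pr(Y = 1 | X = x) under logistic regression with bias w0 and weights w."""
    logit = w0 + sum(wj * xj for wj, xj in zip(w, x))
    return sigmoid(logit)

# A logit of 0 maps to probability 0.5; large logits approach 1, small approach 0.
```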
Interpret w0
The intercept w0 is the value of the logit when all predictors are equal to 0. The intercept w0 is sometimes called the bias.
Interpret the weights w
The coefficient wj measures the effect of variable Xj on the logit: a one-unit increase in Xj changes the logit by wj (equivalently, multiplies the odds by exp(wj)), holding the other predictors fixed.
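As a numerical illustration of the standard odds-ratio interpretation (the coefficient values here are hypothetical, not from the notes): a one-unit increase in the predictor multiplies the odds p/(1 - p) by exp(w1).

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def odds(p):
    return p / (1.0 - p)

w0, w1 = -1.0, 0.7   # hypothetical intercept and slope
p_at_2 = sigmoid(w0 + w1 * 2.0)
p_at_3 = sigmoid(w0 + w1 * 3.0)

# The odds ratio for a one-unit increase in x equals exp(w1).
ratio = odds(p_at_3) / odds(p_at_2)
```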
R function to fit a logistic regression
glm(), with family = binomial, fits logistic regression models. glm() fits generalised linear models, which allow response distributions other than the normal.
What does R return using glm function?
Estimates of the weight parameters, standard errors of these estimates, z-values and p-values.
Define a hypothesis test to determine if a variable has a significant effect on the target variable
To determine if a variable Xj has a significant effect on the target variable Y , we may wish to perform the hypothesis test:
H0: wj = 0 vs HA: wj ≠ 0
Name three estimation options for parameter estimates
Maximum likelihood estimation, minimisation of the cross-entropy loss function, or minimisation of the mean squared error loss function.
What is the response variable modelled as under logistic regression?
The response variable is modelled by a Bernoulli distribution, given the values of the input variables.
How is wj generally estimated?
By maximising the log-likelihood, or equivalently by minimising a loss function. No closed-form solution for wj is available, so the optimisation is performed numerically using software. In the GLM literature, maximisation of the logistic regression log-likelihood is performed using the Newton-Raphson algorithm.
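As a sketch, the Bernoulli log-likelihood being maximised can be written out directly (pure Python; the function name is my own):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_likelihood(w0, w, X, y):
    """Bernoulli log-likelihood: sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    ll = 0.0
    for x, yi in zip(X, y):
        p = sigmoid(w0 + sum(wj * xj for wj, xj in zip(w, x)))
        ll += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return ll
```

Maximising this quantity over (w0, w) is exactly the minimisation of the cross-entropy loss with the sign flipped.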
Why would the mean squared error loss function not make sense for logistic regression data?
The mean squared error treats the binary labels 0/1 as numerical quantities, which they are not; combined with the logistic function it also yields a non-convex loss, unlike the cross-entropy.
Explain information theory
Information theory revolves around the quantification of the information of an event with respect to its likelihood of happening: rare events carry more information than common ones.
Define entropy
Entropy quantifies the information over the entire probability distribution P. It allows us to measure the variability of a categorical variable, by giving a measure of how spread out the probability values are.
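A short numerical sketch of the standard entropy formula H(P) = -sum_k p_k log p_k (the formula is standard, not copied from the notes):

```python
import math

def entropy(p):
    """Entropy of a discrete distribution, in nats; terms with p_k = 0 contribute 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

# A uniform distribution is maximally spread out; a point mass has zero entropy.
h_uniform = entropy([0.25, 0.25, 0.25, 0.25])  # equals log(4)
h_point = entropy([1.0, 0.0, 0.0, 0.0])        # equals 0
```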
Define cross entropy
Suppose P denotes a probability distribution of interest, while Q is a probability distribution used to estimate P.
Cross-entropy measures the expected "surprisal" of an observer who assigns probabilities Q to data actually generated according to probabilities P (replacing one set of probabilities with another from a different distribution quantifies the difference between the two distributions).
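A sketch of the standard cross-entropy formula H(P, Q) = -sum_k p_k log q_k (standard definition; names are my own):

```python
import math

def cross_entropy(p, q):
    """Expected surprisal under Q of data generated from P."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
# Cross-entropy is minimised when Q = P, where it equals the entropy of P.
```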
What is the gradient of a loss function?
The gradient ∇l(w; D) of the loss function is the vector of all partial derivatives ∂l/∂wj.
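For the logistic regression cross-entropy loss this gradient has a standard closed form (a reconstruction using the usual notation, with sigma the logistic function):

```latex
\frac{\partial \ell(\mathbf{w}; \mathcal{D})}{\partial w_j}
  = \sum_{i=1}^{n} \left( \sigma(w_0 + \mathbf{w}^\top \mathbf{x}_i) - y_i \right) x_{ij}
```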
Explain gradient descent concept
Using the information from the gradient, a first-derivative-based algorithm can be devised to efficiently locate a local minimum by moving in the direction of the negative gradient, as it is always "downhill". This iterative optimization is called gradient descent.
What is eta in the gradient descent optimisation algorithm?
The learning rate η determines the size of the step and is usually set to be small: if the steps are too big you risk overshooting the optimal point; if too small, optimisation will take a long time. The steps are not necessarily the same every time, but are proportional to η.
When does gradient descent algorithm converge
The gradient descent algorithm converges when all the elements of the gradient are (numerically) zero.
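A self-contained sketch of gradient descent for logistic regression on a toy data set (pure Python; the data, function names and hyperparameters are my own illustrative choices):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def gradient(w, X, y):
    """Gradient of the cross-entropy loss; w[0] is the bias, each x is prepended with 1."""
    g = [0.0] * len(w)
    for x, yi in zip(X, y):
        z = [1.0] + x
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, z)))
        for j, xj in enumerate(z):
            g[j] += (p - yi) * xj
    return g

def gradient_descent(X, y, eta=0.1, steps=2000):
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        g = gradient(w, X, y)
        w = [wj - eta * gj for wj, gj in zip(w, g)]  # step in the -gradient direction
    return w

# Tiny, NOT completely separated data: larger x tends to give y = 1.
X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y = [0, 0, 1, 0, 1, 1]
w = gradient_descent(X, y)  # w[1] should come out positive
```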
Explain concept of complete separation
Complete separation in logistic regression, sometimes also referred to as perfect prediction, happens when a predictor variable (or a combination of predictors) separates the outcome variable completely.
Logistic regression tries to fit a sigmoid curve to the data.
The coefficient w measures the slope of the curve. As the value of w increases to ∞, we get a better and better fit to completely separated data: the fitted curve approaches a vertical step at the boundary between the two data sections.
How can we recognize complete separation of data?
Software will output an arbitrarily large parameter estimate with a very large standard error and throw a warning.
OR
If, during gradient descent, the sigmoid function outputs exactly 1 or 0 and the algorithm gets stuck.
How can one address the problem of complete separation of data?
Regularised logistic regression can be used to address the problem. It adds an extra penalty term to the loss, which keeps the coefficients finite and prevents gradient descent from getting stuck.
What are the consequences of complete separation of data?
Complete separation is a problem for inference on the coefficients, while it might not constitute a problem if the main purpose is classification.
Always check: if there is complete separation, you cannot make reliable inference on the coefficients.
What is logistic function saturation?
Software may warn that some of the estimated probabilities are numerically equal to 0 or 1.
This happens when the logistic function "saturates", that is, when it numerically attains its boundary values (asymptotes) of 0 or 1: the argument of the function is so large in magnitude that the output is numerically indistinguishable from 0 or 1.
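Numerically, saturation is easy to see in double precision (a quick check of my own, not from the notes):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# For a large enough argument the output is indistinguishable from 1 in float64:
saturated = sigmoid(40.0)      # numerically exactly 1.0
not_saturated = sigmoid(4.0)   # roughly 0.982, still strictly below 1
```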
What could cause logistic function saturation, and what are the consequences?
It could be due to: large coefficient magnitudes, large observed xij magnitudes, or a combination of both.
If it is due to large xij values, the software will throw a warning, but inference on the estimated coefficients may still be valid.
What is multinomial logistic regression
Multinomial logistic regression is used for multi-class classification.
Suppose that Y can take K attributes/classes/categories
The model is formulated using the softmax function.
How do the weights work in multinomial logistic regression?
You will have a set of biases w0k and weight vectors wk, one for each class k = 1, ..., K.
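A sketch of the softmax function mapping K per-class scores (logits) to K probabilities; subtracting the maximum score is a standard numerical-stability trick, and the names are my own:

```python
import math

def softmax(scores):
    """Map K real-valued class scores to K probabilities summing to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One score per class: score_k = w0k + wk . x  (per-class bias and weight vector).
probs = softmax([2.0, 1.0, -1.0])
```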
How are parameters optimized for multinomial regression?
Like for logistic regression, optimization and estimation of parameters can be implemented using gradient descent.
What issues can arise with multinomial regression?
In multinomial logistic regression, complete separation and saturation can happen, but separation can be more difficult to detect. It is very rare: generally the classes are intertwined and no two components are completely separated.
Warnings thrown for complete separation in R
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred