Section 2 Logistic Regression Flashcards
What is w0 in linear regression?
Parameter w0 is the intercept, allowing for any fixed offset in the data. It is often called the bias.
How are the regression coefficients found in linear regression?
Maximisation of the log-likelihood is equivalent to minimisation of the squared loss function; both lead to the same estimates of the coefficients.
What is the motivation for logistic regression?
Logistic regression models a binary categorical target variable given a collection of predictors. The motivation is to map the real line R to the interval (0, 1) in order to model Pr(Y = 1 | X). Logistic regression provides an interpretable model, in which we can understand how the traits of individual observations contribute to their classification.
In linear regression what is the error assumption?
We assume the errors in our linear regression are normally distributed with zero mean and variance sigma^2.
Explain classification
Predicting the value of a categorical Y using inputs X is called classification, since we assign the observation to a category, or class.
What are linear regression models used for?
Linear regression models the relationship between a numerical (real-valued) response and multiple predictors using a linear model. It cannot be used directly for a binary/categorical response.
What is logistic regression?
Logistic regression is used to model a binary categorical variable given a collection of predictors. It maps the real numbers to the interval (0, 1). The logistic regression model defines the probability p as a function of the parameters and predictors via the logistic function.
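A minimal sketch of the logistic function and the implied probability model, in pure Python (function and variable names are illustrative, not from the notes):

```python
import math

def sigmoid(t):
    """Logistic function: maps the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def prob_y1(w0, w, x):
    """Pr(Y = 1 | X = x) under logistic regression with bias w0 and weights w."""
    logit = w0 + sum(wj * xj for wj, xj in zip(w, x))
    return sigmoid(logit)

# A logit of 0 maps to probability 0.5; large logits approach 1, small approach 0.
```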
Interpret w0
The intercept w0 is the value of the logit when all predictors are equal to 0. The intercept w0 is sometimes called the bias.
Interpret the weights w
The coefficient wj measures the effect of variable Xj on the logit: a one-unit increase in Xj changes the logit by wj (equivalently, multiplies the odds by exp(wj)), holding the other predictors fixed.
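As a numerical illustration of the standard odds-ratio interpretation (the coefficient values here are hypothetical, not from the notes): a one-unit increase in the predictor multiplies the odds p/(1 - p) by exp(w1).

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def odds(p):
    return p / (1.0 - p)

w0, w1 = -1.0, 0.7   # hypothetical intercept and slope
p_at_2 = sigmoid(w0 + w1 * 2.0)
p_at_3 = sigmoid(w0 + w1 * 3.0)

# The odds ratio for a one-unit increase in x equals exp(w1).
ratio = odds(p_at_3) / odds(p_at_2)
```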
R function to fit a logistic regression
glm(), with family = binomial, fits logistic regression models. glm() fits generalised linear models, which allow response distributions other than the normal.
What does R return using glm function?
Estimates of the weight parameters, standard errors of these estimates, z-values and p-values.
Define a hypothesis test to determine if a variable has a significant effect on the target variable
To determine if a variable Xj has a significant effect on the target variable Y , we may wish to perform the hypothesis test:
H0: wj = 0 vs HA: wj ≠ 0
Name three estimation options for parameter estimates
Maximum likelihood estimation, minimisation of the cross-entropy loss function, or minimisation of the mean squared error loss function.
What is the response variable modelled as under logistic regression?
The response variable is modelled by a Bernoulli distribution, given the values of the input variables.
How is wj generally estimated?
By maximising the log-likelihood, or equivalently by minimising a loss function. No closed-form solution for wj is available, so the optimisation is performed numerically using software. In the GLM literature, maximisation of the logistic regression log-likelihood is performed using the Newton-Raphson algorithm.
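As a sketch, the Bernoulli log-likelihood being maximised can be written out directly (pure Python; the function name is my own):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_likelihood(w0, w, X, y):
    """Bernoulli log-likelihood: sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    ll = 0.0
    for x, yi in zip(X, y):
        p = sigmoid(w0 + sum(wj * xj for wj, xj in zip(w, x)))
        ll += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return ll
```

Maximising this quantity over (w0, w) is exactly the minimisation of the cross-entropy loss with the sign flipped.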
Why would the mean squared error loss function not make sense for logistic regression data?
The mean squared error treats the binary labels 0/1 as numerical quantities, which they are not; combined with the logistic function it also yields a non-convex loss, unlike the cross-entropy.
Explain information theory
Information theory revolves around the quantification of the information of an event with respect to its likelihood of happening: rare events carry more information than common ones.
Define entropy
Entropy quantifies the information over the entire probability distribution P. It allows us to measure the variability of a categorical variable, by giving a measure of how spread out the probability values are.
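A short numerical sketch of the standard entropy formula H(P) = -sum_k p_k log p_k (the formula is standard, not copied from the notes):

```python
import math

def entropy(p):
    """Entropy of a discrete distribution, in nats; terms with p_k = 0 contribute 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

# A uniform distribution is maximally spread out; a point mass has zero entropy.
h_uniform = entropy([0.25, 0.25, 0.25, 0.25])  # equals log(4)
h_point = entropy([1.0, 0.0, 0.0, 0.0])        # equals 0
```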
Define cross entropy
Suppose P denotes a probability distribution of interest, while Q is a probability distribution used to estimate P.
Cross-entropy measures the expected "surprisal" of an observer who assigns probabilities Q to data actually generated according to probabilities P (replacing one set of probabilities with another from a different distribution quantifies the difference between the two distributions).
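A sketch of the standard cross-entropy formula H(P, Q) = -sum_k p_k log q_k (standard definition; names are my own):

```python
import math

def cross_entropy(p, q):
    """Expected surprisal under Q of data generated from P."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
# Cross-entropy is minimised when Q = P, where it equals the entropy of P.
```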
What is the gradient of a loss function?
The gradient ∇l(w; D) of the loss function is the vector of all partial derivatives ∂l/∂wj.
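For the logistic regression cross-entropy loss this gradient has a standard closed form (a reconstruction using the usual notation, with sigma the logistic function):

```latex
\frac{\partial \ell(\mathbf{w}; \mathcal{D})}{\partial w_j}
  = \sum_{i=1}^{n} \left( \sigma(w_0 + \mathbf{w}^\top \mathbf{x}_i) - y_i \right) x_{ij}
```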
Explain gradient descent concept
Using the information from the gradient, a first-derivative-based algorithm can be devised to efficiently locate a local minimum by moving in the direction of the negative gradient, as it is always "downhill". This iterative optimization is called gradient descent.
What is eta in the gradient descent optimisation algorithm?
The learning rate η determines the size of the step and is usually set to be small: if the steps are too big you risk overshooting the optimal point; if too small, optimisation will take a long time. The steps are not necessarily the same every time, but are proportional to η.
When does gradient descent algorithm converge
The gradient descent algorithm converges when all the elements of the gradient are (numerically) zero.
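A self-contained sketch of gradient descent for logistic regression on a toy data set (pure Python; the data, function names and hyperparameters are my own illustrative choices):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def gradient(w, X, y):
    """Gradient of the cross-entropy loss; w[0] is the bias, each x is prepended with 1."""
    g = [0.0] * len(w)
    for x, yi in zip(X, y):
        z = [1.0] + x
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, z)))
        for j, xj in enumerate(z):
            g[j] += (p - yi) * xj
    return g

def gradient_descent(X, y, eta=0.1, steps=2000):
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        g = gradient(w, X, y)
        w = [wj - eta * gj for wj, gj in zip(w, g)]  # step in the -gradient direction
    return w

# Tiny, NOT completely separated data: larger x tends to give y = 1.
X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y = [0, 0, 1, 0, 1, 1]
w = gradient_descent(X, y)  # w[1] should come out positive
```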
Explain concept of complete separation
Complete separation in logistic regression, sometimes also referred to as perfect prediction, happens when a predictor variable (or a combination of predictors) separates the outcome variable completely.
Logistic regression tries to fit a sigmoid curve to the data.
The coefficient w measures the slope of the curve. As the value of w increases to ∞, we get a better and better fit to completely separated data: the fitted curve approaches a vertical step at the boundary between the two data sections.
How can we recognize complete separation of data?
Software will output an arbitrarily large parameter estimate with a very large standard error and throw a warning.
OR
If, during gradient descent, the sigmoid function outputs exactly 1 or 0 and the algorithm gets stuck.
How can one address the problem of complete separation of data?
Regularised logistic regression can be used to address the problem. It adds an extra penalty term to the loss, which keeps the coefficients finite and prevents gradient descent from getting stuck.
What are the consequences of complete separation of data?
Complete separation is a problem for inference on the coefficients, while it might not constitute a problem if the main purpose is classification.
Always check: if there is complete separation, you cannot make reliable inference on the coefficients.
What is logistic function saturation?
Software may warn that some of the estimated probabilities are numerically equal to 0 or 1.
This happens when the logistic function "saturates", that is, when it numerically attains its boundary values (asymptotes) of 0 or 1: the argument of the function is so large in magnitude that the output is numerically indistinguishable from 0 or 1.
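Numerically, saturation is easy to see in double precision (a quick check of my own, not from the notes):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# For a large enough argument the output is indistinguishable from 1 in float64:
saturated = sigmoid(40.0)      # numerically exactly 1.0
not_saturated = sigmoid(4.0)   # roughly 0.982, still strictly below 1
```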
What could cause logistic function saturation, and what are the consequences?
It could be due to: large coefficient magnitudes, large observed xij magnitudes, or a combination of both.
If it is due to large xij values, the software will throw a warning, but inference on the estimated coefficients may still be valid.
What is multinomial logistic regression
Multinomial logistic regression is used for multi-class classification.
Suppose that Y can take K attributes/classes/categories
The model is formulated using the softmax function.
How do the weights work in multinomial logistic regression?
You will have a set of biases w0k and weight vectors wk, one for each class k = 1, ..., K.
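A sketch of the softmax function mapping K per-class scores (logits) to K probabilities; subtracting the maximum score is a standard numerical-stability trick, and the names are my own:

```python
import math

def softmax(scores):
    """Map K real-valued class scores to K probabilities summing to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One score per class: score_k = w0k + wk . x  (per-class bias and weight vector).
probs = softmax([2.0, 1.0, -1.0])
```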
How are parameters optimized for multinomial regression?
Like for logistic regression, optimization and estimation of parameters can be implemented using gradient descent.
What issues can arise with multinomial regression?
In multinomial logistic regression, complete separation and saturation can happen, but separation can be more difficult to detect. It is very rare: generally the classes are intertwined and no two components are completely separated.
Warnings thrown for complete separation in R
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred