logistic regression Flashcards
what makes logistic regression different from all the other types of regression: linear regression, multiple linear regression, non-linear regression
For the others
They model ratio/scale data. DV must be ratio/scale
necessary because we use the sum of squared residuals as a means to fit the model (a parametric approach)
logistic regression
if DV has a limited range - e.g., either 1 or 0, or between 0 and 100
- could be like marks in a test
- accuracy scores
- etc
what is the typical linear regression equation
Y = c + bX, where c is the constant and b is the coefficient. here we assume a linear relationship between the IV and DV
why can we not use linear regression if our DV is limited in range e.g., pass (1) or fail (0)
because certain values of the IV will indicate 0 (e.g., 40%) and other values might indicate 1 (e.g., 90%)
Problem: for values below 40 the model will predict values lower than 0, and for values above 90 it will predict values higher than 1. Also, anything in between will equate to something between 0 and 1.
this doesn't make sense
- can't have values between 0 and 1: we want to predict ONLY 0 and ONLY 1
- and can't have values exceeding 1 or less than 0, but the regression equation predicts values outside the range of 0 to 1
serious problem. This is because it creates really large residuals, which will distort or bias our regression fit.
Residuals will violate the assumptions of homoscedasticity because the data range is limited to 0 and 1.
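The problem can be seen in a minimal sketch. The scores and pass/fail labels below are made-up illustration data, not values from the lecture, and the line is fitted with the ordinary least-squares formulas for one IV.

```python
# Ordinary least squares fitted by hand to a binary (0/1) outcome.
# The scores and pass/fail labels are made-up illustration data.
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
passed = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

n = len(scores)
mean_x = sum(scores) / n
mean_y = sum(passed) / n

# Slope b and constant c that minimise the sum of squared residuals.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, passed))
s_xx = sum((x - mean_x) ** 2 for x in scores)
b = s_xy / s_xx
c = mean_y - b * mean_x

# Predictions from the straight line Y = c + bX.
predictions = [c + b * x for x in scores]
for x, y_hat in zip(scores, predictions):
    print(f"score={x:3d}  predicted={y_hat:6.3f}")
```

With this data the lowest score gets a prediction below 0 and the highest score a prediction above 1, which is exactly the out-of-range behaviour described above.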
what is logistic regression
a special case of non linear regression.
why is logistic regression a special case of non linear regression
because it deals with this limitation in range
different types of logistic regression
Logistic regression
used if you have a limited range in the DV, e.g., proportion of correct answers on a test. this gives a continuous prediction
Binary logistic regression
type of logistic regression where the DV is binary, e.g., 0 or 1. this just ensures we get a binary outcome of either 0 or 1.
both cases deal with this limitation in range of the DV
if I asked n questions on a 7-point Likert scale and then averaged the scores, would I use logistic regression because technically there is a limited range of answers
No, because while the scale is limited in range you are analysing the average, which, according to the central limit theorem, is approximately normally distributed.
what's the big problem of using linear regression with data limited in range?
the linear equation will fit the 0/1 values at certain points but everywhere else the residual is large! big problem is that it will predict values larger than 1 and smaller than 0.
we have a real problem with the residuals, and whenever we fit linear regression models the residuals are what we use to do the fitting
so it will bias any result we get
can't we just fit a non-linear curve to the binary/limited-range DV
yes, one that nicely levels off at 0 and at 1
Let's say we invent and fit a logistic curve to the binary data and it seems to do quite ok. Can we be satisfied with this
no, while it fits ok we want to find the best-fitting logistic curve. that's what logistic regression does.
the best fitting curve that has an S shape
What is the equation for the non-linear curve we fit in logistic regression
what is e
it's a constant called Euler's number (approximately 2.718)
what is the OLS regression equation
what is the rationale for using the logistic regression equation
- deals with the limitation of range, e.g., 0 to 100
- functional form is very flexible: fits a wide range of data
- there are analytical solutions for it (Euler's number raised to the power of X can be computed directly)
- easier to compute than other non-linear regression problems
just like in linear regression, the form of the equation we are fitting is _____?
fixed
thus when fitting the model we are just finding the best-fitting numerical values for the coefficients in the equation (c and b)
what is logistic regression doing?
modelling/predicting data between 0 and 1
mathematically, what is a prediction and what do we call it in statistics?
mathematically a prediction is the probability that a case has a value of 1 (rather than 0)
how do we get the probability/ prediction
what can we use the probability (prediction) to compute?
the odds
the equation: the probability of an event happening divided by the probability of it not happening, i.e., P / (1 - P)
essentially the odds are Euler's number raised to the power of the fitted linear equation (c + bX, with our best-fitting coefficients). so if you know the logistic regression equation you can compute the probability directly but can also compute the odds
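Written out, and assuming P stands for the predicted probability while c and b are the constant and coefficient used elsewhere in these cards, the relationships described above can be sketched as:

```latex
\begin{align}
\text{odds} &= \frac{P}{1 - P} = e^{c + bX} \\
P &= \frac{\text{odds}}{1 + \text{odds}}
   = \frac{e^{c + bX}}{1 + e^{c + bX}}
   = \frac{1}{1 + e^{-(c + bX)}}
\end{align}
```

The last form is the S-shaped curve: it approaches 0 as c + bX becomes very negative and approaches 1 as c + bX becomes very positive.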
what do we use to measure effect size in logistic regression
the odds ratio
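For a single IV, a one-unit increase in X multiplies the odds by e raised to the power of b, so the odds ratio for that IV equals exp(b). A minimal sketch, assuming hypothetical coefficients (c = -4 and b = 0.5 are invented for illustration and do not come from the lecture data):

```python
import math

# Hypothetical fitted coefficients, chosen only for illustration.
c = -4.0   # constant
b = 0.5    # coefficient for the IV


def odds(x):
    """Odds of being a case at IV value x: e raised to the logit c + b*x."""
    return math.exp(c + b * x)


# The odds ratio compares the odds at two IV values one unit apart.
odds_ratio = odds(9) / odds(8)

print(odds_ratio)    # ratio of the two odds
print(math.exp(b))   # same value, computed directly from the coefficient
```

An odds ratio above 1 means the odds of being a case go up as the IV increases; below 1 means they go down.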
how do we compute the log odds
the natural logarithm of the odds
basically the inverse operation of raising e to a power
what kind of relationship does the logistic regression have with the log odds
any logistic regression is linear with respect to the log odds (just like with OLS regression)
so by taking the natural logarithm of the odds you are creating a new unit (or DV if you will) that is now linear in terms of the independent variable X
so the log odds vary from negative infinity to positive infinity as the probability moves from 0 to 1
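A minimal sketch of how the log odds stretch the 0-to-1 range; the probabilities are arbitrary illustration values and `logit` is just a helper name chosen here.

```python
import math


def logit(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1 - p))


# Probabilities close to 0 give large negative log odds,
# probabilities close to 1 give large positive log odds.
for p in [0.001, 0.1, 0.5, 0.9, 0.999]:
    print(f"p={p:5.3f}  odds={p / (1 - p):10.4f}  logit={logit(p):8.4f}")
```

A probability of .5 corresponds to odds of 1 and a logit of 0, and the logit is symmetric around that point.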
what is another word for log odds
logit
logit regression equation
Logit regression equation (c + bX)
- So result of this Is your logit
- Or logistic probability unit
here the logit to have a yes vs no answer is -213.056
so we have the normal regression equation with the constant and coefficient. to get the logit we just plug X (-6 in this particular case) in
when would you use Euler
if you wanted to compute the odds or probability for a data point that is not in your dataset.
how do you go from the logit to the odds
exp(...) is just a different notation for Euler's number raised to the power of whatever is in the parentheses. this is how you would enter it (universal: this is how it's done in R, SPSS, MATLAB).
describe the relationship between the logit and odds
Raising e to the power of the logit and taking the natural logarithm of the odds are inverse operations of one another
So you take the natural log of the odds to get the logit (orange arrow). To go from the logit to the odds you raise e to the power of the logit (green arrow)
To get the odds you literally just type:
exp(-213.056), where -213.056 is the logit value here
what would 2.9577E-93 be
just means it's a really small number. A negative sign after the E means you shift the decimal point that many places (93 here) to the left.
if it was E+93, then you would shift the decimal point 93 places to the right
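As a quick check of the notation, Python reads E-notation the same way; a minimal sketch:

```python
import math

small = float("2.9577E-93")   # decimal point shifted 93 places to the left
large = float("2.9577E+93")   # decimal point shifted 93 places to the right

# E-93 is the same as dividing by 10**93, E+93 the same as multiplying by it.
print(math.isclose(small, 2.9577 / 10**93))
print(math.isclose(large, 2.9577 * 10**93))
```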
how do you go from the odds to the probability/prediction ?
odds/(1+odds)
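The whole chain from logit to odds to probability can be sketched in a few lines, using the logit of -213.056 quoted in these cards. The variable names are chosen here for illustration.

```python
import math

logit = -213.056          # logit value quoted in these cards

odds = math.exp(logit)    # logit -> odds: e raised to the power of the logit
p = odds / (1 + odds)     # odds -> probability
rounded_p = round(p)      # probability rounded to 0 or 1

print(odds)
print(p)
print(rounded_p)
```

The odds should come out in line with the 2.9577E-93 value discussed above, and because the odds are so tiny the probability is practically identical to the odds and rounds to 0.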
Why has Lore added the column on the end “rounded p”
here has the logistic regression done a good job matching the outcome?
what relationship is there between the measure and the logit
Because the number is so small (look at the E- on the end) that it rounds to 0. And remember we are doing logistic regression, so the outcome has to be either 0 or 1
in this example, you can see the prediction/probability matches the outcome very well (yes/no column). so in this case the logistic regression equation does a really good job.
if you plotted the relationship between the measure and the logit the relationship would be perfectly linear
what is the relationship between the logit and measure?
how does this compare to the relationship between the measure and data itself (the yes/no responses).
linear
this relationship has a limitation in range and has an S shape curve that is the best fit
what the hell is a case vs not a case
is it always right?
when the prediction of the logistic regression is a "yes" or 1 it's a case, and a "no" or 0 is not a case.
no, it is not always right: sometimes your model might predict something to be a case when it's actually not. this is where the residuals come in (this is what we're trying to minimise when we fit the regression equation)
(we try to minimise the sum of the squared residuals)
(residuals of the logit)
how do we turn the probability spat out by the logistic regression into binary outcome
you have a cut-off, typically .5
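A minimal sketch of applying the cut-off. The probabilities are made up, `classify` is a hypothetical helper, and treating a probability of exactly .5 as a case is an assumption (conventions differ).

```python
def classify(p, cutoff=0.5):
    """Return 1 (a case) if the predicted probability reaches the cut-off, else 0."""
    return 1 if p >= cutoff else 0


# Made-up predicted probabilities for illustration.
probabilities = [0.02, 0.49, 0.5, 0.73, 0.99]
labels = [classify(p) for p in probabilities]

print(labels)
```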
what do the odds range between
what range for the probability
what's the relationship between the two
0 and positive infinity
probability can only range between 0 and 1
they are related such that when you have an increase in the probability you have an increase in the odds
Which do we report: the odds or the probability?
up to you!
how do you write up whether something was a case or not?
how can we find the point of 50/50 split in the dataset
take the negative of the constant and divide it by the coefficient
Tells you the value of the IV at which the predicted probability is exactly .5
this data point might not exist in your dataset, but it tells you which level of the IV marks the 50/50 split in prediction
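The reason this works is that the probability is .5 exactly when the logit c + bX is 0, which gives X = -c / b. A minimal sketch, assuming hypothetical coefficients (c = -4 and b = 0.5 are invented for illustration):

```python
import math

# Hypothetical fitted coefficients, chosen only for illustration.
c = -4.0   # constant
b = 0.5    # coefficient for the IV

# Level of the IV where the prediction is a 50/50 split.
x_half = -c / b

# Check: plug that value back into the logistic equation.
p_at_half = 1 / (1 + math.exp(-(c + b * x_half)))

print(x_half)
print(p_at_half)
```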
what is a classification table
a table counting how many cases were correctly and incorrectly labelled as either a fail or a pass (predicted outcome against observed outcome).
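A minimal sketch of building such a table by hand. The observed outcomes and predicted probabilities are made up for illustration, and the .5 cut-off is the typical one mentioned earlier.

```python
# Made-up observed outcomes (1 = pass, 0 = fail) and predicted probabilities.
observed = [0, 0, 0, 1, 1, 1, 1, 0]
predicted_p = [0.10, 0.30, 0.60, 0.80, 0.40, 0.90, 0.70, 0.20]

cutoff = 0.5
predicted = [1 if p >= cutoff else 0 for p in predicted_p]

# table[(observed, predicted)] counts how many cases fall in each cell.
table = {(o, p): 0 for o in (0, 1) for p in (0, 1)}
for o, p in zip(observed, predicted):
    table[(o, p)] += 1

# Correct classifications sit on the diagonal: fail/fail and pass/pass.
correct = table[(0, 0)] + table[(1, 1)]
accuracy = correct / len(observed)

print("observed fail:", table[(0, 0)], "predicted fail,", table[(0, 1)], "predicted pass")
print("observed pass:", table[(1, 0)], "predicted fail,", table[(1, 1)], "predicted pass")
print("proportion correct:", accuracy)
```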