Logistic Regression (6) Flashcards
Categorical Independent Variables
• what happens when one of the independent variables of a regression analysis is
categorical?
• eg, you’re working on a land use project evaluating various soil properties, one of
which is “parent rock type”
• or, a regression model for estimating house value that includes whether the house
has an attached garage, a detached garage, or no garage?
• in these cases we introduce a set of dummy variables
• when needed, there will be k – 1 dummy variables, where k is the total number of
categories available
• always remove 1 category – otherwise the dataset will have perfect
multicollinearity and therefore violate the assumptions of the regression analysis
• dummy variables are always binary, 0 or 1, and represent presence or absence of
that phenomenon – if observation 1 is present in category A, then it cannot occur in
any other category
• interpretation of the model follows the same logic as interpreting a simple
regression model:
ŷ(price) = 131,034 + 38,408(attached garage) + 23,153(detached garage)
• a house with no garage would be $131,034
• a house with an attached garage would be $38,408 more than one without a
garage, or $169,442
• a house with a detached garage would be $23,153 more than one without a garage,
or $154,187
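The dummy-variable arithmetic above can be sketched as a small function; the coefficients are the ones quoted in the example, with "no garage" as the omitted reference category.

```python
# k = 3 garage categories (none, attached, detached) -> k - 1 = 2 dummies;
# "no garage" is the omitted reference category absorbed by the intercept.

def predict_price(attached, detached):
    """Predicted house value from the dummy-variable model above."""
    intercept = 131_034   # reference category: no garage
    b_attached = 38_408   # premium for an attached garage
    b_detached = 23_153   # premium for a detached garage
    return intercept + b_attached * attached + b_detached * detached

print(predict_price(0, 0))  # no garage: 131034
print(predict_price(1, 0))  # attached garage: 169442
print(predict_price(0, 1))  # detached garage: 154187
```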
Categorical dependent variable
• what happens when the dependent variable of a regression analysis is categorical?
• eg, you’re working on a species distribution project and want to design a predictive
model of whether a species is present or not based on a suite of environmental
conditions
• or, a regression model to determine whether commuters will take the train based
on how long their travel time by car is
Logistic curves are better at integrating categorical dependent variables
Logistic regression model
• a linear model does not work for a categorical dependent variable
• this requires a new form of regression analysis known as logistic, or logit, regression
• logistic regression does 4 things:
• take the probability of an event/phenomenon occurring for different values of each
independent variable
• take the ratio of those probabilities, which is a continuous, non-negative value
• take the logarithm of that ratio, known as the logit, which is then fitted to the
independent variable(s) using linear regression
• the predicted logit is then converted back into predicted probabilities using an
exponential function
• the result is an estimate of the probability of the dependent variable occurring (ie, the
likelihood of the presence of some phenomenon)
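The probability → odds → logit → probability round trip above can be sketched with an example probability of 0.75:

```python
import math

# The four steps: probability -> odds -> logit -> back to probability.
p = 0.75                          # example probability of the event
odds = p / (1 - p)                # odds ratio: continuous, non-negative (here 3.0)
logit = math.log(odds)            # the logit, which linear regression fits
p_back = math.exp(logit) / (1 + math.exp(logit))  # inverse (logistic) function

print(odds, logit, p_back)
```

The last line recovers the original probability, which is how a fitted logit is converted back into a predicted probability.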
• there is no exact equivalent for r² in logistic regression – instead there are several typical values that software will provide for you
• the Cox & Snell r² is interpreted similarly to linear regression r², but its maximum value is 0.75 and can change to < 0.5 as n changes
• Nagelkerke r² is an attempt to correct the Cox & Snell r² so that it has a maximum value of 1, like the ordinary r² – although this is a little more interpretable it is still dependent on n and may give ambiguous results
• a third option – not given by SPSS – is known as the likelihood ratio and is most analogous to the ordinary r², representing the difference between the predicted and observed deviance
• these are all known as pseudo-r²s
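A sketch of how these pseudo-r² values are computed from a model's log-likelihoods; the null (intercept-only) and full-model log-likelihoods here would come from fitted models, and the example numbers below are hypothetical.

```python
import math

def pseudo_r2(ll_null, ll_model, n):
    """Common pseudo-r^2 values from null and full-model log-likelihoods."""
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    max_cs = 1 - math.exp(2 * ll_null / n)     # Cox & Snell ceiling, always < 1
    nagelkerke = cox_snell / max_cs            # rescaled so the maximum is 1
    likelihood_ratio = 1 - ll_model / ll_null  # likelihood-ratio (McFadden) r^2
    return cox_snell, cagelkerke if False else nagelkerke, likelihood_ratio

# Hypothetical log-likelihoods for illustration:
cs, nk, lr = pseudo_r2(ll_null=-100.0, ll_model=-80.0, n=200)
print(cs, nk, lr)
```

Note that Nagelkerke is always at least as large as Cox & Snell, since it divides by a ceiling smaller than 1.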
• what if the dependent variable is categorical but not binary? – eg, predict the dominant vegetation type (grassland, tundra, deciduous forest, etc.) based on a number of independent climate variables
• this is known as multinomial logistic regression
• in this approach, you need to define a reference category, and the regression model
will provide you with probability of occurrence vs that reference category
• eg, a unit increase in temperature is associated with an increase in grassland vs
tundra by 0.07
• multinomial logistic regression is based on maximum likelihood estimation, which requires large datasets
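The reference-category idea can be sketched as follows: each non-reference category gets its own linear predictor (its logit versus the reference), while the reference category's predictor is fixed at 0. The vegetation categories echo the example above, but all coefficients are made up for illustration.

```python
import math

def multinomial_probs(temp):
    """Category probabilities vs a reference category (hypothetical coefficients)."""
    eta = {
        "tundra": 0.0,                        # reference category
        "grassland": -1.0 + 0.07 * temp,      # logit of grassland vs tundra
        "deciduous forest": -2.0 + 0.05 * temp,
    }
    denom = sum(math.exp(v) for v in eta.values())
    return {k: math.exp(v) / denom for k, v in eta.items()}

print(multinomial_probs(20.0))
```

The probabilities always sum to 1, and the grassland coefficient of 0.07 plays the role of the "grassland vs tundra" effect described above.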
• what if the dependent variable is composed of more than 2 ordered categories? – eg,
predict the level of education attained by a person based on a number of socioeconomic
indicators
• this is known as ordinal logistic regression
• in this method, each ranked category has a certain probability of occurrence and the spacing between categories is treated as equal – the effects of the independent variables accumulate until the observation crosses the threshold between adjacent categories
• like the other forms of logistic regression, this is based on maximum likelihood estimation, which requires large sample sizes
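The threshold idea behind ordinal logistic regression can be sketched with a cumulative-logit (proportional odds) model: each ranked category has a threshold, and a single common slope moves an observation across those thresholds. The thresholds and slope below are hypothetical.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def ordinal_probs(x, thresholds=(-1.0, 0.5, 2.0), slope=0.8):
    """P(Y = j) for 4 ordered categories via cumulative logits.

    P(Y <= j) = sigmoid(threshold_j - slope * x); individual category
    probabilities are differences of adjacent cumulative probabilities.
    """
    cum = [sigmoid(t - slope * x) for t in thresholds] + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

print(ordinal_probs(1.0))
```

Because the thresholds are ordered and the slope is shared, the cumulative probabilities are monotone and the category probabilities are non-negative and sum to 1.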
• the coefficients of simple and multiple regression models are estimated using
“ordinary least squares”; in logistic regression, this process is known as
“iteratively reweighted least squares”
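A minimal sketch of iteratively reweighted least squares for a one-predictor logistic model, using a tiny made-up dataset. Each iteration solves a weighted least-squares (Newton) update with weights p(1 − p), which is the mechanism the note above names.

```python
import math

# Tiny hypothetical dataset: binary outcome vs one predictor.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0,   0,   0,   1,   0,   1,   1,   1]

b0, b1 = 0.0, 0.0          # start from a flat model
for _ in range(25):
    ps = [1 / (1 + math.exp(-(b0 + b1 * x))) for x in xs]
    ws = [p * (1 - p) for p in ps]               # IRLS weights
    # Gradient of the log-likelihood:
    g0 = sum(y - p for y, p in zip(ys, ps))
    g1 = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
    # Weighted 2x2 normal equations (the "reweighted least squares" step):
    h00 = sum(ws)
    h01 = sum(w * x for w, x in zip(ws, xs))
    h11 = sum(w * x * x for w, x in zip(ws, xs))
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

print(b0, b1)   # fitted intercept and slope
```

With this data the slope comes out positive, so the predicted probability of the event rises with x, as expected from the 0/1 pattern.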