advanced topics Flashcards
logistic regression
predicts the probability of the outcome y, P(y), from our predictors (xs)
P(yi) = 1 / (1 + e^-(β0 + β1xi))
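a quick check of this formula in base R (the coefficient and x values here are made up; plogis() is R's built-in logistic function):
beta0 <- -1; beta1 <- 0.5; x <- 2      # made-up values for illustration
1 / (1 + exp(-(beta0 + beta1 * x)))    # by hand: 0.5
plogis(beta0 + beta1 * x)              # same result with base R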
what is probability
range from 0 to 1
binary outcomes
binary variables = type of categorical variable with only two levels
we code them 0 and 1 in terms of whether an event did or did not happen - this is NOT the same as dummy coding
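e.g. a minimal sketch of this coding in R (the data frame and variable names are hypothetical):
df$event <- ifelse(df$outcome == "yes", 1, 0)   # 1 = event happened, 0 = it didn't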
what are odds
odds of an event occurring = the ratio of it occurring : it not occurring
odds can only ever take positive values - unlike probabilities they have no upper bound (they run from 0 to infinity)
odds = probability/(1-probability)
what are log odds
natural log of the odds - when plotted, the log odds are linear and give us a continuous DV
logodds = ln[P(y=1) / (1 - P(y=1))]
log odds above about +4 correspond to probabilities near 1 (close to 100%) and below about -4 to probabilities near 0 - 0 sits in the middle of these, where log odds of 0 = a probability of 50%
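a short R sketch tying probability, odds and log odds together (0.8 is just an example value):
p <- 0.8
odds <- p / (1 - p)    # 4 - the event is 4x more likely to happen than not
log(odds)              # ~1.39; base R's qlogis(p) gives the same value
plogis(log(odds))      # back to the probability, 0.8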
maximum likelihood estimation
MLE is used to estimate logistic regression models as MLE finds the logistic regression coefficients that maximise the likelihood of the observed data having occurred.
MLE maximises the log-likelihood - a higher (less negative) log-likelihood indicates a better model
evaluating logistic regression models
compare our model to a null model (with no predictors) and assess the improvement in fit
we compare our model to our baseline model using deviance
- deviance = -2 * loglikelihood (aka -2LL)
we calculate the difference in deviances between our model and the baseline - this difference follows a chi-square distribution (df = number of added parameters), which gives us a p-value to assess significance
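a sketch of this comparison in R, assuming a fitted logistic model called m1 (glm objects store both deviances):
dev_diff <- m1$null.deviance - m1$deviance   # improvement over the baseline model
df_diff <- m1$df.null - m1$df.residual       # number of predictors added
pchisq(dev_diff, df = df_diff, lower.tail = FALSE)   # p-value for the improvement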
generalised linear model
in R this is the glm() function used to conduct logistic regression - it uses the same format as lm() but with the addition of a family = " " argument to determine what kind of regression we want / how the data will be distributed (for logistic regression, family = binomial)
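a minimal sketch, assuming a data frame df with a binary outcome and one predictor (names made up):
m1 <- glm(outcome ~ predictor, data = df, family = binomial)
summary(m1)   # coefficients are on the log odds scale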
binomial distribution
a discrete probability distribution - the distribution of the number of successes in n independent trials, each with success probability p
probability mass function
probability that a discrete random variable is exactly equal to some value
f(k; n, p) = Pr(X = k) = (n choose k) * p^k * q^(n-k) - see the worked check below the list
where:
- k = number of successes
- n = number of trials
- p = probability of success
- q = probability of failure (1-p)
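a quick check of the PMF in R, e.g. the probability of exactly 3 heads in 10 fair coin flips:
choose(10, 3) * 0.5^3 * 0.5^7      # by hand: ~0.117
dbinom(3, size = 10, prob = 0.5)   # base R equivalent, same answer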
interpreting glm() output
computation of residuals is different now that we're dealing with deviance (rather than variance) - a model with lower residual deviance is better
our β coefficients for the IVs give the change in log odds of y for each unit increase in x
what is odds ratio
log odds don't provide easily interpretable results, so the β coefficients (which are in log odds) are converted to odds ratios, which are easier to interpret
the odds ratio is obtained by exponentiating the β coefficient: OR = e^β
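in R (continuing the hypothetical m1 model from above):
exp(coef(m1))      # odds ratios for each coefficient
exp(confint(m1))   # confidence intervals on the odds ratio scale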
interpreting odds ratio
1 = no effect (the odds are unchanged)
<1 = negative effect - e.g. 0.8 = a 20% decrease in the odds
>1 = positive effect - e.g. 1.2 = a 20% increase in the odds
likelihood ratio test
method of logistic model comparison - compares the likelihoods of two nested models to test whether adding predictors significantly improves fit
- alternative to z-test but can only be used for nested models (non-nested need AIC/BIC)
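a sketch in R, assuming the hypothetical df and m1 from earlier (anova() with test = "Chisq" runs the LRT for nested glms):
m0 <- glm(outcome ~ 1, data = df, family = binomial)   # null/baseline model
anova(m0, m1, test = "Chisq")                          # likelihood ratio test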
z-test
tests the statistical significance of predictors (can be prone to type 2 errors)
z = β / SE(β)
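these z values appear in the standard glm() summary output (again assuming the hypothetical m1):
coef(summary(m1))   # columns: Estimate, Std. Error, z value, Pr(>|z|)
# the z value column is just Estimate / Std. Error, as in the formula above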
power analysis
power is the probability of CORRECTLY detecting an effect that exists - tells us what percentage of the time we would reject the null when it is false
power = 1 - β, where β is the Type II error rate (NOT THE SAME β AS IN A REGRESSION)
power depends on:
- sample size
- effect size
- significance level
conventional value for power
0.8
power calculations in R
use the pwr package
examples:
t test
pwr.t.test(n = group size, d = effect size, sig.level = 0.05, power = 0.8, type = "two.sample", alternative = "greater")
- this is just an example so values may differ; in practice you supply all but one of n, d, sig.level and power, and the function solves for the one you leave out
correlation
pwr.r.test
- basically the same as above but d becomes r (the correlation coefficient)
f-tests
pwr.f2.test(u = k, v = n - k - 1, f2 = effect size, sig.level = 0.05, power = 0.8)
- again just an example, so there will be actual numbers where I've put general symbols, and you leave out the one value you want solved for
what is causality??
one event directly leads to another
- this does not have to be a direct 1:1 relationship