advanced topics Flashcards
logistic regression
predicts the probability of y P(y) from our xs
P(yi) = 1 / (1 + e^-(β0 + β1xi)) - i.e. one over one plus the exponential of minus the linear predictor
what is probability
range from 0 to 1
binary outcomes
binary variables = type of categorical variable with only two levels
we code them 0 and 1 in terms of whether an event did or did not happen - this is NOT the same as dummy coding
what are odds
odds of an event occurring = the ratio of it occurring : it not occurring
odds can only ever be a positive value
odds = probability/(1-probability)
what are log odds
natural log of the odds - when plotted, the log odds are linear and form a continuous DV
logodds = ln[P(y=1) / (1 - P(y=1))]
log odds above +4 correspond to probabilities of roughly 100%, and log odds below -4 to roughly 0% - since 0 is in the middle of these, log odds of 0 = 50%
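a minimal R sketch of converting between probability, odds and log odds (the 0.75 is just an example value; qlogis()/plogis() are the base R log-odds/inverse functions):
p <- 0.75             # example probability
odds <- p / (1 - p)   # odds = 3
log(odds)             # log odds by hand
qlogis(p)             # same log odds via base R
plogis(log(odds))     # back to the probability (0.75)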
maximum likelihood estimation
MLE is used to estimate logistic regression models as MLE finds the logistic regression coefficients that maximise the likelihood of the observed data having occurred.
MLE maximises the log-likelihood (a higher log-likelihood indicates a better model)
evaluating logistic regression models
compare our model to a null model (with no predictors) and assess the improvement in fit
we compare our model to our baseline model using deviance
- deviance = -2 * loglikelihood (aka -2LL)
we calculate the difference in deviances between our model and the baseline and compare it against a chi-square distribution to get a p-value and assess significance
generalised linear model
in R this is the glm() function used to conduct logistic regression - it uses the same format as lm() but with the addition of family = " " to determine what kind of regression we want / how the outcome is distributed (family = binomial for logistic regression)
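a minimal sketch of fitting a logistic regression with glm() (the data frame dat and the variables outcome/predictor are made up for illustration):
set.seed(1)
dat <- data.frame(predictor = rnorm(100))
dat$outcome <- rbinom(100, 1, plogis(-0.5 + 0.8 * dat$predictor))  # hypothetical 0/1 outcome
m1 <- glm(outcome ~ predictor, data = dat, family = binomial)
summary(m1)   # coefficients are on the log-odds scale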
binomial distribution
a discrete probability distribution
probability mass function
probability that a discrete random variable is exactly equal to some value
f(k; n, p) = Pr(X = k) = (n choose k) × p^k × q^(n-k)
where:
- k = number of successes
- n = number of trials
- p = probability of successes
- q = probability of failure (1-p)
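a quick check of the formula in R (the probability of exactly 7 successes in 10 trials with p = 0.5, values just illustrative):
n <- 10; k <- 7; p <- 0.5
choose(n, k) * p^k * (1 - p)^(n - k)   # formula written out by hand
dbinom(k, size = n, prob = p)          # same value from base R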
interpreting glm() output
computation of residuals is different now that we're dealing with deviance (rather than variance) - a model with less residual deviance is better.
our β coefficients for the IVs are the change in log odds of y for each unit increase in x
what is odds ratio
log odds don't provide easily interpretable results, therefore the β coefficients (on the log-odds scale) are converted to odds ratios, which are easier to interpret.
odds ratio is obtained by exponentiating the β coefficients
interpreting odds ratio
1 = no effect (the odds are unchanged)
<1 = negative effect - e.g. 0.8 = decrease in odds
>1 = positive effect - e.g. 1.2 = increase in odds
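in R the odds ratios come from exponentiating the glm() coefficients (m1 here is the hypothetical model from the glm() sketch above):
exp(coef(m1))      # odds ratios for the intercept and predictor
exp(confint(m1))   # 95% confidence intervals on the odds-ratio scale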
likelihood ratio test
method of logistic model comparison = tests whether a fuller model significantly improves the likelihood over a simpler (nested) model
- alternative to z-test but can only be used for nested models (non-nested need AIC/BIC)
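a minimal sketch of a likelihood ratio test between nested glm() models (m0 is a null model; m1 is the hypothetical model fitted above):
m0 <- glm(outcome ~ 1, data = dat, family = binomial)   # null model with no predictors
anova(m0, m1, test = "Chisq")                           # difference in deviance plus p-value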
z-test
tests the statistical significance of predictors (can be prone to type 2 errors)
z = β / SE(β)
power analysis
power is the probability of CORRECTLY detecting an effect that exists - tells us what percentage of the time we would reject the null when the null is actually false
power = 1 - β (NOT THE SAME β AS IN A REGRESSION)
power depends on:
- sample size
- effect size
- significance level
conventional value for power
0.8
power calculations in R
use the pwr package
examples:
t test
pwr.t.test(n = group size, d = effect size, sig.level = 0.05, power = 0.8, type = "two.sample", alternative = "greater")
- this is just an example so values may differ; in practice one of n / d / power / sig.level is left out so that the function solves for it
correlation
pwr.r.test
- basically the same as above but d becomes r (the correlation coefficient)
f-tests
pwr.f2.test(u = k, v = n - k - 1, f2 = effect size, sig.level = 0.05, power = 0.8)
- again just an example, so there will be actual numbers where I've just put general symbols
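a runnable version of the pwr examples, leaving n (or v) out so the function solves for the required sample size (the effect sizes are just illustrative):
library(pwr)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8, type = "two.sample")   # solves for n per group
pwr.r.test(r = 0.3, sig.level = 0.05, power = 0.8)                        # n for a correlation
pwr.f2.test(u = 3, f2 = 0.15, sig.level = 0.05, power = 0.8)              # solves for v, the residual df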
what is causality??
one event directly leads to another
- this does not have to be a direct 1:1 relationship
conditions for causality
- covariance = two variables change together
- plausibility = does the relationship make sense
- temporal precedence = if A causes B then A must always occur before B
- no reasonable alternative other than A causes B
testing causality
identifying causal relationships is often possible through study design rather than statistical tests - it is harder to do this with observational studies but we can use:
- propensity score matching (simulated control group)
- instrumental variable analysis (simulates the effect of randomly assigning people to groups)
… to make causal claims from observational data
endogeneity
a condition that affects our ability to make a causal claim.
- theoretically = occurs when the marginal distribution of a predictor variable is not independent of the conditional distribution of the outcome variable given the predictor variable
- practically = occurs when a predictor variable is correlated with the error term (causing bias in our β coefficients)
problems with endogeneity
- can't easily tell if our variables are endogenous (i.e. whether x and the error term are correlated)
- even if you successfully identify endogeneity in your model you must determine why it is there to solve the problem
sources of endogeneity: simultaneity bias
causality goes both ways (x causes y, y causes x)
- solution = use statistical models developed specifically for this (e.g. two-stage least squares regression)
sources of endogeneity: omitted/confounding variables
when x is correlated with an omitted variable (z) that also predicts y, the variance explained by z is absorbed into the residual error, biasing the estimate for x
- solution = ensure all potential confounds are measured and included in the model
sources of endogeneity: measurement error
instead of measuring x, you measure x* (x with error included)
- solution = careful planning and study design
interpolation
predicting a value from a model within the range of given data points
e.g. if your data spans 10 - 50 and you use it to predict someone with a value of 35
extrapolation
using a model to predict a value outside of the range of given data
e.g. if your data spans 10 - 50 and you use it to predict someone with a value of 60 or 5
- need to take caution when using extrapolation: since we don't have data points on both sides of our predicted value, we don't know for sure the relationship follows the same (e.g. linear) pattern outside the observed range
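a small sketch of inter- vs extrapolation with predict() (the linear model and the 10 - 50 range mirror the example above; the data are made up):
set.seed(1)
dat2 <- data.frame(x = runif(50, 10, 50))
dat2$y <- 2 + 0.5 * dat2$x + rnorm(50)
fit <- lm(y ~ x, data = dat2)
predict(fit, newdata = data.frame(x = 35))   # interpolation: 35 lies inside the 10-50 range
predict(fit, newdata = data.frame(x = 60))   # extrapolation: 60 lies outside, interpret with caution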
issues with missing data
- loss of efficiency due to smaller n
- bias (i.e. incorrect estimates)
types of missing data: MAR
missing at random
- missingness is related to other observed variables in the model but not to the (unobserved) value of X itself
“when the probability of missing data on variable X is related to other variables in the model but not the value of X itself”
challenge = no way to confirm that the missingness really is unrelated to the missing values themselves
types of missing data: MCAR
missing completely at random
- genuinely random missingness, no relation between x/any other variable with the missingness of x
- affects all levels of our data equally/without bias
types of missing data: MNAR
missing not at random
“when the probability of missingness on x is related to the values of x itself”
challenge = no way to verify MNAR without knowledge of the missing values
methods of dealing with missing data: deletion methods
listwise deletion = delete everyone from the analysis with missing data
- NOT recommended - gives biased results
pairwise deletion = uses cases available for each analysis = different cases contribute to different correlation matrices
- NOT recommended (but doesn't reduce power as much as listwise)
methods of dealing with missing data: imputation methods
mean imputation = replace missing values with the mean of that variable
- NOT recommended - artificially reduces variability and is biased (probably the worst method)
regression imputation = replace missing values with their predicted values from regression model
- 'normal' vs stochastic (stochastic adds a residual term to overcome the loss of variance)
multiple imputation (MI) = imputes missing data several times to create complete data sets (results are pooled to get parameter estimates and SEs)
- recommended if data is likely to be MAR
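a minimal sketch of multiple imputation using the mice package (not named in the notes, but a common choice; airquality is a built-in dataset with missing values):
library(mice)
imp <- mice(airquality, m = 5, seed = 1)          # impute the data 5 times
fits <- with(imp, lm(Ozone ~ Solar.R + Wind))     # fit the model in each imputed dataset
summary(pool(fits))                               # pool estimates and SEs across imputations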
methods of dealing with missing data: maximum likelihood estimation (MLE)
estimation method = make use of all model information to arrive at the parameter estimate ‘as if’ the data was complete
- recommended if data likely to be MAR or MCAR
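as an illustration (not from the notes), full-information maximum likelihood can be requested in the lavaan package via missing = "fiml"; the model below is just a simple regression on the built-in airquality data:
library(lavaan)
model <- 'Ozone ~ Solar.R + Wind'                      # regression written in lavaan syntax
fit <- sem(model, data = airquality,
           missing = "fiml", fixed.x = FALSE)          # uses all available data under MAR/MCAR
summary(fit)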
methods of dealing with missing data: methods for MNAR
selection models = combine a model predicting the missingness with the analysis model of interest, adjusting the parameter estimates for the missing-data mechanism
- often gives worse results than MLE or MI
pattern mixture model = stratifies the sample according to different missing data patterns and estimates the substantive model in each subgroup
- relies on strong, untestable assumptions
- good to include as part of a sensitivity analysis but often better to use MLE or MI
exploratory analysis
used when we are interested in the relationship between variables but don’t have clear predictions about how they’re related/how to test them.
exploratory analysis can take many forms but they share the common fact that the researcher doesn't have specific predictions about the IV and DV
it is just done to learn about your data:
- focus on minimising prediction error
- data sets must be large enough to split into training and test data
- estimate prediction error/assess model performance
- control bias-variance trade off
overfitting
= the tendency for statistical models to fit sample specific noise as if it were signal
since noise is random, fitting a model to noise makes it bad at predicting a new dataset
training data
= the data we ‘train’ our model with (the data used to fit the model line)
test data
= data we use to test how well our trained model can predict
p-hacking
= special (bad) case of overfitting that takes place prior to or in parallel with model estimation e.g. choosing which analysis to report (if data doesn’t fit, just remove it/ stargazing)
what is bias
the tendency for a model to consistently produce answers that are wrong in a particular direction
what is variance
the extent to which a model’s fitted parameters will tend to deviate from their central tendency across different datasets
bias-variance trade off
ideally we want low variance, low bias but that is rare in science so we make trade offs
- low bias, high variance = flexible data analysis (almost any pattern can be detected which can be risky) = exploratory data analysis
- high bias, low variance = strict adherence to a fixed set of procedures (limited range of patterns identified which is good) = confirmatory data analysis
cross-validation
cross validation - various techniques involved in testing and training a model on different samples of data
canonical cross validation = classical replication (where a model is trained on a dataset and tested on a completely different and independent dataset)
k-folding
used to test our model when it is not possible to collect new datasets - we recycle our original dataset.
k = number of folds (typical number is 10)
procedure:
- collect data e.g. for 100 participants
- use 90 people to train your model and then test the model on predicting the remaining 10 = one fold
- repeat this until everyone’s data has been used to both test and train models
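a bare-bones 10-fold cross-validation loop in R (made-up data and a simple linear model; any dataset and model would do):
set.seed(1)
dat3 <- data.frame(x = rnorm(100))
dat3$y <- 1 + 2 * dat3$x + rnorm(100)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(dat3)))   # assign each row to a fold
mse <- numeric(k)
for (i in 1:k) {
  train <- dat3[folds != i, ]                        # train on 9 folds
  test  <- dat3[folds == i, ]                        # hold out the remaining fold
  fit   <- lm(y ~ x, data = train)
  mse[i] <- mean((test$y - predict(fit, newdata = test))^2)
}
mean(mse)                                            # cross-validated prediction error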
confirmatory research
characterised by the fact that you specify prior to data collection the exact statistical analyses you intend to run and your expectations about the relationship between variables
mean squared error
used to assess model fit
MSE = (1/n) × Σ(observed y - estimated y)² - the differences are squared to avoid negative numbers, summed over all observations, and then multiplied by 1/n
the bigger the difference between the model estimate and the observed value, the higher MSE will be, indicating a worse model
HOWEVER MSE is heavily influenced by outliers in the data, which sometimes leads researchers to choose other measures such as mean absolute error instead
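computed by hand in R (observed and predicted here are placeholder values standing in for whatever your model gives):
observed  <- c(3, 5, 7, 9)
predicted <- c(2.5, 5.5, 6, 10)
mean((observed - predicted)^2)    # MSE = (1/n) * sum of squared errors
mean(abs(observed - predicted))   # mean absolute error, less sensitive to outliers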
poisson regression
only briefly touched on this
made specifically for count outcomes - values that cannot go below 0 and follow a count distribution
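a minimal sketch of Poisson regression with glm() (the count outcome and data frame are made up for illustration):
set.seed(1)
dat4 <- data.frame(x = rnorm(100))
dat4$count <- rpois(100, lambda = exp(0.5 + 0.3 * dat4$x))   # hypothetical count outcome
m_pois <- glm(count ~ x, data = dat4, family = poisson)
summary(m_pois)   # coefficients are on the log scale; exp() gives rate ratios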