advanced topics Flashcards
logistic regression
predicts the probability of y P(y) from our xs
P(yi) = 1 / (1 + e^-(β0 + β1xi)) - i.e. one over one plus the exponential of minus the linear predictor
what is probability
range from 0 to 1
binary outcomes
binary variables = type of categorical variable with only two levels
we code them 0 and 1 in terms of whether an event did or did not happen - this is NOT the same as dummy coding
what are odds
odds of an event occurring = the ratio of it occurring : it not occurring
odds can only ever be a positive value
odds = probability/(1-probability)
what are log odds
natural log of the odds - when plotted, the log odds are linear and form a continuous DV
logodds = ln[P(y=1) / (1 - P(y=1))]
log odds above +4 correspond to probabilities of roughly 100%, and log odds below -4 to roughly 0% - since 0 is in the middle of these, log odds of 0 = 50%
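a minimal R sketch of converting between probability, odds and log odds (the 0.75 is just an example value; qlogis()/plogis() are the base R log-odds/inverse functions):
p <- 0.75             # example probability
odds <- p / (1 - p)   # odds = 3
log(odds)             # log odds by hand
qlogis(p)             # same log odds via base R
plogis(log(odds))     # back to the probability (0.75)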
maximum likelihood estimation
MLE is used to estimate logistic regression models as MLE finds the logistic regression coefficients that maximise the likelihood of the observed data having occurred.
MLE maximises the log-likelihood (a higher log-likelihood indicates a better model)
evaluating logistic regression models
compare our model to a null model (with no predictors) and assess the improvement in fit
we compare our model to our baseline model using deviance
- deviance = -2 * loglikelihood (aka -2LL)
we calculate the difference in deviances between our model and the baseline and compare it against a chi-square distribution to get a p-value and assess significance
generalised linear model
in R this is the glm() function used to conduct logistic regression - it uses the same format as lm() but with the addition of family = " " to determine what kind of regression we want / how the outcome is distributed (family = binomial for logistic regression)
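a minimal sketch of fitting a logistic regression with glm() (the data frame dat and the variables outcome/predictor are made up for illustration):
set.seed(1)
dat <- data.frame(predictor = rnorm(100))
dat$outcome <- rbinom(100, 1, plogis(-0.5 + 0.8 * dat$predictor))  # hypothetical 0/1 outcome
m1 <- glm(outcome ~ predictor, data = dat, family = binomial)
summary(m1)   # coefficients are on the log-odds scale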
binomial distribution
a discrete probability distribution
probability mass function
probability that a discrete random variable is exactly equal to some value
f(k; n, p) = Pr(X = k) = (n choose k) × p^k × q^(n-k)
where:
- k = number of successes
- n = number of trials
- p = probability of successes
- q = probability of failure (1-p)
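a quick check of the formula in R (the probability of exactly 7 successes in 10 trials with p = 0.5, values just illustrative):
n <- 10; k <- 7; p <- 0.5
choose(n, k) * p^k * (1 - p)^(n - k)   # formula written out by hand
dbinom(k, size = n, prob = p)          # same value from base R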
interpreting glm() output
computation of residuals is different now that we're dealing with deviance (rather than variance) - a model with less residual deviance is better.
our β coefficients for the IVs are the change in log odds of y for each unit increase in x
what is odds ratio
log odds don't provide easily interpretable results, therefore the β coefficients (on the log-odds scale) are converted to odds ratios, which are easier to interpret.
odds ratio is obtained by exponentiating the β coefficients
interpreting odds ratio
1 = no effect (the odds are unchanged)
<1 = negative effect - e.g. 0.8 = decrease in odds
>1 = positive effect - e.g. 1.2 = increase in odds
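in R the odds ratios come from exponentiating the glm() coefficients (m1 here is the hypothetical model from the glm() sketch above):
exp(coef(m1))      # odds ratios for the intercept and predictor
exp(confint(m1))   # 95% confidence intervals on the odds-ratio scale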
likelihood ratio test
method of logistic model comparison = tests whether a fuller model significantly improves the likelihood over a simpler (nested) model
- alternative to z-test but can only be used for nested models (non-nested need AIC/BIC)
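a minimal sketch of a likelihood ratio test between nested glm() models (m0 is a null model; m1 is the hypothetical model fitted above):
m0 <- glm(outcome ~ 1, data = dat, family = binomial)   # null model with no predictors
anova(m0, m1, test = "Chisq")                           # difference in deviance plus p-value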
z-test
tests the statistical significance of predictors (can be prone to type 2 errors)
z = β / SE(β)
power analysis
power is the probability of CORRECTLY detecting an effect that exists - tells us what percentage of the time we would reject the null when the null is actually false
power = 1 - β (NOT THE SAME β AS IN A REGRESSION)
power depends on:
- sample size
- effect size
- significance level
conventional value for power
0.8
power calculations in R
use the pwr package
examples:
t test
pwr.t.test(n = group size, d = effect size, sig.level = 0.05, power = 0.8, type = "two.sample", alternative = "greater")
- this is just an example so values may differ; in practice one of n / d / power / sig.level is left out so that the function solves for it
correlation
pwr.r.test
- basically the same as above but d becomes r (the correlation coefficient)
f-tests
pwr.f2.test(u = k, v = n - k - 1, f2 = effect size, sig.level = 0.05, power = 0.8)
- again just an example, so there will be actual numbers where I've just put general symbols
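a runnable version of the pwr examples, leaving n (or v) out so the function solves for the required sample size (the effect sizes are just illustrative):
library(pwr)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8, type = "two.sample")   # solves for n per group
pwr.r.test(r = 0.3, sig.level = 0.05, power = 0.8)                        # n for a correlation
pwr.f2.test(u = 3, f2 = 0.15, sig.level = 0.05, power = 0.8)              # solves for v, the residual df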
what is causality??
one event directly leads to another
- this does not have to be a direct 1:1 relationship
conditions for causality
- covariance = two variables change together
- plausibility = does the relationship make sense
- temporal precedence = if A causes B then A must always occur before B
- no reasonable alternative other than A causes B
testing causality
identifying causal relationships is often possible through study design rather than statistical tests - it is harder to do this with observational studies but we can use:
- propensity score matching (simulated control group)
- instrumental variable analysis (simulates the effect of randomly assigning people to groups)
… to make causal claims from observational data
endogeneity
a condition that affects our ability to make a causal claim.
- theoretically = occurs when the marginal distribution of a predictor variable is not independent of the conditional distribution of the outcome variable given the predictor variable
- practically = occurs when a predictor variable is correlated with the error term (causing bias in our β coefficients)
problems with endogeneity
- can't easily tell if our variables are endogenous (i.e. whether x and the error term are correlated)
- even if you successfully identify endogeneity in your model you must determine why it is there to solve the problem
sources of endogeneity: simultaneity bias
causality goes both ways (x causes y, y causes x)
- solution = use statistical models developed specifically for this (e.g. two-stage least squares regression)
sources of endogeneity: omitted/confounding variables
when x is correlated with an omitted variable (z) that also predicts y, the variance explained by z is absorbed into the residual error, biasing the estimate for x
- solution = ensure all potential confounds are measured and included in the model
sources of endogeneity: measurement error
instead of measuring x, you measure x* (x with error included)
- solution = careful planning and study design
interpolation
predicting a value from a model within the range of given data points
e.g. if your data spans 10 - 50 and you use it to predict someone with a value of 35
extrapolation
using a model to predict a value outside of the range of given data
e.g. if your data spans 10 - 50 and you use it to predict someone with a value of 60 or 5
- need to take caution when using extrapolation: since we don't have data points on both sides of our predicted value, we don't know for sure the relationship follows the same (e.g. linear) pattern outside the observed range
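a small sketch of inter- vs extrapolation with predict() (the linear model and the 10 - 50 range mirror the example above; the data are made up):
set.seed(1)
dat2 <- data.frame(x = runif(50, 10, 50))
dat2$y <- 2 + 0.5 * dat2$x + rnorm(50)
fit <- lm(y ~ x, data = dat2)
predict(fit, newdata = data.frame(x = 35))   # interpolation: 35 lies inside the 10-50 range
predict(fit, newdata = data.frame(x = 60))   # extrapolation: 60 lies outside, interpret with caution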
issues with missing data
- loss of efficiency due to smaller n
- bias (i.e. incorrect estimates)
types of missing data: MAR
missing at random
- missingness is related to other observed variables in the model but not to the (unobserved) value of X itself
“when the probability of missing data on variable X is related to other variables in the model but not the value of X itself”
challenge = no way to confirm that the missingness really is unrelated to the missing values themselves
types of missing data: MCAR
missing completely at random
- genuinely random missingness, no relation between x/any other variable with the missingness of x
- affects all levels of our data equally/without bias
types of missing data: MNAR
missing not at random
“when the probability of missingness on x is related to the values of x itself”
challenge = no way to verify MNAR without knowledge of the missing values
methods of dealing with missing data: deletion methods
listwise deletion = delete everyone from the analysis with missing data
- NOT recommended - gives biased results
pairwise deletion = uses cases available for each analysis = different cases contribute to different correlation matrices
- NOT recommended (but doesn't reduce power as much as listwise)
methods of dealing with missing data: imputation methods
mean imputation = replace missing values with the mean of that variable
- NOT recommended - artificially reduces variability and is biased (probably the worst method)
regression imputation = replace missing values with their predicted values from regression model
- 'normal' vs stochastic (stochastic adds a residual term to overcome the loss of variance)
multiple imputation (MI) = imputes missing data several times to create complete data sets (results are pooled to get parameter estimates and SEs)
- recommended if data is likely to be MAR
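a minimal sketch of multiple imputation using the mice package (not named in the notes, but a common choice; airquality is a built-in dataset with missing values):
library(mice)
imp <- mice(airquality, m = 5, seed = 1)          # impute the data 5 times
fits <- with(imp, lm(Ozone ~ Solar.R + Wind))     # fit the model in each imputed dataset
summary(pool(fits))                               # pool estimates and SEs across imputations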
methods of dealing with missing data: maximum likelihood estimation (MLE)
estimation method = make use of all model information to arrive at the parameter estimate ‘as if’ the data was complete
- recommended if data likely to be MAR or MCAR
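as an illustration (not from the notes), full-information maximum likelihood can be requested in the lavaan package via missing = "fiml"; the model below is just a simple regression on the built-in airquality data:
library(lavaan)
model <- 'Ozone ~ Solar.R + Wind'                      # regression written in lavaan syntax
fit <- sem(model, data = airquality,
           missing = "fiml", fixed.x = FALSE)          # uses all available data under MAR/MCAR
summary(fit)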
methods of dealing with missing data: methods for MNAR
selection models = combine a model predicting the missingness with the analysis model of interest, adjusting the parameter estimates for the missing-data mechanism
- often gives worse results than MLE or MI
pattern mixture model = stratifies the sample according to different missing data patterns and estimates the substantive model in each subgroup
- relies on strong, untestable assumptions
- good to include as part of a sensitivity analysis but often better to use MLE or MI
exploratory analysis
used when we are interested in the relationship between variables but don’t have clear predictions about how they’re related/how to test them.
exploratory analysis can take many forms but they share the common fact that the researcher doesn't have specific predictions about the IV and DV
it is just done to learn about your data:
- focus on minimising prediction error
- data sets must be large enough to split into training and test data
- estimate prediction error/assess model performance
- control bias-variance trade off
overfitting
= the tendency for statistical models to fit sample specific noise as if it were signal
since noise is random, fitting a model to noise makes it bad at predicting a new dataset
training data
= the data we ‘train’ our model with (the data used to fit the model line)
test data
= data we use to test how well our trained model can predict
p-hacking
= special (bad) case of overfitting that takes place prior to or in parallel with model estimation e.g. choosing which analysis to report (if data doesn’t fit, just remove it/ stargazing)
what is bias
the tendency for a model to consistently produce answers that are wrong in a particular direction
what is variance
the extent to which a model’s fitted parameters will tend to deviate from their central tendency across different datasets
bias-variance trade off
ideally we want low variance, low bias but that is rare in science so we make trade offs
- low bias, high variance = flexible data analysis (almost any pattern can be detected which can be risky) = exploratory data analysis
- high bias, low variance = strict adherence to a fixed set of procedures (limited range of patterns identified which is good) = confirmatory data analysis
cross-validation
cross validation - various techniques involved in testing and training a model on different samples of data
canonical cross validation = classical replication (where a model is trained on a dataset and tested on a completely different and independent dataset)
k-folding
used to test our model when it is not possible to collect new datasets - we recycle our original dataset.
k = number of folds (typical number is 10)
procedure:
- collect data e.g. for 100 participants
- use 90 people to train your model and then test the model on predicting the remaining 10 = one fold
- repeat this until everyone’s data has been used to both test and train models
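a bare-bones 10-fold cross-validation loop in R (made-up data and a simple linear model; any dataset and model would do):
set.seed(1)
dat3 <- data.frame(x = rnorm(100))
dat3$y <- 1 + 2 * dat3$x + rnorm(100)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(dat3)))   # assign each row to a fold
mse <- numeric(k)
for (i in 1:k) {
  train <- dat3[folds != i, ]                        # train on 9 folds
  test  <- dat3[folds == i, ]                        # hold out the remaining fold
  fit   <- lm(y ~ x, data = train)
  mse[i] <- mean((test$y - predict(fit, newdata = test))^2)
}
mean(mse)                                            # cross-validated prediction error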
confirmatory research
characterised by the fact that you specify prior to data collection the exact statistical analyses you intend to run and your expectations about the relationship between variables
mean squared error
used to assess model fit
MSE = (1/n) × Σ(observed y - estimated y)² - the differences are squared to avoid negative numbers, summed over all observations, and then multiplied by 1/n
the bigger the difference between the model estimate and the observed value, the higher MSE will be, indicating a worse model
HOWEVER MSE is heavily influenced by outliers in the data, which sometimes leads researchers to choose other measures such as mean absolute error instead
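computed by hand in R (observed and predicted here are placeholder values standing in for whatever your model gives):
observed  <- c(3, 5, 7, 9)
predicted <- c(2.5, 5.5, 6, 10)
mean((observed - predicted)^2)    # MSE = (1/n) * sum of squared errors
mean(abs(observed - predicted))   # mean absolute error, less sensitive to outliers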
poisson regression
only briefly touched on this
made specifically for count outcomes - values that cannot go below 0 and follow a count distribution
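a minimal sketch of Poisson regression with glm() (the count outcome and data frame are made up for illustration):
set.seed(1)
dat4 <- data.frame(x = rnorm(100))
dat4$count <- rpois(100, lambda = exp(0.5 + 0.3 * dat4$x))   # hypothetical count outcome
m_pois <- glm(count ~ x, data = dat4, family = poisson)
summary(m_pois)   # coefficients are on the log scale; exp() gives rate ratios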