Biostats test 3 Flashcards
null hypothesis for chi squared test
H0: no association, variables are independent
How does H0 translate to cell frequencies?
Cell counts are proportional to the marginal (row and column) totals.
Formula for expected frequencies
E = (row total x column total) / n
What does the chi squared test measure
If the differences between observed and expected frequencies are large enough to reject H0
DF for a 2x2 table
1
Formula for DF of a cross-table
df = (rows – 1) x (columns – 1)
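A minimal sketch of the full test on a made-up 2x2 table, using scipy's chi2_contingency (all counts are hypothetical); the function returns the statistic, the p-value, the df from the formula above, and the expected frequencies:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = exposure (yes/no), columns = outcome (yes/no)
observed = np.array([[30, 20],
                     [10, 40]])

chi2, p, df, expected = chi2_contingency(observed, correction=False)
print(chi2, p, df)   # df = (2 - 1) x (2 - 1) = 1
print(expected)      # each cell: row total x column total / n
```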
What do we use to test effect size of chi squared test
Phi for a 2x2 table, Cramer’s V for a larger cross-table. Only do this if the test is significant; the effect size then informs you about the strength of the association.
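A short sketch of the effect size (reusing the made-up table from above); Cramer's V reduces to phi for a 2x2 table:

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1))); equals phi for 2x2."""
    table = np.asarray(table)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))

print(cramers_v([[30, 20], [10, 40]]))  # ~0.41 for these hypothetical counts
```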
Chi squared goodness of fit test
Determines whether the distribution of observed frequency counts differs from some other distribution
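A sketch with made-up data (H0: a fair die, so each face is expected 10 times in 60 rolls), using scipy's chisquare:

```python
from scipy.stats import chisquare

observed = [8, 12, 9, 11, 6, 14]               # hypothetical counts for faces 1-6
stat, p = chisquare(f_obs=observed, f_exp=[10] * 6)
print(stat, p)                                 # df = k - 1 = 5
```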
Odds formula
P(event) / P(non-event), i.e., P(event) / (1 – P(event))
Risk formula
number with the event (cases) / total number in the group at risk
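A tiny worked example with made-up counts, also showing the link odds = risk / (1 – risk):

```python
# Hypothetical group: 20 events among 100 people
events, total = 20, 100

risk = events / total              # 0.20 = events / total at risk
odds = events / (total - events)   # 0.25 = P(event) / P(non-event)
assert abs(odds - risk / (1 - risk)) < 1e-12
```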
Sensitivity
the proportion of positives that are correctly identified as such (so sick people being diagnosed as having the condition)
Specificity
the proportion of negatives correctly identified as such (healthy people being diagnosed as not having the condition)
Prevalence
The number of cases of a disease, number of infected people, or number of people with some other attribute present during a particular interval of time
Sensitivity formula
TP / (TP + FN)
Specificity formula
TN / (FP + TN)
PPV definition
The likelihood that a person who has a positive test result does have the disease, condition, biomarker, or mutation (change) in the gene being tested
PPV formula
PPV = TP / (TP + FP)
NPV formula
NPV = TN / (FN + TN)
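Putting the four formulas together on a hypothetical diagnostic 2x2 table:

```python
# Made-up counts: test result (rows) versus true disease status (columns)
TP, FP, FN, TN = 90, 40, 10, 860

sensitivity = TP / (TP + FN)   # 0.90: positives among the truly diseased
specificity = TN / (FP + TN)   # ~0.96: negatives among the truly healthy
ppv = TP / (TP + FP)           # ~0.69: P(disease | positive test)
npv = TN / (FN + TN)           # ~0.99: P(no disease | negative test)
```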
Bayes
demonstrated how prior probabilities may affect estimated probabilities for events.
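A worked sketch of the idea: with a made-up test (sensitivity 0.90, specificity 0.95) and a low prior (prevalence 1%), Bayes' rule shows that most positive results are still false positives:

```python
sens, spec, prev = 0.90, 0.95, 0.01   # all values hypothetical

# Bayes: P(disease | +) = P(+ | disease) * P(disease) / P(+)
p_pos = sens * prev + (1 - spec) * (1 - prev)
ppv = sens * prev / p_pos
print(round(ppv, 3))   # ~0.154: the prior (prevalence) dominates the posterior
```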
Cohen’s kappa
measures inter/intrarater reliability
Cohen’s kappa formula
2(ad – bc) / (r1 x c2 + r2 x c1), for a 2x2 table with cells a, b (top row) and c, d (bottom row), row totals r1, r2 and column totals c1, c2
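A sketch with made-up ratings, checking that this shortcut matches the usual definition kappa = (p_o – p_e) / (1 – p_e):

```python
# Hypothetical 2x2 agreement table between two raters: cells a, b / c, d
a, b, c, d = 40, 5, 10, 45
n = a + b + c + d
r1, r2 = a + b, c + d          # row totals
c1, c2 = a + c, b + d          # column totals

kappa_shortcut = 2 * (a * d - b * c) / (r1 * c2 + r2 * c1)

p_o = (a + d) / n                    # observed agreement
p_e = (r1 * c1 + r2 * c2) / n**2     # agreement expected by chance
kappa = (p_o - p_e) / (1 - p_e)
assert abs(kappa - kappa_shortcut) < 1e-12   # both give 0.70 here
```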
What is regression analysis used for
Predict values of an outcome variable on the basis of other variables. Aim is to build a model that describes variability in a dependent variable (Y) as a function of one or more independent (X) variables: Yi = f(X1i, X2i, …). Causality may very well play a role.
What does pearson’s R squared represent in OLS output
the proportion of total variation that is explained: SSmodel/SStotal
What does an outlier in the Y space do to the correlation coefficient
Adds unexplained variance in Y without changing the spread of X, so it typically weakens (lowers) the correlation
What does an outlier in the X space do to the correlation coefficient
Acts as a high-leverage point: it pulls the fitted line towards itself and, if it lies along the trend, can inflate (boost) the correlation
Simpson’s paradox
a phenomenon in statistics where a trend that appears in several groups of data reverses when the data is combined. In other words, a relationship between two variables that holds within individual groups can disappear or even reverse when the data from those groups is pooled together.
When is Spearman’s rank used
Variables are at the ordinal data level; the relationship is monotonic but non-linear; or outliers might affect Pearson’s r too much
What are model coefficients in a regression equation
the terms in your function that optimally relate predicted values of the dependent variable to observed values, often denoted as b0, b1, b2
What you should look for in your data before you start a regression analysis
Correct mistakes, check for outliers, stratification, non-linearities, … for all possible predictors!
What is each factor in this equation: Yi = b0 + b1X1i + ei
b0 is called the intercept (it is a constant). b1 is the regression coefficient for X1: the (estimated) slope of the best-fitting line in a scatter plot of X versus Y. ei is a prediction error, a residual; it is the difference between a predicted and an observed Y value (for a given X value). Yi = predicted value of Y on the basis of the model + prediction error.
What is H0 for OLS regression
H0: no relation between X1 and Y. b1 = 0, no effect of changes of X1 on Y. r2 = 0, no variance explained by the model
Aim in OLS for finding values
want to find values for b0 and b1 that optimally relate outcomes of Y to values of X1
Predicting Y without any information about X
The mean of Y would be the best guess because it is unbiased. This total prediction error = total sum of squares. Without predictors, SS total = SS error (sum of squares of residuals): all the variability in Y would be unexplained error.
Predicting Y when you do have info about predictors
Adding predictors to your model, you hope to make better predictions.
By adding predictors, you hope that the ratio of SS total to SS error (sum of squares of residuals) improves, i.e., that a larger percentage of the variance in Y can be explained.
The better the model, the smaller the SS error.
‘best fitting’: Least squares criterion (OLS regression)
The ‘best fitting model’ is the one for which SS error reaches a minimum.
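A minimal sketch of the least-squares estimates in closed form, on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical outcome

# b0 and b1 that minimize SS error = sum((y - b0 - b1*x)**2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)   # the e_i, one per observation
```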
Yi
predicted value on basis of model + prediction error
SStotal
Sum of squared difference scores between the Y values and the mean of Y (the total variability in Y around its mean): sum of (Yi - Ybar)^2. SStotal = SSmodel + SSerror. But when nothing is known about the predictors (the so-called null model), SStotal = SSerror
SSerror
sum of the squared differences between the actual observed values of the dependent variable (Y) and the values predicted by the model. (Yi - Yhat)^2
SSmodel
Sum of squared differences between the mean of Y and the predicted value of Y. The amount of variability in Y that is explained by the model. (Y hat - Y bar)^2. Variance explained.
Model fit is proportion of
total variability in Y (SStotal) accounted for by sum of squares model prediction (SSmodel). so SSmodel/SStotal
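A sketch computing the three sums of squares on made-up data and checking SStotal = SSmodel + SSerror:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, 1)                # least-squares slope and intercept
y_hat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)      # sum of (Yi - Ybar)^2
ss_error = np.sum((y - y_hat) ** 2)         # sum of (Yi - Yhat)^2
ss_model = np.sum((y_hat - y.mean()) ** 2)  # sum of (Yhat - Ybar)^2

r_squared = ss_model / ss_total             # model fit
assert abs(ss_total - (ss_model + ss_error)) < 1e-9
```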
A Standardized (Beta) regression coefficient
indicates how many standard deviations the dependent variable changes with a one-standard-deviation change of the predictor
How can beta values be interpreted
in terms of the importance the predictors have in the predictive power of the model.
The larger the |value|, the more influence.
Looking at beta values is useful with multiple regression models: different coefficients may all reach statistical significance, but their impact may differ.
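A sketch (same made-up data as above) showing that beta is b rescaled by the standard deviations, or equivalently the slope you get after z-scoring both variables:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, 1)
beta1 = b1 * x.std(ddof=1) / y.std(ddof=1)   # standardized coefficient

zx = (x - x.mean()) / x.std(ddof=1)          # z-scored predictor
zy = (y - y.mean()) / y.std(ddof=1)          # z-scored outcome
assert abs(np.polyfit(zx, zy, 1)[0] - beta1) < 1e-9
```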
indicator coding
creating dummy variables with values 0 or 1
dummy variables
Dummy variables have values 0 or 1, where 0 means: ‘does not have the property’ and 1 means ‘has the property’. To code k categories, you need k - 1 dummies.
reference category
the original category level that does not get its own dummy variable is the reference category
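A sketch using pandas' get_dummies (level names are made up); with drop_first=True one level gets no dummy and becomes the reference category:

```python
import pandas as pd

# Hypothetical factor with k = 3 levels -> k - 1 = 2 dummies
df = pd.DataFrame({"dose": ["standard", "low", "high", "low", "standard"]})

# 'high' (first level alphabetically) is dropped: it is the reference category
dummies = pd.get_dummies(df["dose"], prefix="dose", drop_first=True, dtype=int)
print(dummies)   # columns dose_low and dose_standard, coded 0/1
```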
Multiple predictors
With two predictors X1 and X2, b1 indicates how Y will change with a unit change of X1, while holding X2 at a constant value;
b2 indicates how much Y changes with a unit change of X2, holding X1 constant
What does multiple regression allow us to estimate
the unique contribution of a predictor Xk to the outcome, given the other X variable(s) in the model. Please note that multiple regression therefore provides a way of adjusting / accounting for potentially confounding variables by including these in the model
B value for a dummy
the estimate of the mean difference on the DV between the dummy level and the ‘uncoded’ reference. For example, b_low tells you how much the mean outcome for “low” differs from the mean outcome for “standard.” If the coefficient is significant, the difference is significant.
Modelling interaction between x1 and x2
create a new variable X1ByX2 which is simply the product of the two original ones
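A one-line sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"X1": [1.0, 2.0, 3.0], "X2": [0.5, 1.5, 2.5]})   # made up
df["X1ByX2"] = df["X1"] * df["X2"]   # product term that models the interaction
```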
synergy
if an interaction term is significant and its b is positive, the predictors have synergy (they strengthen each other)
anti-synergy
If an interaction term is significant and its b is negative, anti-synergy (they weaken each other)
Adjusted R2
R2 may become spuriously high if your model is ‘overspecified’ (i.e., if it has too many predictors relative to the number of cases). Adjusted R2 attempts to compensate for the spurious increase in predictive power. It is therefore more conservative and protects against type I error.
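One standard form of the correction (for n cases and k predictors) is adjusted R2 = 1 - (1 - R2)(n - 1) / (n - k - 1); a small sketch showing how the penalty grows with k:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors fitted to n cases."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# same raw R^2 of .60, but more predictors -> a larger downward correction
print(adjusted_r2(0.60, n=30, k=2))    # ~0.57
print(adjusted_r2(0.60, n=30, k=10))   # ~0.39
```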
Standard error of estimate in model summary
can be interpreted as the average magnitude (in original units of measurement) of the prediction error
steps in evaluation of MLR output
- Check significance of F-test: if p < α reject H0
- Check size of R2 : if large enough (whatever that means, context matters) → relevant model
- Check significance of t-tests for coefficients (for each, if p < α reject H0 )
- Check sign and size of unstandardized (b) coefficients for substantive interpretation (‘how much does Y change with a unit change in X’)
- Check absolute value of standardized (beta) coefficients for relative importance
What does the F-value of multiple regression output tell us
If the F-test is significant (if p < 0.05), it suggests that the model provides a better fit to the data than a model with no predictors.
Sequential analysis
test whether adding predictors leads to a significant improvement of our model.
We want to know whether the R2 change (model 2 versus model 1) is significantly different from zero.
Use the F-change test to answer the question whether the more elaborate model 2 is better than model 1
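A sketch of the F-change computation for nested models (all numbers made up); the statistic is F = ((R2_2 - R2_1) / (k2 - k1)) / ((1 - R2_2) / (n - k2 - 1)) with (k2 - k1, n - k2 - 1) degrees of freedom:

```python
from scipy.stats import f as f_dist

def f_change(r2_1, r2_2, k1, k2, n):
    """F-test for the R^2 change between nested models (model 2 adds predictors)."""
    F = ((r2_2 - r2_1) / (k2 - k1)) / ((1 - r2_2) / (n - k2 - 1))
    p = f_dist.sf(F, k2 - k1, n - k2 - 1)
    return F, p

print(f_change(r2_1=0.40, r2_2=0.48, k1=2, k2=3, n=100))   # F ~ 14.8, p < .001
```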
Homoscedastic model
Magnitudes of the error terms do not depend on the x-value: the spread of errors remains constant across the values of the predictor
Why is homoscedasticity good
If the assumption is not met, the fit of the model (multiple R2) may be overestimated.
In fact, you cannot really speak of ‘the’ fit of the model, because the fit changes with the values of the predictor.
heteroscedasticity
Heteroscedasticity occurs when the residuals have non-constant variance across levels of the independent variable: the spread of errors changes with the values of the predictor.
independence of residuals assumption
the error for yi+1 should not depend on the error for yi
consequences of multicollinearity
At best: the model contains redundant elements.
→ The model is more complex than needed
Worse: coefficient values may change erratically in response to small changes in the model or the data, and/or the ordering of your model-building process (‘case 4’)
→ The model is unstable
→ Model is hard to interpret / unreliable /invalid
multicollinearity indicators
- Significant F-test, but insignificant coefficients for specific IVs? Suspect, but it could be that the IV truly does not relate to the DV
- Large changes in coefficient values when a predictor variable is added or deleted? (‘case 4’, illustrated in the sketch below). Really suspect!!
- IV is significant as single predictor, but insignificant in multiple regression model? Smoking gun!
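A simulation sketch (all data made up) of the ‘case 4’ instability: when a nearly collinear second predictor is added, the individual coefficients become unreliable even though their sum stays well determined:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear with x1
y = 2.0 * x1 + rng.normal(size=n)          # true model uses only x1

b_single, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x1]), y, rcond=None)
b_both, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)
print(b_single)   # slope for x1 close to the true 2.0
print(b_both)     # x1 and x2 split the effect erratically; only their sum is ~2
```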