General Flashcards
What is multicollinearity and why is it a problem?
When predictors are correlated with each other.
Problem - part of the variance that one predictor explains is already explained by the other predictors, so the estimates are confounded and the unique contribution of each predictor cannot be cleanly separated. Without collinearity, the total explained variance of the dependent variable equals the sum of the variances uniquely explained by the separate predictors.
It affects both estimation and explanation - it inflates standard errors and can even change the sign of coefficients. Interpreting coefficients and shared variance becomes harder.
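One common way to quantify the problem is the variance inflation factor (VIF). The card above does not name it, so it is added here only as an illustration. This is a minimal sketch for the two-predictor case, where VIF = 1 / (1 - r^2) and r is the correlation between the two predictors. The function names and the toy data are hypothetical.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Toy data, made up for illustration
vif_uncorrelated = vif_two_predictors([1, 2, 3, 4], [1, -1, -1, 1])
vif_collinear = vif_two_predictors([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.9, 5.1])

print(vif_uncorrelated, vif_collinear)
```

A VIF near 1 means the predictor shares almost no variance with the other one. A frequently quoted rule of thumb treats values above 10 as a sign of serious multicollinearity, though the cut-off is a convention, not a law.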
Why multiple testing is problematic
Running multiple tests increases the risk of Type I errors, i.e. the risk of finding a false positive just by chance. At alpha = 0.05, each test has a 5% chance of showing a significant result even when the null hypothesis is true, and the probability of at least one such error grows with the number of tests. Corrections such as Bonferroni or Sidak handle this problem.
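The growth of the error rate can be made concrete. The sketch below assumes the tests are independent, in which case the probability of at least one false positive across m tests is 1 - (1 - alpha)^m. The function names are illustrative.

```python
def familywise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_alpha(alpha, m):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / m

def sidak_alpha(alpha, m):
    """Per-test significance level under the Sidak correction."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

alpha, m = 0.05, 10
fwer_uncorrected = familywise_error_rate(alpha, m)   # about 0.40
fwer_bonferroni = familywise_error_rate(bonferroni_alpha(alpha, m), m)
fwer_sidak = familywise_error_rate(sidak_alpha(alpha, m), m)

print(fwer_uncorrected, fwer_bonferroni, fwer_sidak)
```

With ten uncorrected tests the chance of at least one false positive is roughly 40%. Bonferroni pulls it slightly below 5% (it is a little conservative), while Sidak brings it back to 5% under the independence assumption.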
Confidence Interval - Interpretation
- we are 95% confident that the true value lies within this interval
- if we repeatedly drew random samples and computed a confidence interval from each, 95% of those intervals would contain the true population value
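The repeated-sampling interpretation can be checked with a small simulation. This sketch assumes a normal population with a known standard deviation, so a z-interval can be used. The population values, sample size and seed are arbitrary choices for illustration.

```python
import math
import random
from statistics import NormalDist, mean

def coverage(true_mean=10.0, sigma=2.0, n=30, reps=2000, level=0.95, seed=1):
    """Share of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for 95%
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        m = mean(sample)
        if m - half_width <= true_mean <= m + half_width:
            hits += 1
    return hits / reps

share_covering = coverage()
print(share_covering)
```

The printed share should land close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes the procedure over many samples.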
Significance testing for coefficients
We need to be sure that the relationship between the variables is not specific to one sample but is most likely generalizable. Significance testing is the process of assessing whether the observed data provide enough evidence to reject the null hypothesis. We use specific test statistics and p-values to check the effect.
t test - comparing means of continuous variables
F test - Comparing variances or testing model/group significance
Z test - Comparing means or proportions (large samples)
Chi square - for categorical data
Statistical power - definition and 3 determinants
1 - P(Type II error), i.e. 1 - beta: the probability of detecting a true effect when it exists.
depends on:
the sample size - as sample size increases, power increases
significance level - as alpha increases, power increases (at the cost of more Type I errors)
the effect size (e.g. Cohen's d) - larger effects are easier to detect, so power increases with effect size
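The direction of each determinant can be verified numerically. The sketch below uses the normal approximation for a two-sided one-sample z-test, which is an assumption made here for simplicity. The function name and the example values are illustrative.

```python
import math
from statistics import NormalDist

def z_test_power(d, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.

    d is the standardized effect size (Cohen's d), n the sample size.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n)
    # Probability of landing in either rejection region under the alternative
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

base = z_test_power(d=0.5, n=30, alpha=0.05)
bigger_sample = z_test_power(d=0.5, n=60, alpha=0.05)
bigger_alpha = z_test_power(d=0.5, n=30, alpha=0.10)
bigger_effect = z_test_power(d=0.8, n=30, alpha=0.05)

print(base, bigger_sample, bigger_alpha, bigger_effect)
```

Each of the three variations yields higher power than the baseline. With d = 0 the function returns alpha, since the only way to reject is a Type I error.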
Standard Error
How much the sample mean is expected to deviate from the population mean.
SE = SD / √n
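A minimal sketch of the calculation, using the sample standard deviation (n - 1 in the denominator). The scores are made-up example data.

```python
import math
from statistics import stdev

def standard_error(sample):
    """Standard error of the mean: sample SD divided by the square root of n."""
    return stdev(sample) / math.sqrt(len(sample))

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data
se = standard_error(scores)
print(round(se, 4))
```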
Reproducibility Crisis
Replication of results is a crucial part of science. The reproducibility crisis refers to the problem that some research findings are hard to reproduce. One possible cause is multiple testing, which increases Type I errors.
Missing Data - 4 steps
missing data - when you don’t have valid values for one or more variables
1. identifying the type of missing data - ignorable/nonignorable
2. identifying the extent of missing data
3. diagnosing the randomness of missing data
4. selecting the imputation method
Randomness of Missing Data
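- MCAR - missing completely at random - missingness is unrelated to any variable
- MAR - missing at random - y values are missing because of x (another, observed variable)
- MNAR - missing not at random - y values are missing because of y itself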
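The three mechanisms can be simulated to see how each affects the observed mean of y. This is a sketch on synthetic data; the cut-offs, missingness rate and seed are arbitrary assumptions.

```python
import random
from statistics import mean

rng = random.Random(42)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]   # y depends on x

# MCAR: every y has the same 30% chance of being missing
y_mcar = [yi for yi in y if rng.random() >= 0.3]
# MAR: y is missing whenever the observed x is large
y_mar = [yi for xi, yi in zip(x, y) if xi <= 0.5]
# MNAR: y is missing whenever y itself is large
y_mnar = [yi for yi in y if yi <= 0.5]

full_mean = mean(y)
mcar_mean = mean(y_mcar)
mar_mean = mean(y_mar)
mnar_mean = mean(y_mnar)

print(full_mean, mcar_mean, mar_mean, mnar_mean)
```

Under MCAR the observed mean stays close to the full-data mean. Under MAR and MNAR it is biased downward in this setup. The difference is that under MAR the cause of missingness (x) is observed and can be used to correct the bias, while under MNAR it is not.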
Levels of Missingness
item level
construct level
person level
MAR or MCAR?
- check the distribution
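- t-test for missingness - compare cases with and without missing y on the other variables; a significant difference points to MAR rather than MCAR
- Little's MCAR test - H0: missing values are completely at random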
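The t-test for missingness can be sketched as follows: split the cases by whether y is missing and compare the two groups on an observed variable x. The example computes only Welch's t statistic, since a p-value needs the t distribution, which the Python standard library does not provide. Little's test is not implemented here. The data are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for the difference between two group means."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Toy data: x is fully observed, y has gaps (None)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.1, 2.7, 4.0, 3.6, None, None, None, None]

x_when_y_missing = [xi for xi, yi in zip(x, y) if yi is None]
x_when_y_observed = [xi for xi, yi in zip(x, y) if yi is not None]

t_stat = welch_t(x_when_y_missing, x_when_y_observed)
print(round(t_stat, 4))
```

A large absolute t value means cases with missing y differ systematically on x, which is evidence against MCAR.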
Imputation for MCAR
Handling the missing values using the other valid values: deletion, replacement by already known values (cold deck or hot deck), substitution by the mean, or prediction (regression).
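Two of these options, mean substitution and regression imputation, can be sketched in a few lines. The helper names and the toy data are hypothetical, and the regression uses a single predictor for brevity.

```python
from statistics import mean

def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def regression_impute(x, y):
    """Fill gaps in y with predictions from a simple linear regression on x."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mx = mean(p[0] for p in pairs)
    my = mean(p[1] for p in pairs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in pairs)
    sxx = sum((xi - mx) ** 2 for xi, _ in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * xi if yi is None else yi
            for xi, yi in zip(x, y)]

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, None]   # last value is missing

y_mean_filled = mean_impute(y)
y_reg_filled = regression_impute(x, y)
print(y_mean_filled[-1], y_reg_filled[-1])
```

Mean substitution ignores the relationship between x and y and pulls the filled value toward the center, while regression imputation follows the observed trend.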
Imputation Methods and Extent of Missing Data
under 10% - any of them
under 20% - MCAR: regression or hot deck. MAR: model-based
over 20% - imputation is crucial: regression for MCAR, model-based for MAR
Imputation methods for MAR
- Maximum Likelihood
- Expectation-Maximization (EM)
- Multiple Imputation
Types of Outliers
- Error - due to inaccuracies in data collection - correct or delete
- Interesting
- Influential - identified at the post-analysis stage