General Flashcards

1
Q

What is multicollinearity, and why is it a problem?

A

When predictors are correlated with each other.

Problem: part of the variance that one predictor explains is already explained by other predictors, so the unique contribution of each predictor is hard to isolate. Without collinearity, the total explained variance of the dependent variable equals the sum of the variances uniquely explained by the separate predictors.

It affects both estimation and explanation: it inflates standard errors and can even flip the sign of coefficients. Interpreting coefficients and shared variance becomes harder.
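
A minimal sketch of how the standard-error inflation can be quantified with the variance inflation factor (VIF). It assumes a model with only two predictors, in which case VIF = 1 / (1 - r^2), with r the Pearson correlation between them. The data and the function names are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF of either predictor in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical data: x2 is nearly a linear function of x1, x3 is not.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
x3 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]

print(vif_two_predictors(x1, x2))  # very large: strong collinearity
print(vif_two_predictors(x1, x3))  # close to 1: little collinearity
```

The square root of the VIF is the factor by which a coefficient's standard error grows compared with the case of uncorrelated predictors.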

2
Q

Why is multiple testing problematic?

A

Running multiple tests increases the risk of Type I errors, that is, of finding a false positive just by chance. At a significance level of 0.05, a single test has a 5% chance of showing a significant result even when there is no real effect, and the probability of at least one such error grows with every additional test. Corrections such as Bonferroni or Sidak are used to handle this problem.
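
A short sketch of how fast the error rate grows, assuming the tests are independent: the chance of at least one false positive across m tests is 1 - (1 - alpha)^m. The function names are illustrative; the Bonferroni and Sidak formulas are the standard per-test thresholds.

```python
def familywise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_alpha(alpha, m):
    """Per-test threshold under the Bonferroni correction."""
    return alpha / m

def sidak_alpha(alpha, m):
    """Per-test threshold under the Sidak correction (assumes independence)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

for m in (1, 5, 20):
    print(
        m,
        round(familywise_error_rate(0.05, m), 3),
        round(bonferroni_alpha(0.05, m), 5),
        round(sidak_alpha(0.05, m), 5),
    )
```

With 20 tests at alpha = 0.05 the chance of at least one false positive is already about 64%. Bonferroni is slightly more conservative than Sidak.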

3
Q

Confidence Interval - Interpretation

A
  1. we are 95% confident that the true value lies within this interval
  2. if we repeatedly drew random samples and computed a confidence interval from each, about 95% of those intervals would contain the true population value
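
The repeated-sampling interpretation can be checked with a small simulation. This is a sketch under simplifying assumptions: normally distributed data, a known population standard deviation (so a z-interval is used), and made-up parameter values.

```python
import random
from statistics import NormalDist, mean

def coverage(n_samples=2000, n=30, mu=10.0, sigma=2.0, level=0.95, seed=0):
    """Share of simulated confidence intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    half_width = z * sigma / n ** 0.5          # known-sigma z-interval
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        m = mean(sample)
        if m - half_width <= mu <= m + half_width:
            hits += 1
    return hits / n_samples

print(coverage())  # should land near 0.95
```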
4
Q

Significance testing for coefficients

A

We need to be sure that the relationship between the variables is not specific to one sample but is likely to generalize. Significance testing is the process of assessing whether the observed data provide enough evidence to reject the null hypothesis. We use test statistics and p-values to check the effect.
t test - comparing means of continuous variables
F test - comparing variances or testing model/group significance
Z test - comparing means or proportions (large samples)
Chi-square test - for categorical data

5
Q

Statistical power - definition and 3 determinants

A

1 - β, where β is the probability of a Type II error: the probability of detecting a true effect when it exists.
Depends on:
the sample size - as sample size increases, power increases
significance level - as alpha increases, power increases (at the cost of more Type I errors)
the effect size (e.g. Cohen’s d) - larger effects are easier to detect, so power increases
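
A minimal sketch of how the three determinants move power, assuming the simplest possible case: a one-sided, one-sample z-test with a standardized effect size d, where power = Φ(d·√n - z_crit). The function name and the example values are illustrative.

```python
from statistics import NormalDist

def z_test_power(d, n, alpha=0.05):
    """Power of a one-sided, one-sample z-test for standardized effect size d."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)  # rejection threshold under H0
    return std_normal.cdf(d * n ** 0.5 - z_crit)

print(z_test_power(0.5, 10), z_test_power(0.5, 30))              # larger n, more power
print(z_test_power(0.5, 30, 0.01), z_test_power(0.5, 30, 0.10))  # larger alpha, more power
print(z_test_power(0.2, 30), z_test_power(0.8, 30))              # larger effect, more power
```

When d = 0 there is no effect to detect, and the rejection rate falls back to alpha.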

6
Q

Standard Error

A

How much the sample mean is expected to deviate from the population mean.
SE = SD / √n
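
The formula as a small sketch, using the sample standard deviation (n - 1 denominator) as the SD; the sample values are made up.

```python
import math
import statistics

def standard_error(sample):
    """Standard error of the mean: sample SD divided by the square root of n."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

sample = [4.0, 5.0, 6.0, 7.0, 8.0]
print(standard_error(sample))
```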

7
Q

Reproducibility Crisis

A

Replication of results is a crucial part of science, and the reproducibility crisis refers to the problem that some research findings are hard to reproduce. This can be caused, e.g., by multiple testing, which increases Type I errors.

8
Q

Missing Data - 4 steps

A

missing data - when you don’t have valid values for one or more variables
1. identifying the type of missing data - ignorable/nonignorable
2. identifying the extent of missing data
3. diagnosing the randomness of missing data
4. selecting the imputation method

9
Q

Randomness of Missing Data

A
  1. MCAR - missing completely at random - missingness is unrelated to any variable
  2. MAR - missing at random - missingness in y depends on observed x
  3. MNAR - missing not at random - missingness in y depends on the value of y itself
10
Q

Levels of Missingness

A

item level
construct level
person level

11
Q

MAR or MCAR?

A
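  1. check the distribution of missing values
  2. t-test for missingness - compare cases with and without missing values on other variables; significant differences point to MAR rather than MCAR
  3. Little’s MCAR test - H0: missing values are completely at random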
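
A rough sketch of the t-test for missingness: split the cases by whether y is missing and compare the two groups on another variable x. It computes only the Welch t statistic, with no p-value. Missing values are assumed to be coded as None, and the data (age and income) are hypothetical.

```python
import math
import statistics

def welch_t(a, b):
    """Welch t statistic for the difference in means of two groups."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

def missingness_t(x, y):
    """Compare x between cases where y is missing and cases where y is observed."""
    x_when_y_missing = [xi for xi, yi in zip(x, y) if yi is None]
    x_when_y_observed = [xi for xi, yi in zip(x, y) if yi is not None]
    return welch_t(x_when_y_missing, x_when_y_observed)

# Hypothetical data: income (y) is mostly missing for older respondents (x = age).
age = [25, 30, 35, 40, 60, 65, 70, 75]
income = [1.0, 2.0, 3.0, 4.0, None, None, None, 5.0]

print(missingness_t(age, income))  # a large |t| suggests missingness depends on age
```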
12
Q

Imputation for MCAR

A

Handling the missing values using the other valid values: deletion, replacement by already known values (cold or hot deck), or substitution by the mean or by a prediction (regression).
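
A minimal sketch of the simplest of these options, mean substitution, assuming missing values are coded as None. The function name is illustrative.

```python
import statistics

def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

print(mean_impute([1.0, None, 3.0, 5.0, None]))
```

Mean substitution keeps the sample size but shrinks the variance of the variable, which is one reason regression or model-based methods are often preferred.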

13
Q

Imputation Methods and Extent of Missing Data

A

under 10% - any of them
under 20% - MCAR: regression or hot deck. MAR: model-based
over 20% - imputation is crucial: regression for MCAR, model-based for MAR

14
Q

Imputation methods for MAR

A
  1. Maximum Likelihood
  2. Expectation-Maximization (EM)
  3. Multiple Imputation
15
Q

Types of Outliers

A
  1. Error - due to inaccuracies in data collection - correct or delete
  2. Interesting
  3. Influential - identified at the post-analysis stage
16
Q

How to detect outliers

A
  1. Univariate - boxplots, histograms, MAD-median rule (>2.24), IQR - less masking
  2. Bivariate - scatterplot
  3. Multivariate - Mahalanobis distance or Cook’s distance
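
A sketch of the MAD-median rule from the univariate list: a value is flagged when |x - median| / (MAD / 0.6745) exceeds 2.24. The 0.6745 rescaling is the usual constant that makes MAD comparable to a standard deviation under normality. The data are made up, and the sketch assumes MAD is not zero.

```python
import statistics

def mad_median_outliers(values, cutoff=2.24):
    """Return the values flagged as outliers by the MAD-median rule."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    madn = mad / 0.6745  # rescaled MAD; assumes mad > 0
    return [v for v in values if abs(v - med) / madn > cutoff]

data = [2, 3, 3, 4, 4, 5, 5, 6, 50]
print(mad_median_outliers(data))  # flags only the extreme value 50
```

Because the median and MAD are barely affected by the extreme value itself, this rule is less prone to masking than a mean and standard deviation rule.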