Week 1 Data screening and missing values Flashcards
What is the difference between MCAR, MAR and MNAR?
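Missing completely at random (MCAR) – The missingness does not depend on anything: not on other variables and not on the missing value itself.
Missing at random (MAR) – The missingness is related to other observed variables (male vs. female, SES etc.) but not to the missing value itself.
Missing not at random (MNAR) – The missingness is related to the value that is missing (e.g. people who skip a question because their answer would embarrass them).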
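One way to keep the three mechanisms apart is to ask what the chance of a value being missing depends on. The sketch below writes that out for a hypothetical income question. The function names, the respondents and every probability are invented for illustration and do not come from any real data set.

```python
# Hypothetical missingness rules for an income question.
# All probabilities are made up for illustration only.

def p_missing_mcar(gender, income):
    # MCAR: everyone has the same chance of a missing answer.
    return 0.10

def p_missing_mar(gender, income):
    # MAR: depends on an observed variable (gender), not on income itself.
    return 0.30 if gender == "male" else 0.05

def p_missing_mnar(gender, income):
    # MNAR: depends on the very value that would go missing (income).
    return 0.40 if income < 20000 else 0.05

respondents = [("male", 15000), ("female", 15000),
               ("male", 80000), ("female", 80000)]

for gender, income in respondents:
    print(gender, income,
          p_missing_mcar(gender, income),
          p_missing_mar(gender, income),
          p_missing_mnar(gender, income))
```

Reading the printed rows: the MCAR column never changes, the MAR column changes only with gender, and the MNAR column changes only with income.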
Dealing with missing values: What are the pros and cons of row deletion?
Pro – Very simple
Con – The missingness may not be random, and if it's not, deletion can introduce bias (ending up with more males than females, for example).
Dealing with missing values: What are the pros and cons of mean/median imputation?
Pro – Simple
Con – Artificially reduces variability
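A small sketch of both approaches, assuming a made-up list of scores in which `None` marks a missing response. It shows row deletion shrinking the sample and mean imputation shrinking the standard deviation.

```python
from statistics import mean, stdev

# Hypothetical scores; None marks a missing response.
scores = [4.0, 7.0, None, 5.0, 9.0, None, 6.0, 8.0]

# Row (listwise) deletion: keep only the observed cases.
observed = [s for s in scores if s is not None]

# Mean imputation: replace each missing value with the observed mean.
fill_value = mean(observed)
imputed = [fill_value if s is None else s for s in scores]

sd_observed = stdev(observed)
sd_imputed = stdev(imputed)

print(len(scores), len(observed))      # sample size before and after deletion
print(round(sd_observed, 2), round(sd_imputed, 2))
```

With these numbers the mean stays at 6.5 but the standard deviation drops from about 1.87 to about 1.58, because the imputed cases sit exactly on the mean and add no spread.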
What is Cohen’s d, and what are the effect size benchmarks?
Cohen’s d is a standardised measure of effect size: the difference between two group means expressed in standard deviation units. 0.20 = small, 0.50 = medium, 0.80 = large
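As a worked example, Cohen’s d for two independent groups can be sketched as the mean difference divided by the pooled standard deviation. The function name `cohens_d` and the two groups of scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) +
                  (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical scores for two groups.
treatment = [6, 7, 8, 9, 10]
control = [5, 6, 7, 8, 9]

d = cohens_d(treatment, control)
print(round(d, 2))
```

Here the means differ by 1 point and the pooled standard deviation is about 1.58, so d is about 0.63, which falls between the medium and large benchmarks.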
Power (ability of a test to find an effect when it actually exists) can be affected by what 3 things?
Sample size
Effect size (e.g. Cohen’s d)
Alpha (the significance level, i.e. the p value cut-off)
What is a p value?
It is the probability of obtaining a difference at least as large as the one observed if only random chance were operating (i.e. if the null hypothesis were true). As alpha (the p value cut-off) increases, power increases. For example, a cut-off of p < .05 gives more power than p < .001, as there is more chance of detecting an effect or relationship.
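To see how sample size, effect size and alpha each move power, here is a rough sketch using a normal approximation for a two-sided comparison of two equal-sized groups. The function name `approx_power` and the example values are assumptions for illustration; dedicated power software uses the t distribution and will give slightly different numbers.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d, n_per_group, alpha):
    """Approximate power of a two-sided, two-group test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)     # critical value for the chosen alpha
    shift = d * sqrt(n_per_group / 2)     # expected size of the test statistic
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

baseline = approx_power(d=0.5, n_per_group=64, alpha=0.05)
stricter_alpha = approx_power(d=0.5, n_per_group=64, alpha=0.001)
smaller_sample = approx_power(d=0.5, n_per_group=20, alpha=0.05)
smaller_effect = approx_power(d=0.2, n_per_group=64, alpha=0.05)

print(round(baseline, 2), round(stricter_alpha, 2),
      round(smaller_sample, 2), round(smaller_effect, 2))
```

The baseline comes out at around 0.8, and each of the three changes (stricter alpha, smaller sample, smaller effect) lowers it.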
How can you screen for out-of-range/miscoded data?
Run a frequency distribution table for each variable and look for values outside the valid range.
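A minimal sketch of the same check outside SPSS, assuming a hypothetical item that should only take the values 1 to 5. The data are invented.

```python
from collections import Counter

# Hypothetical responses to an item that should be coded 1-5.
responses = [1, 3, 5, 2, 4, 55, 3, 0, 2, 5]
valid_values = range(1, 6)

# Frequency table: each distinct value and how often it occurs.
freq = Counter(responses)
for value in sorted(freq):
    print(value, freq[value])

# Any value outside the valid range is a candidate miscode.
out_of_range = sorted(v for v in freq if v not in valid_values)
print(out_of_range)
```

The values 0 and 55 stand out in the table straight away; 55 looks like a double keystroke for 5, which is the kind of miscode this screen is meant to catch.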
How do you identify a univariate outlier?
Calculate the z score for each case. z scores have cut-off points that correspond to p values (two-tailed), so you can define a case as a statistical outlier if its z score falls beyond the cut-off for p < .001 (z = ±3.29) or p < .05 (z = ±1.96).
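The z score rule can be sketched as follows. The helper names and the scores are hypothetical, and the default cut-off of 3.29 is the p < .001 value given above.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardise each value: (value - mean) / standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def univariate_outliers(values, cutoff=3.29):
    """Return the values whose absolute z score exceeds the cut-off."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > cutoff]

# Hypothetical scores: twenty typical cases and one extreme case.
scores = [49, 51] * 10 + [90]

print(univariate_outliers(scores))                # strict cut-off, p < .001
print(univariate_outliers(scores, cutoff=1.96))   # lenient cut-off, p < .05
```

The score of 90 has a z score of about 4.3, so it is flagged under either cut-off. Note that in very small samples no case can reach z = 3.29 at all, which is one reason sample size matters when deciding on outliers.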
Decisions relating to univariate outliers depend on…?
- patterns of answers to other variables
- expectations that arise from your knowledge of the area (past research and theory).
- sample size.
- statistical technique you intend to use.
What is system missing data?
System missing data is where you find a blank, or perhaps a dot, in the cell where someone has not provided a response.
What is discrete missing data?
Discrete missing data is where you tell SPSS to treat specific values (codes) as missing, so that the code can record why a response is missing.
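The difference between the two kinds of missing data can be sketched in plain Python, with `None` standing in for system missing (a blank cell) and a dictionary of codes standing in for discrete missing values. The codes -9 and -8 and their labels are hypothetical, not SPSS defaults.

```python
from collections import Counter

# Hypothetical user-defined missing codes and what they stand for.
MISSING_CODES = {-9: "refused", -8: "not applicable"}

# None represents a system-missing (blank) cell.
raw = [3, -9, 5, None, -8, 4]

# For analysis, treat both blanks and coded values as missing.
cleaned = [None if (v is None or v in MISSING_CODES) else v for v in raw]

# The codes still let us tally why each value is missing.
reasons = Counter(
    MISSING_CODES.get(v, "system missing")
    for v in raw
    if v is None or v in MISSING_CODES
)

print(cleaned)
print(dict(reasons))
```

Both kinds end up excluded from the analysis, but only the coded values carry a reason with them.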
What are the two main reasons that data can be missing?
Random reason – Random reasons are reasons that differ from respondent to respondent (accidentally missed a question, dropped out of the study, ran out of time, didn’t know the answer).
Systematic reason – Systematic reasons occur when more than one person missed responding for the same reason. These reasons affect some or all respondents systematically (unable to answer the question, the question is inappropriate, problems with the question, equipment failure).
What can we do once we understand the missing values?
Delete cases (listwise)
Delete the variable (if it has many MVs)
Replace them with something else
What are some popular methods for replacing MVs when data are MCAR?
Mean imputation
Expectation maximisation (EM) imputation
Regression imputation
Full information maximum likelihood
What does the univariate output tell us?
It gives a count of the missing values for each variable.
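The same kind of tally can be sketched by hand. This is not the SPSS output itself, just a count and percentage of missing values per variable for a small, invented data set in which `None` marks a missing value.

```python
# Hypothetical data set: one dictionary per respondent, None = missing.
rows = [
    {"age": 23, "income": 41000, "stress": 4},
    {"age": 31, "income": None, "stress": 2},
    {"age": None, "income": None, "stress": 5},
    {"age": 45, "income": 52000, "stress": 3},
]

# For each variable, record (number missing, percentage missing).
missing_summary = {}
for var in rows[0]:
    n_missing = sum(1 for r in rows if r[var] is None)
    missing_summary[var] = (n_missing, 100 * n_missing / len(rows))

for var, (count, percent) in missing_summary.items():
    print(var, count, percent)
```

A variable such as `income` here, with half its values missing, is the kind that might be considered for deletion rather than imputation.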