Preparing the Input Variables Flashcards
What does MCAR refer to?
missing completely at random
What is missingness?
the probability that a value is missing
What should you ask about missingness in your dataset?
Is the missingness dependent on the data?
What are lurking inputs?
unobserved variables affecting the probability that a value is missing
What is it called when you select only cases that have no missing values?
Complete case analysis
Why is complete case analysis the default behavior in PROC LOGISTIC?
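The procedure needs a value for every input in order to use a case, so it drops incomplete cases. If the values are missing completely at random and few cases are affected, doing so does not bias your inferences.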
When is complete case analysis appropriate?
When only a very small percentage of values is missing and the values are missing completely at random
Why is complete case analysis not a good choice for predictive modeling?
Three reasons: (1) in high dimensions, even a small number of missing values per input can cause an enormous loss of data; (2) when the probability of missing values is high, those values are unlikely to be missing by random chance; (3) scorability: the model cannot score new cases that have missing data.
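The data loss in high dimensions can be checked with quick arithmetic. The sketch below assumes, purely for illustration, that every input is missing independently with the same probability; the function name is hypothetical.

```python
def complete_case_fraction(p_missing: float, n_inputs: int) -> float:
    """Expected fraction of cases with no missing values.

    Assumes (for illustration only) that each of n_inputs is missing
    independently with probability p_missing.
    """
    return (1.0 - p_missing) ** n_inputs


# With 1% missing per input, compare 1, 10, and 100 inputs.
for k in (1, 10, 100):
    print(k, round(complete_case_fraction(0.01, k), 3))
```

Under these assumptions, 1% missing per input across 100 inputs leaves only about 37% of cases complete.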
What is imputation?
The process of replacing missing values with reasonable substitutes
What type of imputation is best for a binary variable?
The median (for a binary input, this is its more common value)
What is the hot decking method of imputation?
This method sorts all the cases in the sample by the values of several variables. A missing value is then imputed by taking the value from the case that is closest to it.
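A minimal sketch of the sort-and-borrow idea described above. It is one simple reading of hot decking, not a reference implementation: "closest" is taken to mean the nearest position in the sorted order, the sort variables are assumed to have no missing values, and the function name and data are made up.

```python
def hot_deck_impute(rows, sort_keys, target):
    """Return sorted copies of rows with missing `target` values (None)
    filled from the nearest donor in the sorted order.

    Ties between equally distant donors go to the earlier case.
    """
    ordered = sorted(rows, key=lambda r: tuple(r[k] for k in sort_keys))
    # Positions of cases that can donate an observed value.
    donors = [i for i, r in enumerate(ordered) if r[target] is not None]
    result = []
    for i, row in enumerate(ordered):
        filled = dict(row)  # copy so the original rows are not modified
        if row[target] is None and donors:
            nearest = min(donors, key=lambda j: abs(j - i))
            filled[target] = ordered[nearest][target]
        result.append(filled)
    return result


# Made-up example data.
cases = [
    {"age": 25, "region": 1, "income": 30000},
    {"age": 60, "region": 2, "income": 80000},
    {"age": 62, "region": 2, "income": None},
]
imputed = hot_deck_impute(cases, sort_keys=("region", "age"), target="income")
print(imputed[2]["income"])
```

Here the case with the missing income borrows from its neighbor in the same region rather than from the younger case in a different region.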
What purpose does a missing value indicator variable serve?
It can be used to detect a relationship between the missingness and the target.
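One way to see what an indicator can reveal is to compare the target event rate between cases where the input is missing and cases where it is observed. The sketch below assumes a binary 0/1 target and uses None for missing values; the function name and data are invented for illustration.

```python
def event_rate_by_missingness(values, targets):
    """Return (rate_when_missing, rate_when_observed) for a 0/1 target."""
    # Missing value indicator: 1 if the input is missing, else 0.
    indicator = [1 if v is None else 0 for v in values]

    def rate(flag):
        group = [t for m, t in zip(indicator, targets) if m == flag]
        return sum(group) / len(group) if group else float("nan")

    return rate(1), rate(0)


# Made-up example data.
income = [None, 52000, None, 61000, 48000, None]
target = [1, 0, 1, 0, 1, 0]
missing_rate, observed_rate = event_rate_by_missingness(income, target)
print(missing_rate, observed_rate)
```

A clear gap between the two rates suggests that the missingness itself carries information about the target, which is what the indicator lets a model use.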
Imputing a numeric input using a median or mean value is generally effective when the missing values represent no more than _____% of all the input's values.
50
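A minimal sketch of median imputation, assuming missing values are stored as None. The median is computed once from the development data and then reused for new cases, so that new cases with missing values can still be scored. The helper names are hypothetical.

```python
from statistics import median


def fit_median(values):
    """Median of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    return median(observed)


def impute(values, fill):
    """Replace each missing value (None) with the given fill value."""
    return [fill if v is None else v for v in values]


# Made-up development data and new cases to score.
train = [3.0, None, 7.0, 5.0, None]
fill_value = fit_median(train)
train_imputed = impute(train, fill_value)
new_cases = impute([None, 4.0], fill_value)
print(fill_value, train_imputed, new_cases)
```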
What are three important goals when handling missing values for predictive modeling?
Retain all the original data for model development, score all new cases, and capture the relationship of missingness with the target.
What is quasi-complete separation?
Quasi-complete separation occurs when a level of a categorical input has a target event rate of either 0% or 100%.
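Following the definition above, a quick screen is to compute the event rate for each level and flag any level at 0% or 100%. This is a sketch assuming a 0/1 target; the function name and data are made up.

```python
from collections import defaultdict


def levels_with_extreme_rates(levels, targets):
    """Return the sorted levels whose target event rate is 0% or 100%."""
    counts = defaultdict(lambda: [0, 0])  # level -> [events, total cases]
    for level, t in zip(levels, targets):
        counts[level][0] += t
        counts[level][1] += 1
    return sorted(
        level
        for level, (events, total) in counts.items()
        if events == 0 or events == total
    )


# Made-up example data: level "B" never has the event.
branch = ["A", "A", "B", "B", "C", "C"]
target = [0, 1, 0, 0, 1, 0]
flagged = levels_with_extreme_rates(branch, target)
print(flagged)
```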