Preparing the Input Variables Flashcards

1
Q

What does MCAR refer to?

A

missing completely at random

2
Q

What is missingness?

A

the probability that a value is missing

3
Q

What should you ask about missingness in your dataset?

A

Is the missingness dependent on the data?

4
Q

What are lurking inputs?

A

unobserved variables affecting the probability that a value is missing

5
Q

What does it mean to select only cases that have no missing values?

A

Complete case analysis

6
Q

Why is complete case analysis the default behavior in PROC LOGISTIC?

A

using complete case analysis biases your inferences the least

7
Q

When is complete case analysis appropriate?

A

When you have a very small percent missing, and the values are missing completely at random

8
Q

Why is complete case analysis not a good choice for predictive modeling?

A

There are three reasons. First, in high dimensions even a small number of missing values per input can cause an enormous loss of data. Second, with many inputs the probability that a case has at least one missing value is high, and those values are unlikely to be missing purely by chance. Third, scorability: the model won’t be able to score any new cases with missing data.
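
The data-loss point can be checked with simple arithmetic. The sketch below assumes, purely for illustration, that every input is missing independently with the same probability; the function name is hypothetical and real data will rarely behave this neatly.

```python
def expected_complete_fraction(p_missing: float, n_inputs: int) -> float:
    """Expected share of cases with no missing value at all.

    Illustrative assumption: each input is missing independently
    with the same probability p_missing.
    """
    return (1.0 - p_missing) ** n_inputs


# With 1% missing per input, see how the share of complete cases shrinks
# as the number of inputs grows.
for k in (1, 10, 50, 100):
    share = expected_complete_fraction(0.01, k)
    print(f"{k:>3} inputs: {share:.1%} of cases are complete")
```

Under this assumption, complete case analysis discards a large share of the rows once there are many inputs, even though each input on its own looks almost fully populated.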

9
Q

What is imputation?

A

The process of replacing missing values with reasonable substitutes

10
Q

What type of imputation is best for a binary variable?

A

The Median

11
Q

What is the hot decking method of imputation?

A

This method sorts all the cases in the sample by the values of several variables. A missing value is then imputed by taking the value from the case that is closest to it.
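
A minimal sketch of the idea, assuming "closest" means nearest by position in the sorted order (one simple reading of the card; other distance measures are possible). The function name, field names, and data are hypothetical.

```python
def hot_deck_impute(rows, target, sort_keys):
    """Return sorted copies of rows with missing `target` values filled
    from the nearest case (by sorted position) that has an observed value."""
    ordered = sorted((dict(r) for r in rows),
                     key=lambda r: tuple(r[k] for k in sort_keys))
    # Only originally observed cases may donate a value.
    donors = [i for i, r in enumerate(ordered) if r[target] is not None]
    if not donors:
        raise ValueError("no observed values available to donate")
    for i, r in enumerate(ordered):
        if r[target] is None:
            nearest = min(donors, key=lambda d: abs(d - i))
            r[target] = ordered[nearest][target]
    return ordered


# Hypothetical cases: balance is missing for two of them.
cases = [
    {"age": 25, "income": 30, "balance": 100},
    {"age": 52, "income": 80, "balance": None},
    {"age": 50, "income": 75, "balance": 900},
    {"age": 27, "income": 32, "balance": None},
]
filled = hot_deck_impute(cases, target="balance", sort_keys=("age", "income"))
balances = [r["balance"] for r in filled]
```

Each missing balance is taken from the neighbouring case with the most similar age and income, so the imputed values stay within the range of values actually observed.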

12
Q

What purpose does a missing value indicator variable serve?

A

It can be used to detect a relationship between the missingness and the target.
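
As a small illustration, the sketch below pairs median imputation with a 0/1 missing value indicator, so the fact that a value was missing survives the imputation. The function and variable names are hypothetical, and `None` stands in for a missing value.

```python
from statistics import median


def impute_with_indicator(values):
    """Replace missing values (None) with the median of the observed values
    and return a parallel 0/1 indicator marking which values were missing."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator


income = [40, None, 55, 70, None]
income_imputed, income_missing = impute_with_indicator(income)
```

Both `income_imputed` and `income_missing` can then be offered to the model as inputs; the indicator lets the model pick up any association between missingness and the target.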

13
Q

Imputing a numeric input using a median or mean value is generally effective when the missing values of the input represent no more than _____% of all the input’s values.

A

50

14
Q

Name three important goals when handling missing values for predictive modeling

A

retain all the original data for model development, score all new cases, capture the relationship of missingness with the target.

15
Q

What is quasi-complete separation?

A

Quasi-complete separation occurs when a level of the categorical input has a target event rate of either 0% or 100%
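
A quick way to spot such levels is to compute the event rate per level and flag any level where it is exactly 0% or 100%. This is a minimal sketch with hypothetical names and data, assuming a 0/1 target.

```python
from collections import defaultdict


def levels_with_degenerate_event_rate(levels, targets):
    """Return the levels whose target event rate is exactly 0% or 100%."""
    counts = defaultdict(lambda: [0, 0])  # level -> [events, cases]
    for level, y in zip(levels, targets):
        counts[level][0] += y
        counts[level][1] += 1
    return sorted(level for level, (events, n) in counts.items()
                  if events in (0, n))


branch = ["A", "A", "B", "B", "C"]
target = [1, 0, 0, 0, 1]
flagged = levels_with_degenerate_event_rate(branch, target)
```

Here level B never has the event and level C always has it, so both are flagged, while level A has a mixed outcome and is not.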

16
Q

What does it mean to collapse categories using the thresholding method?

A

This method requires a minimum number of cases in a level in order to create a dummy code input for that level. Any class value that doesn’t meet the criteria is collapsed into a new ‘Other’ category

17
Q

What PROC might you run to help determine your threshold?

A

PROC FREQ
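
The thresholding idea, together with the frequency count used to pick a threshold, can be sketched in a few lines. This is an illustration in Python rather than the SAS procedure itself; the function name, the `min_count` parameter, and the data are hypothetical.

```python
from collections import Counter


def collapse_rare_levels(values, min_count, other_label="Other"):
    """Keep levels with at least min_count cases; collapse the rest
    into a single other_label level."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


branch = ["B1"] * 5 + ["B2"] * 3 + ["B3", "B4"]

# Frequency table: inspect the level counts before choosing a threshold.
level_counts = Counter(branch)

collapsed = collapse_rare_levels(branch, min_count=3)
collapsed_counts = Counter(collapsed)
```

With a threshold of 3, the single-case levels B3 and B4 end up together in 'Other', while B1 and B2 keep their own dummy-codable levels.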

18
Q

Which of the following statements is true regarding Greenacre’s method for collapsing the levels of contingency tables?

 a. The method is appropriate for any categorical input.
 b. The method does not account for the sample size in each level.
 c. Levels with similar marginal response rates are merged.
 d. At each step, the levels that give the largest reduction in the chi-square statistic are merged.

A

C
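
To see why levels with similar response rates end up together, here is a rough sketch of the greedy merging idea as it is commonly described: at each step, merge the pair of levels whose merger reduces the Pearson chi-square statistic the least. The function names and counts are hypothetical, every level and both target classes are assumed to have at least one case, and this is not presented as the exact procedure used in any particular software.

```python
from itertools import combinations


def chi_square(table):
    """Pearson chi-square for a list of (events, nonevents) rows."""
    total_e = sum(e for e, _ in table)
    total_n = sum(n for _, n in table)
    total = total_e + total_n
    stat = 0.0
    for e, n in table:
        row = e + n
        for observed, col_total in ((e, total_e), (n, total_n)):
            expected = row * col_total / total
            stat += (observed - expected) ** 2 / expected
    return stat


def merge_levels(counts, n_final):
    """Greedily merge levels until n_final groups remain, each time keeping
    the merger that leaves the largest chi-square (the smallest reduction)."""
    groups = {(level,): tuple(c) for level, c in counts.items()}
    while len(groups) > n_final:
        best = None
        for a, b in combinations(groups, 2):
            merged = dict(groups)
            ea, na = merged.pop(a)
            eb, nb = merged.pop(b)
            merged[a + b] = (ea + eb, na + nb)
            stat = chi_square(list(merged.values()))
            if best is None or stat > best[0]:
                best = (stat, merged)
        groups = best[1]
    return groups


# Hypothetical (events, nonevents) counts per level.
level_counts = {"A": (10, 90), "B": (12, 88), "C": (40, 60), "D": (45, 55)}
merged_groups = merge_levels(level_counts, n_final=2)
```

In this example A and B (10% and 12% event rates) are merged first, then C and D (40% and 45%), which matches the card's point that levels with similar response rates are combined.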