Understanding Predictive Modeling Flashcards

1
Q

What is oversampling?

A

Oversampling disproportionately over-represents the event cases in the sample (for example, selecting an equal number of events and non-events).

2
Q

What is a use case for oversampling?

A

This is typically done when the original data set is very large and the ratio of events to non-events is very small.

3
Q

Name four common data-related challenges:

A

observational data
mixed measurement scales
high dimensionality
rare target events

4
Q

What is the curse of dimensionality?

A

High dimensionality limits your ability to explore and model the relationships among the variables.

5
Q

What are two issues associated with the curse of dimensionality?

A

The number of variables may affect computational performance more than the number of cases.

Including more variables spreads the cases more thinly across the input space, making it difficult to identify relationships in the data.
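
A standard back-of-the-envelope illustration of the second issue (the numbers here are an illustration, not from the deck): if inputs are uniformly distributed on the unit hypercube $[0,1]^p$, a hypercubical neighborhood that captures a fraction $r$ of the data must have edge length

e_p(r) = r^{1/p}

so with $p = 10$ inputs, capturing even 1% of the data requires an edge length of $0.01^{1/10} \approx 0.63$, about 63% of each input's range. "Local" neighborhoods stop being local as dimensionality grows.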

6
Q

What is a biased sample?

A

A sample that is not representative of the population, such as one produced by oversampling

7
Q

What is meant by observational data?

A

Data gathered for some purpose other than data analysis; that is, operational data

8
Q

What are some common problems found in operational data?

A

errors
missing data
redundant variables

9
Q

Name three analytical challenges:

A

nonlinearities
interactions
model selection

10
Q

What is the process of choosing the model with the highest predictive accuracy?

A

Model selection

11
Q

What does it mean to overfit the data?

A

Fitting an overly complex model that is so sensitive to peculiarities of the sample data set that it does not generalize well to new data

12
Q

What does it mean to underfit the data?

A

Underfitting the data occurs when the model is too simple and systematically misses the true features in the data.

13
Q

When is joint sampling appropriate?

A

When the target event is not rare

14
Q

What is joint sampling?

A

Joint sampling selects a representative sample of the data by randomly selecting input-target pairs.
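
A minimal SAS sketch of joint sampling (the data set name and sample size are hypothetical): a simple random sample draws whole rows, so events and non-events appear in roughly their population proportions.

proc surveyselect data=work.population out=work.joint_sample
                  method=srs    /* simple random sampling of rows, i.e. input-target pairs */
                  sampsize=5000 /* hypothetical sample size */
                  seed=1234;
run;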

15
Q

Which sampling method would you choose when your event is not rare?

A

Joint sampling

16
Q

Which sampling method would you choose when your event is rare?

A

Separate sampling (which is used to oversample the data)

17
Q

Which sampling method creates a target-based sample by drawing samples separately based on the target outcome, that is, on whether each case is an event or a non-event?

A

Separate sampling
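
A hedged SAS sketch of separate sampling (the data set name, target coding, and rates are hypothetical): giving each target level its own sampling rate keeps every rare event while thinning the non-events, producing the oversampled, biased sample described in the earlier cards.

/* PROC SURVEYSELECT expects the input sorted by the strata variable */
proc sort data=work.population out=work.sorted;
  by target;
run;

proc surveyselect data=work.sorted out=work.biased_sample
                  samprate=(0.05 1.0) /* hypothetical: 5% of non-events (target=0), 100% of events (target=1) */
                  seed=27513;
  strata target;
run;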

18
Q

What is the optimism principle?

A

The optimism principle states that when you assess the accuracy of a predictive model on the same data that was used to fit the model, you tend to get better assessment statistics than when you assess the model on other data.

19
Q

What does a large difference between the performance on the training and test data sets usually indicate?

A

overfitting / optimism bias

20
Q

What should you be concerned about as the underlying model becomes more flexible and the data less plentiful?

A

overfitting

21
Q

Name two activities that increase the risk of overfitting:

A

variable selection methods and supervised input preparation (such as collapsing levels of nominal variables based on associations with the target)

22
Q

What is optimism bias?

A

The inflated assessment that results from evaluating a model on its own training data; overfitting produces large differences between performance on the training and test data sets

23
Q

What is it called when you assess the performance of a model on new data that was not used to fit the model?

A

honest assessment

24
Q

What are some methods of honest assessment?

A

Splitting the data, k-fold cross-validation, and bootstrapping

25
Q

What is the simplest way to do an honest assessment of how well your model generalizes to new data?

A

Split your data into two data sets: a training data set and a validation data set
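
A minimal SAS sketch of such a split (the data set names and the 70/30 ratio are hypothetical; the OUTALL and SEED= options used here are covered in cards 35 and 36):

proc surveyselect data=work.modeldata out=work.split
                  samprate=0.7 /* hypothetical: 70% training, 30% validation */
                  seed=44444   /* a seed greater than 0 makes the split reproducible */
                  outall;      /* keep all rows, flagged by the Selected variable */
run;

data work.train work.valid;
  set work.split;
  if selected = 1 then output work.train;
  else output work.valid;
run;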

26
Q

How does splitting the data allow you to avoid optimism bias?

A

Because you are assessing your model on different cases than you used to fit the model, you avoid the optimism bias and get a valid assessment measurement.

27
Q

Which dataset is used to assess and compare models?

A

the test dataset

28
Q

Which dataset is used to fit the model?

A

the training dataset

29
Q

Which dataset is used for comparing, selecting, and tuning models?

A

the validation dataset

30
Q

Which dataset is used for the final assessment of a model?

A

the test dataset

31
Q

How much of the data is typically used as validation dataset?

A

one-fourth to one-half of the data

32
Q

Which sampling method will ensure that the training and validation sets have the same percentage of events?

A

stratified random sampling
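
A hedged sketch in SAS (names are hypothetical): adding a STRATA statement makes PROC SURVEYSELECT sample separately within each target level, so both partitions keep the same event percentage.

proc sort data=work.modeldata out=work.sorted;
  by target; /* input must be sorted by the strata variable */
run;

proc surveyselect data=work.sorted out=work.split
                  samprate=0.7 seed=44444 outall;
  strata target; /* draw 70% within each target level */
run;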

33
Q

What are strata?

A

non-overlapping groups

34
Q

Which data set is used to calculate the parameter estimates?

A

the training dataset

35
Q

What does adding the OUTALL option to PROC SURVEYSELECT do?

A

The OUTALL option outputs the entire input data set, augmented by a flag variable (Selected) that indicates whether each observation was selected for the sample

36
Q

How can you produce the same split of the data on each run of PROC SURVEYSELECT?

A

Set the SEED= option to a value greater than 0

37
Q

What PROC can be used to verify the stratification in a sample?

A

PROC FREQ
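
For example, continuing the hypothetical stratified split sketched above, a crosstabulation of the selection flag against the target should show matching event percentages in both partitions:

proc freq data=work.split;
  tables selected*target; /* row percentages should be (nearly) identical for Selected=0 and Selected=1 */
run;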