Understanding Predictive Modeling Flashcards by Nicole Fox

What is oversampling?

Building a sample that disproportionately over-represents the event cases (for example, an equal number of events and non-events)

What is a use case for oversampling?

Oversampling is typically used when the original data set is very large and the ratio of events to non-events is very small.

Name four common data-related challenges

observational data
mixed measurement scales
high dimensionality
rare target events

What is the curse of dimensionality?

High dimensionality limits your ability to explore and model the relationships among the variables.

What are two issues associated with the curse of dimensionality?

The number of variables may affect computational performance more than the number of cases

Including more variables makes the values more spread out, making it difficult to identify relationships in the data
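
The "more spread out" effect can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the cards: the inputs are uniform random numbers, the sample size is fixed at 200, and the function name is made up for the example. It measures how far each case is from its closest neighbor as the number of variables grows.

```python
import math
import random

def mean_nearest_neighbor_distance(n_points, n_dims, seed=1):
    """Average distance from each point to its closest neighbor."""
    rng = random.Random(seed)
    # Hypothetical data: uniform random inputs in the unit hypercube
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points

# Same number of cases, increasing number of variables
distances = {d: mean_nearest_neighbor_distance(200, d) for d in (1, 2, 10, 50)}
for d, dist in distances.items():
    print(f"{d:>2} variables: mean nearest-neighbor distance = {dist:.3f}")
```

With the number of cases held constant, the nearest neighbor moves farther away as variables are added, which is one way to see why relationships become harder to identify.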

What is a biased sample?

A sample produced by oversampling, in which the proportion of events differs from the proportion in the population

What is meant by observational data?

Data gathered for some purpose other than data analysis, i.e. operational data

What are some common problems found in operational data?

errors
missing data
redundant variables

Name three analytical challenges:

nonlinearities
interactions
model selection

What is the process of choosing the model with the highest predictive accuracy?

Model selection

What does it mean to overfit the data?

Using an overly complex model that is too sensitive to peculiarities in the sample data set and therefore does not generalize well to new data

What does it mean to underfit the data?

Underfitting the data occurs when the model is too simple and systematically misses the true features in the data.

When is joint sampling appropriate?

When the target event is not rare

What is joint sampling?

Joint sampling selects a representative sample of the data by randomly selecting input-target pairs.
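
As a rough illustration, joint sampling amounts to a simple random sample of whole cases, so each input stays paired with its target. The sketch below uses Python's standard library; the data, the 25% event rate, and the variable names are all invented for the example.

```python
import random

rng = random.Random(2024)

# Hypothetical data: (inputs, target) pairs with a 25% event rate
pairs = [((i % 7, i % 3), 1 if i % 4 == 0 else 0) for i in range(2000)]

# Joint sampling: draw whole input-target pairs at random, without replacement
joint_sample = rng.sample(pairs, 500)

population_rate = sum(target for _, target in pairs) / len(pairs)
sample_rate = sum(target for _, target in joint_sample) / len(joint_sample)
print(f"population event rate: {population_rate:.3f}")
print(f"sample event rate:     {sample_rate:.3f}")
```

Because the target plays no role in the selection, the event rate in the sample should land close to the event rate in the population.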

Which sampling method would you choose when your event is not rare?

Joint sampling

Which sampling method would you choose when your event is rare?

Separate sampling (which is used to oversample the data)
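
A minimal sketch of separate sampling, assuming a binary target coded 1 for events and 0 for non-events. The function name, the 2% event rate, and the sample sizes are hypothetical and chosen only to show the idea of drawing from each target group on its own.

```python
import random

def separate_sample(cases, n_per_class, seed=12345):
    """Draw events and non-events separately, up to n_per_class of each."""
    rng = random.Random(seed)
    events = [c for c in cases if c["target"] == 1]
    non_events = [c for c in cases if c["target"] == 0]
    # Keep every case in a group if it has fewer than n_per_class members
    k_events = min(n_per_class, len(events))
    k_non_events = min(n_per_class, len(non_events))
    return rng.sample(events, k_events) + rng.sample(non_events, k_non_events)

# Hypothetical data: 10,000 cases with a 2% event rate
population = [{"id": i, "target": 1 if i % 50 == 0 else 0} for i in range(10_000)]
sample = separate_sample(population, n_per_class=200)

event_rate = sum(c["target"] for c in sample) / len(sample)
print(f"sample size: {len(sample)}, event rate in sample: {event_rate:.2f}")
```

Here the sample holds equal numbers of events and non-events even though events make up only 2% of the population, so the sample is deliberately unrepresentative of the population event rate.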

Which sampling method creates a target-based sample by drawing samples separately based on the target outcome, that is, whether each case is an event or a non-event?

Separate sampling

What is the optimism principle?

The optimism principle states that when you assess the accuracy of a predictive model on the same data that was used to fit the model, you tend to get better assessment statistics than when you assess the model on other data.
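
The principle can be seen in a deliberately extreme sketch. The setup is an assumption for illustration only: the target is a coin flip unrelated to the inputs, and the model is a one-nearest-neighbor classifier, chosen because it is very flexible. None of the names below come from the cards.

```python
import math
import random

rng = random.Random(7)

def make_cases(n):
    # Inputs are random and the target is a coin flip, so there is nothing real to learn
    return [((rng.random(), rng.random()), rng.randint(0, 1)) for _ in range(n)]

def predict_1nn(train, x):
    # Predict the target of the closest training case
    return min(train, key=lambda case: math.dist(case[0], x))[1]

def accuracy(train, cases):
    hits = sum(predict_1nn(train, x) == y for x, y in cases)
    return hits / len(cases)

train = make_cases(300)
holdout = make_cases(300)

train_accuracy = accuracy(train, train)      # each case is its own nearest neighbor
holdout_accuracy = accuracy(train, holdout)  # cases the model has never seen
print(f"accuracy on training data: {train_accuracy:.2f}")
print(f"accuracy on other data:    {holdout_accuracy:.2f}")
```

The model looks perfect on the data used to fit it, while its accuracy on other data should sit near chance, since the target contains no signal.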

What does a large difference between the performance on the training and test data sets usually indicate?

overfitting / optimism bias

What should you be concerned with as the underlying model becomes more flexible and the data become less plentiful?

overfitting

Name two activities that increase the risk of overfitting.

variable selection methods and supervised input preparation (such as collapsing levels of nominal variables based on associations with the target)

What is optimism bias?

The overly favorable assessment you get when a model is evaluated on the data used to fit it; overfitting shows up as a large difference between performance on the training and test data sets

What is assessing the performance of a model on new data that was not used to fit the original model referred to as?

honest assessment

What are some methods of honest assessment?

Splitting the data, k-fold cross-validation, and bootstrapping
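
For the k-fold approach, the bookkeeping can be sketched as follows. This shows only how cases are assigned to folds, not model fitting, and the function name and sizes are hypothetical.

```python
import random

def k_fold_indices(n_cases, k, seed=42):
    """Yield (training, validation) index lists, one pair per fold."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation

splits = list(k_fold_indices(n_cases=10, k=5))
for fold_number, (training, validation) in enumerate(splits, start=1):
    print(f"fold {fold_number}: validate on {sorted(validation)}, fit on the other {len(training)}")
```

Each case is held out for assessment exactly once and is used for fitting in the remaining folds, so every assessment is made on cases the model did not see.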

What is the simplest way to do an honest assessment of how well your model generalizes to new data?

Split your data into two data sets: a training data set and a validation data set

How does splitting the data allow you to avoid optimism bias?

Because you are assessing your model on different cases than you used to fit the model, you avoid the optimism bias and get a valid assessment measurement.

Which dataset is used to assess and compare models?

the validation dataset

Which dataset is used to fit the model?

the training dataset

Which dataset is used for comparing, selecting, and tuning models?

the validation dataset

Which dataset is used for the final assessment of a model?

the test dataset

How much of the data is typically used as validation dataset?

one-fourth to one-half of the data

Which sampling method will ensure that the training and validation sets have the same percentage of events?

stratified random sampling
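
A minimal sketch of a stratified split, using the target as the stratification variable. It is written in plain Python rather than SAS, and the function names, the 5% event rate, and the 50% validation fraction are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(cases, validation_fraction, seed=27513):
    """Split each target stratum separately so both partitions keep the same mix."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case in cases:
        strata[case["target"]].append(case)
    training, validation = [], []
    for members in strata.values():
        members = members[:]  # copy so the caller's data is not reordered
        rng.shuffle(members)
        n_validation = round(len(members) * validation_fraction)
        validation.extend(members[:n_validation])
        training.extend(members[n_validation:])
    return training, validation

def event_rate(cases):
    return sum(c["target"] for c in cases) / len(cases)

# Hypothetical data: 2,000 cases with a 5% event rate
cases = [{"id": i, "target": 1 if i % 20 == 0 else 0} for i in range(2000)]
training, validation = stratified_split(cases, validation_fraction=0.5)
print(f"training event rate:   {event_rate(training):.3f}")
print(f"validation event rate: {event_rate(validation):.3f}")
```

Because events and non-events are split separately, the two partitions end up with matching event percentages instead of differing by chance.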

What are strata?

non-overlapping groups

Which data set is used to calculate the parameter estimates?

the training dataset

What does adding the OUTALL option on a PROC SURVEYSELECT do?

The OUTALL option returns the initial data set, augmented by a flag to indicate selection in the sample

How can you produce the same split of the data on each run of PROC SURVEYSELECT?

Set a value greater than 0 using the SEED= option
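
The two ideas above, returning every case with a selection flag and fixing the seed for a repeatable split, can be mimicked in Python. This is an analogy, not SAS code and not how PROC SURVEYSELECT is implemented; the function name, the `selected` key, and the seed value are hypothetical.

```python
import random

def flag_selection(cases, sample_rate, seed):
    """Return every case, adding a 0/1 flag that marks the sampled ones."""
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    n_selected = round(len(cases) * sample_rate)
    chosen = set(rng.sample(range(len(cases)), n_selected))
    return [dict(case, selected=int(i in chosen)) for i, case in enumerate(cases)]

cases = [{"id": i} for i in range(100)]
run_1 = flag_selection(cases, sample_rate=0.5, seed=44444)
run_2 = flag_selection(cases, sample_rate=0.5, seed=44444)

print("same split on both runs:", run_1 == run_2)
print("cases flagged as selected:", sum(c["selected"] for c in run_1))
```

Keeping all cases in one flagged data set makes it easy to derive the training and validation partitions later by filtering on the flag.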

What PROC can be used to verify the stratification in a sample?

PROC FREQ
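
The check itself is a two-way frequency table of the selection flag by the target. A rough Python equivalent is sketched below; the flagged data are made up, and the key names are assumptions for the example rather than anything produced by SAS.

```python
from collections import Counter

# Hypothetical flagged cases: selection flag plus a binary target
flagged = (
    [{"selected": 1, "target": 1}] * 30
    + [{"selected": 1, "target": 0}] * 270
    + [{"selected": 0, "target": 1}] * 30
    + [{"selected": 0, "target": 0}] * 270
)

# Cross-tabulate the selection flag by the target
counts = Counter((c["selected"], c["target"]) for c in flagged)

event_pct = {}
for selected in (0, 1):
    total = counts[(selected, 0)] + counts[(selected, 1)]
    event_pct[selected] = 100 * counts[(selected, 1)] / total
    print(f"selected={selected}: n={total}, events={event_pct[selected]:.1f}%")
```

If the stratification worked, the event percentage is the same in the selected and unselected groups.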

(37 cards)