Understanding Predictive Modeling Flashcards
What is oversampling?
A sampling approach that disproportionately over-represents the event cases in the sample (for example, drawing an equal number of events and non-events)
What is a use case for oversampling?
Oversampling is typically used when the original data set is very large and the ratio of events to non-events is very small.
Name four common data-related challenges
observational data
mixed measurement scales
high dimensionality
rare target events
What is the curse of dimensionality?
High dimensionality limits your ability to explore and model the relationships among the variables.
What are two issues associated with the curse of dimensionality?
The number of variables may affect computational performance more than the number of cases
Including more variables makes the values more spread out, making it difficult to identify relationships in the data
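The spreading-out effect can be made concrete with a small calculation. This is an illustrative sketch that is not part of the original cards: it assumes the inputs are uniformly distributed over a unit hypercube, in which case a sub-cube that captures a fraction r of the cases needs an edge length of r^(1/d) in d dimensions. The function name edge_length_for_fraction is hypothetical.

```python
def edge_length_for_fraction(fraction: float, n_dims: int) -> float:
    """Edge length of a sub-cube that holds `fraction` of uniformly spread cases."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if n_dims < 1:
        raise ValueError("n_dims must be at least 1")
    # The volume of a cube with edge e in d dimensions is e**d, so e = fraction**(1/d).
    return fraction ** (1.0 / n_dims)


# How wide must a neighborhood be to capture 10% of the cases?
for d in (1, 2, 10, 100):
    print(d, round(edge_length_for_fraction(0.10, d), 3))
```

Under this assumption, a neighborhood holding 10% of the cases covers 10% of the range of a single input, but roughly 79% of the range of each input when there are 10 inputs, so "nearby" cases are no longer close to each other.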
What is a biased sample?
A sample whose proportion of events differs from that of the population, such as a sample produced by oversampling
What is meant by observational data?
Data gathered for some purpose other than data analysis, i.e. operational data
What are some common problems found in operational data?
errors
missing data
redundant variables
Name three analytical challenges:
nonlinearities
interactions
model selection
What is the process of choosing the model with the highest predictive accuracy?
Model selection
What does it mean to overfit the data?
Using an overly complex model that is too sensitive to peculiarities in the sample data set and therefore does not generalize well to new data
What does it mean to underfit the data?
Underfitting the data occurs when the model is too simple and systematically misses the true features in the data.
When is joint sampling appropriate?
When the target event is not rare
What is joint sampling?
Joint sampling selects a representative sample of the data by randomly selecting input-target pairs.
Which sampling method would you choose when your event is not rare?
Joint sampling
Which sampling method would you choose when your event is rare?
Separate sampling (which is used to oversample the data)
Which sampling method creates a target-based sample by drawing cases separately for each target outcome (that is, separately for events and non-events)?
Separate sampling
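The difference between joint and separate sampling can be sketched in a few lines. This is a minimal illustration, not the method of any particular tool: the population, the field name "target", and the functions joint_sample, separate_sample, and event_rate are all assumptions made for the example.

```python
import random


def joint_sample(cases, n, seed=0):
    """Draw n input-target pairs at random, ignoring the target value."""
    rng = random.Random(seed)
    return rng.sample(cases, n)


def separate_sample(cases, n_events, n_nonevents, seed=0):
    """Draw events and non-events separately, which can oversample a rare event."""
    rng = random.Random(seed)
    events = [c for c in cases if c["target"] == 1]
    nonevents = [c for c in cases if c["target"] == 0]
    return rng.sample(events, n_events) + rng.sample(nonevents, n_nonevents)


def event_rate(sample):
    """Proportion of cases in the sample whose target is an event."""
    return sum(c["target"] for c in sample) / len(sample)


# Hypothetical population: 10,000 cases with a 2% event rate.
population = [{"id": i, "target": 1 if i < 200 else 0} for i in range(10_000)]

joint = joint_sample(population, 400)
separate = separate_sample(population, n_events=200, n_nonevents=200)
print(event_rate(joint), event_rate(separate))
```

The joint sample keeps an event rate close to the population's 2%, while the separate sample has exactly 50% events by construction, which is why a separately drawn sample is biased with respect to the population.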
What is the optimism principle?
The optimism principle states that when you assess the accuracy of a predictive model on the same data that was used to fit the model, you tend to get better assessment statistics than when you assess the model on other data.
What does a large difference between the performance on the training and test data sets usually indicate?
overfitting / optimism bias
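A small simulation shows how the gap arises. This is a sketch under stated assumptions, not an example from the cards: the data are synthetic with 30% of the labels flipped at random, and the "model" is a deliberately over-flexible 1-nearest-neighbor rule that effectively memorizes the training cases. The names make_case, predict_1nn, and accuracy are hypothetical.

```python
import random

rng = random.Random(42)


def make_case():
    """One synthetic case: the label follows x > 0.5, then is flipped 30% of the time."""
    x = rng.random()
    label = 1 if x > 0.5 else 0
    if rng.random() < 0.3:
        label = 1 - label
    return x, label


cases = [make_case() for _ in range(400)]
train, valid = cases[:200], cases[200:]


def predict_1nn(x):
    """Predict the label of the training case whose x is closest."""
    _, nearest_label = min(train, key=lambda c: abs(c[0] - x))
    return nearest_label


def accuracy(data):
    """Share of cases in `data` that the 1-NN rule classifies correctly."""
    return sum(predict_1nn(x) == y for x, y in data) / len(data)


train_acc = accuracy(train)  # each training case is its own nearest neighbor
valid_acc = accuracy(valid)  # cases the rule has never seen
print(train_acc, valid_acc)
```

Because every training case is its own nearest neighbor, the training accuracy is perfect even though the labels are noisy. The accuracy on the held-out cases is lower, and that difference is the optimism.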
What should you be concerned with as the underlying model becomes more flexible and the data become less plentiful?
overfitting
Name two activities that increase the risk of overfitting.
variable selection methods and supervised input preparation (such as collapsing levels of nominal variables based on associations with the target)
What is optimism bias?
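The tendency of assessment statistics computed on the training data to overstate model performance; with overfitting, it shows up as a large difference between performance on the training and test data sets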
What is assessing the performance of a model on new data that was not used to fit the original model referred to as?
honest assessment
What are some methods of honest assessment?
Splitting the data, k-fold cross-validation, and bootstrapping
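Of these methods, the k-fold approach is the easiest to show in code. The sketch below is one possible way to build the folds, using only the Python standard library; the function name k_fold_indices is hypothetical, and fitting and scoring a model on each fold is left out.

```python
import random


def k_fold_indices(n_cases, k, seed=0):
    """Yield (training_indices, holdout_indices) for each of k folds."""
    if not 2 <= k <= n_cases:
        raise ValueError("k must be between 2 and n_cases")
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, holdout in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, holdout


# Each case is held out exactly once and used for fitting k - 1 times.
for training, holdout in k_fold_indices(n_cases=10, k=5):
    print(sorted(holdout), len(training))
```

Each fold's assessment uses only cases that were not used to fit that fold's model, which is what makes the assessment honest.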
What is the simplest way to do an honest assessment of how well your model generalizes to new data?
Split your data into two data sets: a training data set and a validation data set
How does splitting the data allow you to avoid optimism bias?
Because you are assessing your model on different cases than you used to fit the model, you avoid the optimism bias and get a valid assessment measurement.
Which dataset is used to assess and compare models?
the validation dataset
Which dataset is used to fit the model?
the training dataset
Which dataset is used for comparing, selecting, and tuning models?
the validation dataset
Which dataset is used for the final assessment of a model?
the test dataset
How much of the data is typically used as validation dataset?
one-fourth to one-half of the data
Which sampling method will ensure that the training and validation sets have an equal percentage of events?
stratified random sampling
What are strata?
non-overlapping groups
Which data set is used to calculate the parameter estimates?
the training dataset
What does adding the OUTALL option on a PROC SURVEYSELECT do?
The OUTALL option returns the initial data set, augmented by a flag to indicate selection in the sample
How can you produce the same split of the data on each run of PROC SURVEYSELECT?
Set a value greater than 0 using the SEED= option
What PROC can be used to verify the stratification in a sample?
PROC FREQ
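The last few cards describe a workflow: stratify on the target, keep every case with a selection flag, fix the seed, and then check the event rate in each part. The sketch below is a rough Python analogue of that workflow, not SAS code and not a reproduction of PROC SURVEYSELECT or PROC FREQ. The function names, the "selected" flag name, and the synthetic data are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def flag_stratified_sample(cases, strata_key, rate, seed):
    """Return copies of all cases with a 'selected' flag (1 = sampled, 0 = not).

    Sampling is done separately within each stratum, so every stratum
    contributes about `rate` of its cases to the selected part.
    """
    rng = random.Random(seed)  # a fixed seed reproduces the same split
    by_stratum = defaultdict(list)
    for idx, case in enumerate(cases):
        by_stratum[case[strata_key]].append(idx)
    chosen = set()
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        chosen.update(rng.sample(members, round(len(members) * rate)))
    return [dict(case, selected=int(i in chosen)) for i, case in enumerate(cases)]


def event_rate_by_flag(flagged, target_key):
    """Cross-tabulate the flag against the target and return the event rate per flag."""
    counts = Counter((c["selected"], c[target_key]) for c in flagged)
    return {
        flag: counts[(flag, 1)] / (counts[(flag, 0)] + counts[(flag, 1)])
        for flag in (0, 1)
    }


# Hypothetical data: 1,000 cases with a 10% event rate.
data = [{"id": i, "target": 1 if i % 10 == 0 else 0} for i in range(1000)]
flagged = flag_stratified_sample(data, "target", rate=0.5, seed=12345)
print(event_rate_by_flag(flagged, "target"))
```

With these inputs each stratum is split exactly in half, so both parts keep the 10% event rate, and rerunning with the same seed returns the same flags.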