Understanding Predictive Modeling Flashcards
What is oversampling?
A sampling method that disproportionately over-represents the event cases (for example, by selecting an equal number of events and non-events)
What is a use case for oversampling?
Oversampling is typically used when the original data set is very large and the ratio of events to non-events is very small.
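A minimal sketch of oversampling, assuming the simplest variant: keep every event and randomly draw an equal number of non-events. The function name `oversample_events`, the `target` field, and the 2% event rate are hypothetical choices for illustration only.

```python
import random

def oversample_events(rows, is_event, seed=0):
    """Keep every event and draw an equal number of non-events at random."""
    rng = random.Random(seed)
    events = [r for r in rows if is_event(r)]
    non_events = [r for r in rows if not is_event(r)]
    # Draw non-events without replacement; cap at the number available.
    k = min(len(events), len(non_events))
    sample = events + rng.sample(non_events, k)
    rng.shuffle(sample)
    return sample

# Hypothetical data: 10,000 cases, of which 2% are events (target == 1).
rows = [{"id": i, "target": 1 if i % 50 == 0 else 0} for i in range(10_000)]
balanced = oversample_events(rows, lambda r: r["target"] == 1)
event_rate = sum(r["target"] for r in balanced) / len(balanced)
print(len(balanced), event_rate)  # 400 0.5
```

The resulting sample has a 50% event rate even though the full data set has 2%, which is why a sample built this way no longer reflects the population proportions.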
Name four common data-related challenges:
observational data
mixed measurement scales
high dimensionality
rare target events
What is the curse of dimensionality?
High dimensionality limits your ability to explore and model the relationships among the variables.
What are two issues associated with the curse of dimensionality?
The number of variables can affect computational performance more than the number of cases does.
Adding variables spreads the cases more sparsely across the input space, making it harder to identify relationships in the data.
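The "more spread out" effect can be illustrated with a small experiment: hold the number of cases fixed, add dimensions, and measure how far each case is from its nearest neighbor. This is only a sketch, assuming uniformly distributed inputs on the unit cube; the function name and the case counts are hypothetical.

```python
import math
import random

def mean_nearest_neighbor_distance(n_cases, n_dims, seed=0):
    """Average Euclidean distance from each case to its closest other case."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_cases)]
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_cases

# Same 200 cases each time; only the number of variables changes.
distances = {d: mean_nearest_neighbor_distance(200, d) for d in (1, 2, 10, 50)}
for d, dist in distances.items():
    print(d, round(dist, 3))
```

The nearest-neighbor distance grows as variables are added, so the same number of cases covers the input space less and less densely.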
What is a biased sample?
A sample whose proportion of events differs from that of the population, such as one produced by oversampling
What is meant by observational data?
Data gathered for some purpose other than data analysis, i.e. operational data
What are some common problems found in operational data?
errors
missing data
redundant variables
Name three analytical challenges:
nonlinearities
interactions
model selection
What is the process of choosing the model with the highest predictive accuracy?
Model selection
What does it mean to overfit the data?
Using an overly complex model that is too sensitive to peculiarities of the sample data set and therefore does not generalize well to new data
What does it mean to underfit the data?
Underfitting the data occurs when the model is too simple and systematically misses the true features in the data.
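Overfitting and underfitting can be compared side by side by fitting models of different flexibility and scoring them on both the training data and fresh data. The sketch below uses k-nearest-neighbor averaging as a stand-in model (k = 1 memorizes the sample, k = all cases predicts only the overall mean); the sine-plus-noise data and all names are assumptions made for illustration.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training cases whose input is closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN model built on train, scored on data."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

def make_data(n, rng):
    # Hypothetical signal: a sine curve plus Gaussian noise.
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]

rng = random.Random(0)
train, test = make_data(300, rng), make_data(1000, rng)

results = {}
for label, k in [("overfit", 1), ("balanced", 15), ("underfit", len(train))]:
    # Each entry is (training error, error on new data).
    results[label] = (mse(train, train, k), mse(train, test, k))
    print(label, k, results[label])
```

The k = 1 model has zero training error but does worse on new data than the moderate model, while the k = all model misses the sine shape entirely and scores poorly on both sets.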
When is joint sampling appropriate?
When the target event is not rare
What is joint sampling?
Joint sampling selects a representative sample of the data by randomly selecting input-target pairs.
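As a rough sketch, joint sampling amounts to a simple random sample of whole input-target pairs, so the event rate in the sample tracks the event rate in the data. The function name `joint_sample` and the 30% event rate are hypothetical.

```python
import random

def joint_sample(pairs, n, seed=0):
    """Simple random sample of (inputs, target) pairs, drawn without replacement."""
    return random.Random(seed).sample(pairs, n)

# Hypothetical data: 10,000 (inputs, target) pairs with a 30% event rate.
pairs = [((i,), 1 if i % 10 < 3 else 0) for i in range(10_000)]
sample = joint_sample(pairs, 2_000)

population_rate = sum(t for _, t in pairs) / len(pairs)
sample_rate = sum(t for _, t in sample) / len(sample)
print(population_rate, round(sample_rate, 3))
```

Because inputs and target are drawn together and events are not treated specially, the sample event rate should land close to the population rate, apart from ordinary sampling variation.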
Which sampling method would you choose when your event is not rare?
Joint sampling