Data Processing Flashcards
Data adequacy
Historical data must reflect future behavior
Sample must be representative
Older data less relevant
Impactful events should be noted
Be aware of sampling bias
Convert numeric to factor
Yes if:
-Variable has a small number of distinct values.
-Variable values are merely numeric labels (no sense of numeric order, e.g. group number)
-Variable has a complex relationship with the target → factor conversion gives models more flexibility to capture the relationship
No if:
-Variable has a monotonic relationship with the target, so its effect can be captured by treating it as numeric.
-Values have a sense of numeric order that might help predict the target
-Variable has a large number of distinct values, e.g. hour of day (would cause high dimension and overfitting if converted)
-Future observations will have new variable values (e.g. calendar year)
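The trade-off above can be sketched in plain Python. This is a minimal illustration, not a prescribed method: `one_hot` is a hypothetical helper, and the region codes are made-up data. It shows that a factor gets one indicator column per non-baseline level, so the column count grows with the number of distinct values.

```python
def one_hot(values):
    """Encode values as indicator columns, dropping the first (baseline) level."""
    levels = sorted(set(values))
    others = levels[1:]  # levels[0] is the baseline and gets no column
    rows = [[1 if v == lvl else 0 for lvl in others] for v in values]
    return others, rows

# Hypothetical data: region codes are labels, not quantities,
# so treating them as a factor is reasonable.
region_code = [1, 3, 2, 3, 1]
levels, design = one_hot(region_code)

# Hour of day has 24 distinct values: as a factor it needs 23 columns,
# versus a single column if kept numeric.
hour_levels, _ = one_hot(list(range(24)))
n_factor_columns = len(hour_levels)
```

With few levels the extra columns are cheap; with many levels they inflate the dimension, which is the overfitting concern noted in the card.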
Sparse level for categorical predictors
Combine levels where the target variable behaves similarly to form more representative and interpretable groups
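A minimal sketch of combining sparse levels, assuming made-up data. The helper names `level_summary` and `combine_levels` and the merged label `high_risk` are hypothetical. The idea is to inspect the count and mean target per level, then merge sparse levels whose target behaves alike.

```python
from collections import defaultdict

def level_summary(levels, target):
    """Return {level: (count, mean target)} for a categorical predictor."""
    groups = defaultdict(list)
    for lvl, y in zip(levels, target):
        groups[lvl].append(y)
    return {lvl: (len(ys), sum(ys) / len(ys)) for lvl, ys in groups.items()}

def combine_levels(levels, mapping):
    """Relabel levels using mapping; levels not in mapping are kept as-is."""
    return [mapping.get(lvl, lvl) for lvl in levels]

# Hypothetical data: occupation as predictor, claim indicator as target.
occupation = ["clerk", "clerk", "clerk", "pilot", "diver",
              "clerk", "teacher", "teacher", "teacher"]
claim = [0, 0, 1, 1, 1, 0, 0, 1, 0]

before = level_summary(occupation, claim)
# "pilot" and "diver" each have one observation and the same mean target,
# so they are merged into a single, less sparse level.
merged = combine_levels(occupation, {"pilot": "high_risk", "diver": "high_risk"})
after = level_summary(merged, claim)
```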
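AUC (area under the curve)
Measures a classifier's ability to separate the two classes across all cutoffs (area under the ROC curve): 0.5 = purely random classifier, 1 = perfect classifier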
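One way to see what AUC measures: it equals the share of (positive, negative) pairs in which the positive observation receives the higher predicted score, with ties counted as one half. The sketch below computes it that way. The function name `auc` and the scores are illustrative assumptions, and the pairwise loop is only practical for small data.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities.
y = [0, 0, 1, 1]
model_scores = [0.1, 0.4, 0.35, 0.8]
constant_scores = [0.5, 0.5, 0.5, 0.5]  # a classifier with no discrimination

model_auc = auc(y, model_scores)        # 3 of 4 pairs ranked correctly
constant_auc = auc(y, constant_scores)  # every pair is a tie
```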
Model validation based on test data
Predicted vs. actual values of target: the two sets should be close (can check quantitatively or graphically)
Benchmark model: show that the recommended model outperforms a benchmark model (intercept-only GLM, purely random classifier)
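A minimal sketch of the benchmark comparison on test data, assuming a numeric target and RMSE as the quantitative check. The card does not name a metric, so RMSE is an assumption here. All numbers are made up, and `model_pred` stands in for the recommended model's predictions.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

# Hypothetical data.
train_target = [10.0, 12.0, 14.0, 16.0]
test_actual = [11.0, 13.0, 15.0]
model_pred = [11.5, 12.5, 15.5]  # assumed output of the recommended model

# With an identity link, an intercept-only model predicts the training mean
# for every test observation.
train_mean = sum(train_target) / len(train_target)
benchmark_pred = [train_mean] * len(test_actual)

model_rmse = rmse(test_actual, model_pred)
benchmark_rmse = rmse(test_actual, benchmark_pred)
model_beats_benchmark = model_rmse < benchmark_rmse
```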
Handling outliers (problems with skewness)
-Remove it
-Keep it (if outliers make up an insignificant proportion of the data)
-Modify it (e.g. change a negative value to zero)
-Use robust model forms: fit models by minimizing the absolute error (instead of the squared error) between predicted and observed values.
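To illustrate the robust-model option in its simplest form: for a constant-only fit, minimizing squared error gives the mean, while minimizing absolute error gives the median, which is barely moved by an outlier. The data below is made up, and the two error helpers are hypothetical names used only for this sketch.

```python
import statistics

def total_abs_error(c, ys):
    """Sum of absolute errors when predicting the constant c."""
    return sum(abs(y - c) for y in ys)

def total_sq_error(c, ys):
    """Sum of squared errors when predicting the constant c."""
    return sum((y - c) ** 2 for y in ys)

ys = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical data; 100 is the outlier

mean_fit = statistics.mean(ys)      # minimizes squared error, pulled to 22
median_fit = statistics.median(ys)  # minimizes absolute error, stays at 3
```

The same idea extends to models with predictors, where the absolute-error fit is less sensitive to extreme observations than the squared-error fit.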
Gaussian
Symmetric and allows negative values
Continuous distribution
Normal → all real numbers
Gamma, inverse Gaussian → y > 0
Poisson
Count, frequency
Non-negative integer values (0, 1, 2, ...)
Tweedie
Mixture of continuous and discrete
Non-negative real values (point mass at zero, continuous positive values)
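The distribution cards can be condensed into a rough rule of thumb that matches the support of the target to candidate distributions. The helper below is a hypothetical sketch of that rule only. It looks at nothing but the range and type of the observed values, so it is a starting point rather than a substitute for judgment.

```python
def candidate_distributions(values):
    """Suggest target distributions whose support matches the observed values."""
    if any(v < 0 for v in values):
        return ["Gaussian"]                # only option allowing negatives
    if all(float(v).is_integer() for v in values):
        return ["Poisson"]                 # non-negative counts
    if any(v == 0 for v in values):
        return ["Tweedie"]                 # zeros mixed with positive amounts
    return ["gamma", "inverse Gaussian"]   # strictly positive, continuous

# Hypothetical targets.
profit = [-1.2, 0.3, 4.0]
claim_count = [0, 2, 1]
aggregate_loss = [0.0, 12.5, 3.1]
claim_severity = [1.5, 2.7]
```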