2 Preprocessing Flashcards
The figure below provides information on the housing variable in a credit risk dataset, with pie charts for the full customer base and separately for the good (non-defaulted) and bad (defaulted) observations. Which statement about the figure is TRUE?
[Figure: pie charts of the housing variable (own / rent / for free) for all customers, goods and bads. Goods: small shares of rent and for free; bads: larger shares of rent and for free.]
a) The housing variable is predictive for the target variable, since the distributions of the goods and bads are different.
b) Default risk for a customer living for free is lower than for a customer who is an owner.
c) A good customer is less likely to be an owner than a renter.
d) A bad customer is more likely to be an owner than a renter and therefore an owner is more likely to be bad.
a) The housing variable is predictive for the target variable, since the distributions of the goods and bads are different.
Which of the following statements is TRUE? Give answer D if all statements are false.
a) When the class distribution is heavily skewed, you will automatically obtain more reliable or robust weights-of-evidence values if more observations are used for calculating weights-of-evidence values.
b) When applying weights-of-evidence, you can expect to lose predictive power, no matter which classification method is applied.
c) If we can have at most five variables in our final model, then selecting the five variables with the highest information value will lead to the best predictive performance of the resulting model.
d) All the above statements are false.
d) All the above statements are false.
a) FALSE: the reliability of weights-of-evidence values is not determined solely by the number of observations; with a heavily skewed class distribution a category can still contain very few bads (or goods), so the estimates stay unstable no matter how large the total sample is.
b) FALSE: weights-of-evidence coding does not automatically lose predictive power, because it is based on the relative proportions of goods and bads per category; for some classifiers it can even improve performance.
c) FALSE: information value is a univariate measure; picking the five variables with the highest information value ignores correlations (multicollinearity) between them, so it does not guarantee the best five-variable model.
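Since weights-of-evidence and information value recur in these questions, here is a minimal sketch of how they are commonly computed (one common convention: WoE = ln of the distribution of goods over the distribution of bads per category). The `housing`/`default` column names and toy data are illustrative assumptions, not from the question:

```python
import numpy as np
import pandas as pd

def woe_iv(df, feature, target):
    """Weight of evidence per category and information value for one categorical feature.
    target: 1 = bad (default), 0 = good."""
    stats = df.groupby(feature)[target].agg(bads="sum", total="count")
    stats["goods"] = stats["total"] - stats["bads"]
    # distribution of goods/bads over the categories (small constant avoids log(0))
    dist_good = (stats["goods"] + 0.5) / (stats["goods"].sum() + 0.5)
    dist_bad = (stats["bads"] + 0.5) / (stats["bads"].sum() + 0.5)
    stats["woe"] = np.log(dist_good / dist_bad)
    iv = ((dist_good - dist_bad) * stats["woe"]).sum()
    return stats, iv

# hypothetical toy data with a housing variable
df = pd.DataFrame({
    "housing": ["own", "own", "rent", "rent", "free", "own", "rent", "free"],
    "default": [0, 0, 1, 0, 1, 0, 1, 0],
})
table, iv = woe_iv(df, "housing", "default")
print(table[["goods", "bads", "woe"]])
print("information value:", round(iv, 3))
```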
Why do categorical variables need to be transformed into continuous variables? Is that always necessary? Discuss using an example.
Many techniques (e.g., logistic regression, neural networks) can only work with numeric inputs, so categorical variables such as housing or purpose of the loan must be encoded, for instance with dummy (one-hot) variables or weights-of-evidence, before the model can use them.
It is not always necessary: tree-based methods such as decision trees can split directly on the categories, so the need for encoding depends on the algorithm being used.
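A minimal sketch of dummy (one-hot) encoding with pandas; the `housing` column and its values are illustrative assumptions:

```python
import pandas as pd

# hypothetical applicant data with a categorical housing variable
df = pd.DataFrame({"housing": ["own", "rent", "free", "own"]})

# one dummy column per category; drop_first removes the redundant reference category
dummies = pd.get_dummies(df["housing"], prefix="housing", drop_first=True)
print(dummies)
```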
Why do continuous variables need to be transformed into categorical variables? Is that always necessary? Discuss using an example.
(1) Some techniques require the variable to be categorical, e.g., a traditional credit scorecard where each characteristic is split into classes before weights-of-evidence are computed.
(2) Interpretability: business users often find bins easier to read, e.g., age segmented into bands such as 18-25, 26-40, 41-65 and 65+.
(3) Binning lets a linear model (such as logistic regression) capture a non-linear relationship with the target, because each bin gets its own coefficient (see the sketch below).
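A minimal sketch of coarse-classifying a continuous variable into bins with pandas; the age values and bin edges are illustrative assumptions:

```python
import pandas as pd

# hypothetical applicant ages
age = pd.Series([19, 23, 35, 41, 52, 67, 70])

# hand-picked bins chosen for interpretability
bands = pd.cut(age, bins=[18, 25, 40, 65, 120],
               labels=["18-25", "26-40", "41-65", "65+"])

# alternatively, quantile-based bins with roughly equal numbers of observations
quartiles = pd.qcut(age, q=4)

print(bands.value_counts())
print(quartiles.value_counts())
```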
Explain why and when the log transformation is used.
To reduce the (right) skewness of a measurement variable such as income: taking the logarithm compresses the large values, so the distribution becomes more symmetric.
This helps models that are sensitive to skewed inputs or extreme values (e.g., linear and logistic regression) behave better, because the fit is no longer dominated by a few very large observations.
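A minimal sketch of the transformation with NumPy; log1p (log of 1 + x) is used so that zero incomes remain valid, and the income values are illustrative:

```python
import numpy as np

# hypothetical, right-skewed income values (a few very large earners)
income = np.array([0, 1200, 1500, 2000, 2400, 3100, 25000, 80000])

# log(1 + x) keeps zeros valid and compresses the extreme values
log_income = np.log1p(income)

print(np.round(log_income, 2))
```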
What is in your opinion the best approach to handle missing values?
Keep the missingness itself as information, since the fact that a value is missing can be predictive of default (e.g., an applicant who does not report income). For categorical variables, add a separate "missing" category; for continuous variables, impute (e.g., with the median) and add a missing-indicator flag, rather than simply deleting the observations.
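A minimal sketch of this keep-as-information approach with pandas; the column names and values are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# hypothetical applicant data with missing values
df = pd.DataFrame({
    "income": [2500, np.nan, 3100, np.nan, 1800],
    "housing": ["own", "rent", None, "free", "own"],
})

# categorical: treat missing as its own category
df["housing"] = df["housing"].fillna("missing")

# continuous: keep an indicator flag, then impute with the median
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```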
Why is it important to detect and treat outliers in preprocessing a data set? Is it always necessary or recommendable to do that? Discuss using an example.
Outliers can distort statistical analysis and the training of a machine learning algorithm (e.g., they pull the mean and regression coefficients towards themselves), which can lower accuracy.
It is not always necessary or recommendable: it depends on whether the outlier is an invalid value (a data-entry error such as age = 300, which should be corrected or treated) or a valid but extreme value that carries information.
For example, in loan-application data an extreme value may point to fraud; removing or capping it would throw away exactly the signal a fraud-detection model needs.
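A minimal sketch of detecting outliers with the interquartile-range rule and treating them by capping (truncation); the income values and the 1.5 x IQR fences are illustrative assumptions:

```python
import pandas as pd

# hypothetical incomes, with one extreme observation
income = pd.Series([1800, 2200, 2500, 2700, 3100, 3400, 150000])

# detect: interquartile-range rule (more robust than mean/std on small, skewed samples)
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("flagged as outliers:\n", income[(income < lower) | (income > upper)])

# treat valid-but-extreme values by capping them at the fences instead of deleting them
income_capped = income.clip(lower, upper)
print(income_capped)
```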