How to Transform Numerical to Categorical Data: Suitable For Highly Skewed or Non-Standard Distribution Flashcards
WHAT DO DISCRETIZATION TRANSFORMS DO? P303
Discretization transforms are a technique for transforming numerical input or output variables to have:
Discrete ordinal labels
Different data distribution
WHICH SCIKIT-LEARN CLASS DO WE USE FOR CHANGING THE STRUCTURE AND DISTRIBUTION OF NUMERIC VARIABLES TO CATEGORICAL TO IMPROVE THE PERFORMANCE OF PREDICTIVE MODELS? P303
KBinsDiscretizer (a class in scikit-learn's sklearn.preprocessing module)
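As a rough illustration of the card above, the sketch below fits KBinsDiscretizer on a small, made-up skewed feature. The data values and the choice of 4 quantile bins are assumptions for demonstration only, not taken from the source.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy skewed feature (assumed data): one column, eight rows.
X = np.array([[0.1], [0.4], [0.5], [1.2], [2.0], [3.5], [7.0], [20.0]])

# Map each value to an integer bin label using 4 quantile-based bins.
disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
X_binned = disc.fit_transform(X)

print(X_binned.ravel())    # ordinal bin label for each row
print(disc.bin_edges_[0])  # learned bin edges for the single feature
```

The fitted bin_edges_ attribute holds one array of edges per input feature, which is handy for checking where the cut points landed.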
WHAT ARE 3 COMMON METHODS WE CAN USE FOR GROUPING VALUES INTO K DISCRETE BINS? P305
Uniform: All bins in each feature have identical widths.
Quantile: All bins in each feature have approximately the same number of points.
K-means: Clusters are identified and each example is assigned to the bin of its nearest cluster.
WHAT ARE THE VALUES WE CAN USE FOR ‘STRATEGY’ PARAMETER OF KBINSDISCRETIZER? P305
“uniform”, “quantile”, and “kmeans”.
WHAT DOES THE N_BINS PARAMETER OF KBINSDISCRETIZER MEAN, AND ON WHAT BASIS IS IT SET? P305
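To see how the three strategy values differ, the following sketch fits each one on the same synthetic skewed sample and prints the learned bin edges. The exponential data, the seed, and n_bins=5 are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic right-skewed feature (assumed data, fixed seed for repeatability).
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(500, 1))

edges = {}
for strategy in ("uniform", "quantile", "kmeans"):
    disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy=strategy)
    disc.fit(X)
    # One array of n_bins + 1 edges per feature; we have a single feature.
    edges[strategy] = disc.bin_edges_[0]
    print(strategy, np.round(edges[strategy], 2))
```

On skewed data like this, the uniform edges are evenly spaced, while the quantile edges crowd together where the observations are dense.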
It controls the number of bins that will be created.
It must be set based on the choice of strategy.
HOW IS N_BINS CHOSEN FOR DIFFERENT STRATEGIES IN KBINSDISCRETIZER? P305
Uniform: flexible
Quantile: n_bins must be less than the number of observations and should correspond to sensible percentiles.
K-means: a value for the number of clusters that can be reasonably found
WHAT DOES THE ENCODE ARGUMENT CONTROL IN KBINSDISCRETIZER? P305
It controls whether the transform maps each value to an integer bin label (“ordinal”) or to a one-hot encoding (“onehot”).
WHICH METHOD OF ENCODING IS PREFERRED IN KBINSDISCRETIZER? P305
Ordinal
WHEN DO WE USE ONEHOT FOR THE ENCODE PARAMETER IN KBINSDISCRETIZER? P305
For example, when using the k-means clustering strategy.
WHAT SORT OF RELATIONSHIPS CAN A MODEL LEARN WHEN WE USE ONEHOT ENCODING? P305
Non-ordinal relationships
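The difference between the two encodings shows up in the shape of the output. The sketch below is a minimal comparison on evenly spaced, made-up data; it uses “onehot-dense” (a third accepted value of encode that returns a dense array rather than a sparse matrix) only to keep the result easy to inspect.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Evenly spaced toy feature (assumed data): 20 rows, one column.
X = np.linspace(0.0, 10.0, 20).reshape(-1, 1)

# Ordinal: one output column holding the integer bin label.
ordinal = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
X_ord = ordinal.fit_transform(X)

# One-hot: one output column per bin, with a single 1 in each row.
onehot = KBinsDiscretizer(n_bins=4, encode="onehot-dense", strategy="uniform")
X_hot = onehot.fit_transform(X)

print(X_ord.shape)  # (20, 1)
print(X_hot.shape)  # (20, 4)
```

With the ordinal output a model sees bin 3 as "greater than" bin 1; with the one-hot output each bin is an independent indicator, so no order is implied.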
WHICH RANGE IS SUITABLE FOR K-MEANS STRATEGY’S N_BINS IN KBINSDISCRETIZER? P312
3-5, unless the empirical distribution of the variable is complex.
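As a sketch of why a small n_bins suits the k-means strategy, the example below builds a feature with three well-separated groups and asks for three bins. The group centers, spreads, and sizes are assumptions chosen so that the clusters are easy to find.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Feature with three clearly separated groups (assumed data).
rng = np.random.default_rng(3)
X = np.concatenate([
    rng.normal(0.0, 0.5, 200),
    rng.normal(10.0, 0.5, 200),
    rng.normal(25.0, 0.5, 200),
]).reshape(-1, 1)

# Ask k-means for as many bins as there are plausible clusters.
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
labels = disc.fit_transform(X).ravel().astype(int)

counts = np.bincount(labels, minlength=3)
print(counts)              # number of examples assigned to each bin
print(disc.bin_edges_[0])  # inner edges fall between the groups
```

Asking for many more bins than there are natural groups would force the clustering to split groups arbitrarily, which is one reading of why the card suggests keeping the value small.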
DOES UNIFORM DISCRETIZATION TRANSFORM CHANGE THE PROBABILITY DISTRIBUTION? P309
No. (My note: because the bin widths are all the same, it is like building a histogram of the variable.)
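The histogram analogy can be checked directly. The sketch below, on assumed synthetic exponential data, compares the per-bin counts from a uniform KBinsDiscretizer with the counts from numpy.histogram using the same number of equal-width bins.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Right-skewed synthetic feature (assumed data, fixed seed).
rng = np.random.default_rng(1)
X = rng.exponential(scale=1.0, size=(1000, 1))

# Uniform strategy: 10 equal-width bins between the min and max of X.
disc = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform")
labels = disc.fit_transform(X).ravel().astype(int)

# Count examples per bin and compare with an ordinary 10-bin histogram.
bin_counts = np.bincount(labels, minlength=10)
hist_counts, _ = np.histogram(X.ravel(), bins=10)

print(bin_counts)
print(hist_counts)
```

The two sets of counts should agree (or nearly so), and the skew of the original variable is still visible in the binned labels: most examples land in the first few bins.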
WHAT RANGE OF VALUES IS SUITABLE FOR N_BINS WITH THE QUANTILE STRATEGY IN KBINSDISCRETIZER? P315
5-10. The number of bins should be kept small unless there are a large number of observations or a complex empirical distribution.
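For contrast with the uniform case, here is a minimal sketch of the quantile strategy with n_bins=10 on assumed log-normal data. It counts how many examples fall in each bin to show that the bins are roughly equally populated.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Heavily skewed synthetic feature (assumed data, fixed seed).
rng = np.random.default_rng(2)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Quantile strategy: edges are placed at percentiles of the data.
disc = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
labels = disc.fit_transform(X).ravel().astype(int)

counts = np.bincount(labels, minlength=10)
print(counts)  # roughly equal counts per bin
```

Because every bin holds about the same share of the data, the binned variable is close to uniformly distributed, which is how this strategy reshapes a skewed input. With very few observations per bin the percentiles become noisy, which fits the advice to keep n_bins small.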