How to Encode Categorical Data Flashcards
WHAT ARE THE TWO MOST POPULAR TECHNIQUES FOR ENCODING CATEGORICAL DATA? P258
The two most popular techniques are an Ordinal encoding and a One Hot encoding.
WHAT IS DISCRETIZATION? P259
Converting a numerical variable to an ordinal variable by dividing the range of the numerical variable into bins and assigning values to each bin.
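As a rough illustration of discretization, the sketch below uses scikit-learn's KBinsDiscretizer to split ten made-up values into three equal-width bins. The data, the bin count, and the "uniform" strategy are assumptions chosen for the example, not something the card prescribes.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical numerical variable: ten values from 0 to 9, one column.
values = np.arange(10, dtype=float).reshape(-1, 1)

# Three equal-width bins over [0, 9]; each value is replaced by its bin index.
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
bins = discretizer.fit_transform(values).ravel()

# Values 0-2 fall in bin 0, 3-5 in bin 1, and 6-9 in bin 2.
print(bins)
```

The resulting bin indices keep the order of the original values, which is why the output is treated as an ordinal variable.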
WHAT IS THE DIFFERENCE BETWEEN NOMINAL AND ORDINAL VARIABLES? P259
The values of a nominal variable have no rank order, whereas the values of an ordinal variable do.
WHAT IS AN EXAMPLE OF AN ALGORITHM CAPABLE OF WORKING WITH CATEGORICAL DATA DIRECTLY? P259
Decision trees. A decision tree can be learned directly from categorical data with no data transform required, although this depends on the specific implementation.
WHAT IS THE DIFFERENCE BETWEEN LABEL ENCODER AND ORDINAL ENCODER? P260
LabelEncoder expects 1-D input (it is intended for the target variable), whereas OrdinalEncoder accepts a 2-D matrix of input features. Other than this, they perform the same integer encoding.
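The difference in expected input shape can be seen in a small sketch. The color and size values are invented for illustration; both encoders assign integers in sorted (here alphabetical) order of the category names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder: a 1-D array, typically the target variable.
y = np.array(["red", "green", "blue", "green"])
label_codes = LabelEncoder().fit_transform(y)

# OrdinalEncoder: a 2-D matrix with one column per input variable.
X = np.array([
    ["red", "small"],
    ["green", "large"],
    ["blue", "medium"],
    ["green", "small"],
])
ordinal_codes = OrdinalEncoder().fit_transform(X)

print(label_codes.shape)    # 1-D: one code per sample
print(ordinal_codes.shape)  # 2-D: one code per sample and column
```

Note that OrdinalEncoder returns floating point values by default, while LabelEncoder returns integers.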
WHY CAN ORDINAL ENCODER CAUSE PROBLEMS IF USED FOR NOMINAL VARIABLES? WHAT CAN BE USED INSTEAD? P260
An integer ordinal encoding is a natural encoding for ordinal variables. For nominal variables, it imposes an ordinal relationship where no such relationship may exist. This can mislead the model, so a one hot encoding may be used instead.
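One way to see the imposed ordering is to compare distances between encoded categories. In this sketch, which uses three made-up color values, the ordinal codes make red look twice as far from blue as green is, while the one hot vectors are all equally far apart.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical nominal variable with no natural order.
colors = [["blue"], ["green"], ["red"]]

ordinal = OrdinalEncoder().fit_transform(colors).ravel()
one_hot = OneHotEncoder().fit_transform(colors).toarray()

# Distances under the ordinal encoding depend on the arbitrary integer order.
ordinal_blue_green = abs(ordinal[1] - ordinal[0])
ordinal_blue_red = abs(ordinal[2] - ordinal[0])

# Distances under the one hot encoding are the same for every pair.
one_hot_blue_green = np.linalg.norm(one_hot[1] - one_hot[0])
one_hot_blue_red = np.linalg.norm(one_hot[2] - one_hot[0])

print(ordinal_blue_green, ordinal_blue_red)
print(one_hot_blue_green, one_hot_blue_red)
```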
WHAT IS THE USE OF PARAMETER “CATEGORIES” IN ONE HOT ENCODER? P262
If you know all of the labels to be expected in the data, they can be specified via the categories argument as a list.
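A minimal sketch of the categories argument, assuming a color variable whose full label set is known in advance even though "red" does not appear in the (made-up) training data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Training data happens to contain only two of the three known labels.
X_train = np.array([["blue"], ["green"], ["blue"]])

# One list of labels per input column; the column order follows this list.
encoder = OneHotEncoder(categories=[["blue", "green", "red"]])
encoded = encoder.fit_transform(X_train).toarray()

# Three columns are created, including an all-zero column reserved for "red".
print(encoded)
```

Specifying the labels up front keeps the encoded width stable even when a label is missing from a particular sample of the data.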
WHAT DOES ONE HOT ENCODER DO WHEN IT ENCOUNTERS UNKNOWN CATEGORIES IN NEW DATA? P262
If new data contains categories not seen in the training dataset, the handle_unknown argument can be set to ‘ignore’ so that no error is raised. The unseen category is then encoded as a zero value in every one hot column.
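The behavior described above can be sketched as follows. The "purple" value is a hypothetical category that was not present when the encoder was fitted.

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on three known colors.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit([["blue"], ["green"], ["red"]])

# "purple" was never seen during fit; with handle_unknown="ignore"
# it is encoded as all zeros instead of raising an error.
new_encoded = encoder.transform([["green"], ["purple"]]).toarray()
print(new_encoded)
```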
WHAT’S ONE WAY TO REDUCE REDUNDANCY WHEN USING ONE HOT ENCODER? P262
Using dummy encoding
HOW DOES DUMMY ENCODER WORK? P262
When there are C categories, it creates C-1 columns, and one category is represented by all 0 values. For example, if we know that [1, 0] represents blue and [0, 1] represents green, we don’t need another binary variable to represent red; instead we could use 0 values alone, e.g. [0, 0].
WHEN WORKING WITH TREE-BASED MODELS, IS FULL ONE HOT ENCODING BETTER OR IS DUMMY ENCODING BETTER? P262
We recommend using the full set of dummy variables when working with tree-based models.
HOW CAN WE IMPLEMENT DUMMY ENCODING USING ONE HOT ENCODER CLASS? P262
The “drop” parameter can be set to indicate which category will become the one that’s assigned all zero values.
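As a small sketch of dummy encoding with OneHotEncoder, the example below sets drop="first", which is one of the accepted values and drops the first category of each column. The three colors are made-up data.

```python
from sklearn.preprocessing import OneHotEncoder

colors = [["blue"], ["green"], ["red"]]

# drop="first" removes the first category ("blue" in sorted order),
# so three categories are encoded with two columns.
encoder = OneHotEncoder(drop="first")
dummy = encoder.fit_transform(colors).toarray()

# "blue" becomes the all-zero row; "green" and "red" keep one column each.
print(dummy)
```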
WHAT IS THE CATEGORY THAT’S ASSIGNED ALL ZEROS CALLED? P262
Baseline
WHAT IF I HAVE A MIXTURE OF CATEGORICAL AND ORDINAL DATA? P270
You will need to prepare or encode each variable (column) in your dataset separately, then concatenate all of the prepared variables back together into a single array for fitting or evaluating the model. Alternatively, you can use the ColumnTransformer to conditionally apply different data transforms to different input variables.
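A minimal ColumnTransformer sketch, assuming a two-column dataset in which column 0 is a nominal color and column 1 is an ordinal size. The data, the transformer names, and the size ordering are illustrative assumptions.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data: column 0 is nominal (color), column 1 is ordinal (size).
X = np.array([
    ["red", "small"],
    ["green", "large"],
    ["blue", "medium"],
], dtype=object)

transformer = ColumnTransformer(
    transformers=[
        # One hot encode the nominal column.
        ("nominal", OneHotEncoder(), [0]),
        # Integer encode the ordinal column using an explicit rank order.
        ("ordinal", OrdinalEncoder(categories=[["small", "medium", "large"]]), [1]),
    ],
    sparse_threshold=0,  # always return a dense array
)
X_encoded = transformer.fit_transform(X)

# Columns: blue, green, red (one hot), then size as 0, 1, or 2.
print(X_encoded)
```

The encoded outputs are concatenated in the order the transformers are listed, so the manual "encode separately, then concatenate" step is handled by the transformer.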
WHAT IF I HAVE HUNDREDS OF CATEGORIES? P270
You can use a one hot encoding with thousands or even tens of thousands of categories. Large input vectors may sound intimidating, but models can generally handle them.