How to Encode Categorical Data Flashcards

Question 1

Q

WHAT ARE THE TWO MOST POPULAR TECHNIQUES FOR ENCODING CATEGORICAL DATA? P258

Answer

A

The two most popular techniques are an Ordinal encoding and a One Hot encoding.
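
As a minimal sketch of the two encodings, assuming scikit-learn is installed and using a made-up single column of colors (the data is hypothetical, not from the source):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical single-column dataset of colors.
X = np.array([["red"], ["green"], ["blue"]], dtype=object)

# Ordinal encoding: each category maps to one integer.
# Categories are sorted, so blue=0, green=1, red=2.
ordinal = OrdinalEncoder().fit_transform(X)

# One hot encoding: one binary column per category (blue, green, red).
# The encoder returns a sparse matrix by default, so convert it for display.
onehot = OneHotEncoder().fit_transform(X).toarray()

print(ordinal.ravel())
print(onehot)
```

The ordinal result is a single column of integers, while the one hot result has one column per category.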

Question 2

Q

WHAT IS DISCRETIZATION? P259

Answer

A

Converting a numerical variable to an ordinal variable by dividing the range of the numerical variable into bins and assigning values to each bin.
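
The idea can be sketched with a hand-rolled equal-width binning function. The function name `discretize` and the sample values are hypothetical, and equal-width bins are only one possible binning strategy; scikit-learn also offers a KBinsDiscretizer class for this task.

```python
def discretize(values, n_bins):
    """Map each numeric value to the index of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clip it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


values = [0, 2, 4, 6, 8, 10]
bins = discretize(values, n_bins=3)
print(bins)
```

Each value is replaced by an ordinal bin index, so the order of the bins still reflects the order of the original numbers.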

Question 3

Q

WHAT IS THE DIFFERENCE BETWEEN NOMINAL AND ORDINAL VARIABLES? P259

Answer

A

There is no rank order among the values of a nominal variable, but there is a rank order among the values of an ordinal variable.

Question 4

Q

WHAT IS AN EXAMPLE OF AN ALGORITHM CAPABLE OF WORKING WITH CATEGORICAL DATA DIRECTLY? P259

Answer

A

Decision trees. Depending on the specific implementation, a decision tree can be learned directly from categorical data with no data transform required.

Question 5

Q

WHAT IS THE DIFFERENCE BETWEEN LABEL ENCODER AND ORDINAL ENCODER? P260

Answer

A

LabelEncoder expects a 1-D input (it is intended for the target variable), whereas OrdinalEncoder accepts a 2-D matrix of input features. Other than this, they do the same thing.
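
A short sketch of the difference in input shape, assuming scikit-learn and using hypothetical color and size values:

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder takes a 1-D sequence, typically the target variable.
y = ["red", "green", "blue"]
labels = LabelEncoder().fit_transform(y)

# OrdinalEncoder takes a 2-D array and encodes each column independently.
X = [["red", "S"], ["green", "M"], ["blue", "L"]]
features = OrdinalEncoder().fit_transform(X)

print(labels)
print(features)
```

Both classes assign integers to the sorted categories; only the expected shape of the input differs.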

Question 6

Q

WHY CAN ORDINAL ENCODER CAUSE PROBLEMS IF USED FOR NOMINAL VARIABLES? WHAT CAN BE USED INSTEAD? P260

Answer

A

An integer ordinal encoding is a natural encoding for ordinal variables. For nominal variables, it imposes an ordinal relationship where none may exist, so a model may treat the integer order as meaningful. A one hot encoding can be used instead.

Question 7

Q

WHAT IS THE USE OF PARAMETER “CATEGORIES” IN ONE HOT ENCODER? P262

Answer

A

If you know all of the labels to be expected in the data, they can be specified via the categories argument as a list.
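
A minimal sketch of passing the expected labels explicitly, assuming scikit-learn. The label list and sample rows are hypothetical. The argument takes one list of labels per input column, and the output columns follow the order given.

```python
from sklearn.preprocessing import OneHotEncoder

# One list of known labels per input column (a single column here).
known_labels = [["red", "green", "blue"]]
encoder = OneHotEncoder(categories=known_labels)

# The data contains no "green" row, yet a column is still reserved for it.
X = [["blue"], ["red"]]
encoded = encoder.fit_transform(X).toarray()

print(encoded)
```

Without the argument, the encoder would only create columns for the labels it happens to see during fitting.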

Question 8

Q

WHAT DOES ONE HOT ENCODER DO WHEN IT ENCOUNTERS UNKNOWN CATEGORIES IN NEW DATA? P262

Answer

A

By default, the encoder raises an error. If new data contains categories not seen in the training dataset, the handle_unknown argument can be set to ‘ignore’ so that no error is raised; the unseen category is then encoded as all zeros (a zero value for each label).
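
The following sketch contrasts the default behavior with the ‘ignore’ setting, assuming scikit-learn. The training and new rows are hypothetical.

```python
from sklearn.preprocessing import OneHotEncoder

train = [["red"], ["green"]]
new = [["blue"]]  # category never seen during fitting

# Default setting: an unknown category raises a ValueError at transform time.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError:
    raised = True

# With handle_unknown="ignore", the unknown row is encoded as all zeros.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(new).toarray()

print(raised)
print(encoded)
```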

Question 9

Q

WHAT’S ONE WAY TO REDUCE REDUNDANCY WHEN USING ONE HOT ENCODER? P262

Answer

A

Using dummy encoding

Question 10

Q

HOW DOES DUMMY ENCODER WORK? P262

Answer

A

When there are C categories, it creates C-1 columns, and one category is represented by all 0 values. For example, with the colors blue, green, and red, if [1, 0] represents blue and [0, 1] represents green, we don’t need another binary variable to represent red; we can use 0 values alone, e.g. [0, 0].
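
To make the C-1 idea concrete, here is a hand-rolled sketch in plain Python. The function name `dummy_encode` is hypothetical, and the first category in the list is arbitrarily chosen as the one that gets all zeros.

```python
def dummy_encode(values, categories):
    """Encode values with C-1 binary columns; the first category is all zeros."""
    _all_zero_category, *kept = categories
    return [[1 if value == category else 0 for category in kept] for value in values]


# Three categories, so each row has two columns (blue, green); red is all zeros.
rows = dummy_encode(["blue", "green", "red"], categories=["red", "blue", "green"])
print(rows)
```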

Question 11

Q

WHEN WORKING WITH TREE-BASED MODELS, IS FULL ONE HOT ENCODING BETTER OR IS DUMMY ENCODING BETTER? P262

Answer

A

Full one hot encoding: the recommendation is to use the full set of dummy variables when working with tree-based models.

Question 12

Q

HOW CAN WE IMPLEMENT DUMMY ENCODING USING ONE HOT ENCODER CLASS? P262

Answer

A

The “drop” parameter can be set to indicate which category will become the one that’s assigned all zero values.
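
A minimal sketch of this with scikit-learn, assuming the hypothetical color data below. Setting drop to "first" drops the first category in sorted order, which is "blue" here.

```python
from sklearn.preprocessing import OneHotEncoder

X = [["red"], ["green"], ["blue"]]

# drop="first" removes the first sorted category ("blue"), leaving two columns.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(X).toarray()

# Remaining columns are green and red; the blue row is all zeros.
print(encoded)
```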

Question 13

Q

WHAT IS THE CATEGORY THAT’S ASSIGNED ALL ZEROS CALLED? P262

Question 14

Q

WHAT IF I HAVE A MIXTURE OF CATEGORICAL AND ORDINAL DATA? P270

Answer

A

You will need to prepare or encode each variable (column) in your dataset separately, then concatenate all of the prepared variables back together into a single array for fitting or evaluating the model. Alternatively, you can use the ColumnTransformer to conditionally apply different data transforms to different input variables.
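
The ColumnTransformer approach might look like the sketch below, assuming scikit-learn. The two-column dataset (an ordinal size and a nominal color), the step names "ord" and "nom", and the size ordering are all hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Column 0 is ordinal (size), column 1 is nominal (color).
X = np.array([["S", "red"], ["L", "green"], ["M", "blue"]], dtype=object)

transformer = ColumnTransformer(
    [
        # Explicit order for the ordinal column: S < M < L.
        ("ord", OrdinalEncoder(categories=[["S", "M", "L"]]), [0]),
        # One hot encode the nominal column (columns: blue, green, red).
        ("nom", OneHotEncoder(), [1]),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = transformer.fit_transform(X)
print(encoded)
```

The output holds the ordinal column first, followed by the three one hot columns, in the order the steps are listed.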

Question 15

Q

WHAT IF I HAVE HUNDREDS OF CATEGORIES? P270

Answer

A

A one hot encoding can be used with up to thousands or tens of thousands of categories. Large input vectors sound intimidating, but models can generally handle them.
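
One reason wide encodings stay manageable is that scikit-learn's OneHotEncoder returns a sparse matrix by default, which stores only the non-zero entries. The sketch below uses 1,000 made-up category names to illustrate this; it is not a benchmark.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column with 1,000 distinct categories, one per row.
X = [[f"cat_{i}"] for i in range(1000)]

encoded = OneHotEncoder().fit_transform(X)

# The matrix is 1000 x 1000, but only one entry per row is stored.
print(encoded.shape)
print(encoded.nnz)
```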

(15 cards)