Transformers Flashcards

Question 1

Q

How do you create your own custom transformer the “easy way”, without creating it directly as a class?

Answer

A

from sklearn.preprocessing import FunctionTransformer

my_transformer = FunctionTransformer(my_custom_function)
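A minimal runnable sketch of the above. The body of my_custom_function (a log(1 + x) transform) and the sample array are assumptions chosen only for illustration:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical custom function: element-wise log(1 + x).
def my_custom_function(X):
    return np.log1p(X)

# Wrap the plain function so it behaves like any other transformer.
my_transformer = FunctionTransformer(my_custom_function)

X = np.array([[0.0, 1.0], [3.0, 7.0]])
X_trans = my_transformer.fit_transform(X)  # same as calling my_custom_function(X)
```

Because the wrapped object has .fit() and .transform(), it can be used anywhere a regular transformer is expected.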

Question 2

Q

What does StandardScaler do?

Answer

A

Transforms every feature (predictor) in X by subtracting the feature’s mean and dividing by the feature’s SD. Essentially, turns the feature into a Z-score.
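A quick check of the Z-score idea on made-up numbers. The manual calculation uses the population SD (NumPy's default, ddof=0), which is what StandardScaler uses:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative): two features on different scales.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

X_z = StandardScaler().fit_transform(X)

# Manual Z-score per column: (value - column mean) / column SD.
manual = (X - X.mean(axis=0)) / X.std(axis=0)
```

After scaling, each column has mean 0 and SD 1, and X_z matches the manual calculation.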

Question 3

Q

What does MaxAbsScaler do?

Answer

A

Transforms every feature (predictor) in X by dividing it by the feature’s maximum absolute value, so each feature ends up in the range [-1, 1]. It does not shift or center the data, so 0s stay 0s.
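A small sketch on made-up data that includes negative values. Note that sklearn's MaxAbsScaler divides by the largest absolute value in each column, so a large negative number can be the divisor:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Toy data (illustrative). Column 0 has max |value| = 4, column 1 has max |value| = 10.
X = np.array([[-4.0, 1.0], [2.0, 5.0], [1.0, -10.0]])

X_scaled = MaxAbsScaler().fit_transform(X)
# Column 0 becomes [-1.0, 0.5, 0.25]; column 1 becomes [0.1, 0.5, -1.0].
```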

Question 4

Q

Syntax for importing any transformer?

Answer

A

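from sklearn.preprocessing import …

Most scalers and encoders live in sklearn.preprocessing, but a few transformers live elsewhere (e.g., SimpleImputer is in sklearn.impute and ColumnTransformer is in sklearn.compose).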

Question 5

Q

Provide an example where adding a new feature to the model actually reduces its performance (e.g., R-squared).

Answer

A

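Existing features were all binary 0/1s. The new feature is on a much larger scale (e.g., a Unix timestamp) and wasn’t scaled, so it dwarfs the existing features and drowns out their contribution. This applies to scale-sensitive models (e.g., distance-based or regularized ones); the predictions of plain least-squares regression do not depend on feature scale.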
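One way to see the "dwarfing" effect, assuming a distance-based model such as k-nearest neighbors (an assumption; the card does not name a model). The two rows below are made up: they disagree on both binary features and differ by only 100 seconds in the timestamp, yet the timestamp accounts for almost all of the squared distance:

```python
import numpy as np

# Hypothetical rows: [binary_1, binary_2, unix_timestamp]
a = np.array([0.0, 1.0, 1_600_000_000.0])
b = np.array([1.0, 0.0, 1_600_000_100.0])

sq_diff = (a - b) ** 2                              # [1, 1, 10000]
binary_share = sq_diff[:2].sum() / sq_diff.sum()    # share of squared distance from the binary features
```

Here binary_share is 2 / 10002, so the binary features barely influence which rows count as "near" each other until the timestamp is scaled.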

Question 6

Q

What is a sparse matrix and what does it look like?

Answer

A

A sparse matrix is a compact way of storing a matrix (e.g., the feature set X) that has lots of 0s and only a few non-0s. Instead of storing every cell, it stores the locations (row, column) of the cells that have non-zero values, along with what each non-zero value is. All other cells are implicitly assumed to contain 0s.
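A sketch of the "location plus value" idea using SciPy's coordinate (COO) format. COO is just one of several sparse formats, picked here because it maps directly onto the description above; the small matrix is made up:

```python
import numpy as np
from scipy import sparse

# Mostly-zero matrix with two non-zero cells.
dense = np.array([[0, 0, 3],
                  [0, 0, 0],
                  [4, 0, 0]])

coo = sparse.coo_matrix(dense)

# Each stored entry is (row, column, value); the 0s are not stored at all.
triples = [(int(r), int(c), int(v)) for r, c, v in zip(coo.row, coo.col, coo.data)]

# Converting back recovers the full matrix, 0s included.
restored = coo.toarray()
```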

Question 7

Q

You want to feed X into an sklearn ML model but X has some missing data. What do you do?

Answer

A

Most sklearn models can’t handle missing data, so you have to drop or impute the missing values first. To impute, use the SimpleImputer() transformer.

Question 8

Q

What does SimpleImputer do?

Answer

A

It’s a transformer that fills in missing data according to one of several simple rules you can specify. For example:

Fill with mean
Fill with median
Fill with most frequent value
Fill with a specified constant
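A short sketch of two of these rules on made-up data. The strategy names accepted by SimpleImputer are "mean", "median", "most_frequent", and "constant":

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (illustrative) with one missing value per column.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

# Fill with the column mean: column 0 mean is 2, column 1 mean is 6.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Fill with a specified constant.
X_const = SimpleImputer(strategy="constant", fill_value=0).fit_transform(X)
```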

Question 9

Q

How do you turn 2+ separate categorical features at once into dummy columns?

Answer

A

OneHotEncoder transformer

Question 10

Q

What does OneHotEncoder do?

What is its default output?

What is a useful attribute associated with it?

Answer

A

It expands one or more categorical features into dummy column features.

Its default output is a sparse matrix, but that can be changed while defining the transformer.

.categories_ # a list of arrays, one per categorical feature, holding the categories it learned (e.g., if there were two categorical predictors with 3 and 5 values, respectively, this holds one array of 3 values and one array of 5, which together correspond to the 8 new dummy columns)
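A small sketch with two made-up categorical features (2 and 3 categories). It converts the default sparse output with .toarray() rather than changing the encoder's settings:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (illustrative): a color column and a size column.
X = np.array([["red", "S"],
              ["blue", "M"],
              ["red", "L"]])

encoder = OneHotEncoder()
X_sparse = encoder.fit_transform(X)   # sparse matrix by default
X_dense = X_sparse.toarray()          # 3 rows x (2 + 3) dummy columns

# One array of learned categories per input feature, each sorted.
categories = [list(c) for c in encoder.categories_]
```

Here categories holds ['blue', 'red'] for the first feature and ['L', 'M', 'S'] for the second, matching the order of the 5 dummy columns.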

Question 11

Q

How do you transform all the columns in X in one step?

What is the syntax?

Answer

A

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

Feed ColumnTransformer with a list of tuples of the form (custom_name, transformer, list_of_cols)

my_transformer = ColumnTransformer(
    [('onehot', OneHotEncoder(), list_of_categorical_cols),
     ('z_scorer', StandardScaler(), list_of_numerical_cols)],
    remainder='passthrough')  # remainder can also be 'drop' (the default) or a transformer object

X_trans = my_transformer.fit_transform(X)
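A self-contained version of the syntax above that can be run as-is. The DataFrame and its column names are made up for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one categorical column, one numerical column, one leftover column.
X = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size_cm": [10.0, 20.0, 30.0],
    "id": [101, 102, 103],
})
list_of_categorical_cols = ["color"]
list_of_numerical_cols = ["size_cm"]

my_transformer = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(), list_of_categorical_cols),
        ("z_scorer", StandardScaler(), list_of_numerical_cols),
    ],
    remainder="passthrough",  # keep "id" untouched instead of dropping it
)

X_trans = my_transformer.fit_transform(X)
# Columns: 2 dummy columns for color, 1 scaled size column, 1 passed-through id column.
```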

Question 12

Q

How do you access the details of one of the transformers within a ColumnTransformer?

Answer

A

.named_transformers_ # dictionary-like object that maps each step name I created to its fitted transformer (available after fitting)

.named_transformers_[some_step_name] # accesses that specific fitted transformer and its details
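A short sketch of the lookup, using a made-up two-column DataFrame and the step names 'onehot' and 'z_scorer' (any names would work):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data.
X = pd.DataFrame({"color": ["red", "blue", "red"], "size_cm": [10.0, 20.0, 30.0]})

ct = ColumnTransformer([
    ("onehot", OneHotEncoder(), ["color"]),
    ("z_scorer", StandardScaler(), ["size_cm"]),
])
ct.fit(X)  # named_transformers_ only exists after fitting

step_names = list(ct.named_transformers_.keys())
onehot = ct.named_transformers_["onehot"]       # the fitted OneHotEncoder
z_scorer = ct.named_transformers_["z_scorer"]   # the fitted StandardScaler

learned_colors = list(onehot.categories_[0])    # categories learned for "color"
learned_mean = z_scorer.mean_[0]                # mean learned for "size_cm"
```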

Question 13

Q

What are the methods associated with transformers? How do you use them (syntax-wise)?

Answer

A

.fit(), .transform(), and .fit_transform()

scaler = StandardScaler()

scaler.fit(X)
X_trans = scaler.transform(X)
# OR, equivalently, in one step:
X_trans = scaler.fit_transform(X)

Question 14

Q

How do you use the .fit() and .transform() methods in relation to the training vs. test datasets? Please explain.

Answer

A

Use .fit() ONLY on the training data (sub)set.
Use .transform() on both the training and test data (sub)sets.

The assumption is that the underlying distribution/shape of the data is the same across both. This is a legitimate assumption, since both are randomly chosen chunks of your actual dataset. Fitting on the test set would leak information about it into the preprocessing.
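A minimal sketch of the rule with StandardScaler. The tiny train/test arrays are made up; in practice they would come from a random split of the real dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical split (illustrative values).
X_train = np.array([[0.0], [2.0], [4.0]])
X_test = np.array([[2.0], [6.0]])

scaler = StandardScaler()
scaler.fit(X_train)                        # learns mean and SD from the training data ONLY

X_train_trans = scaler.transform(X_train)
X_test_trans = scaler.transform(X_test)    # reuses the training mean and SD
```

The test value 2.0 maps to 0 because it equals the training mean, and 6.0 lands above every transformed training value, since the test set is scaled with the training statistics rather than its own.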

(14 cards)