Transformers Flashcards
How do you create your own custom transformer the “easy way”, without creating it directly as a class?
from sklearn.preprocessing import FunctionTransformer
my_transformer = FunctionTransformer(my_custom_function)
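A minimal sketch of the card above. Here np.log1p stands in for my_custom_function and the tiny X array is made up; both are illustrative choices, not part of the original card.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# np.log1p plays the role of my_custom_function (an illustrative stand-in)
log_transformer = FunctionTransformer(np.log1p)

# Made-up data: 2 rows, 2 features
X = np.array([[0.0, 1.0], [3.0, 7.0]])

# The wrapped function is applied to X as a whole, here log(1 + x) element-wise
X_trans = log_transformer.fit_transform(X)
```

Because the result is a regular transformer, it can be dropped into a ColumnTransformer or a pipeline like any built-in one.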
What does StandardScaler do?
Transforms every feature (predictor) in X by subtracting the feature’s mean and dividing by the feature’s SD. Essentially, turns the feature into a Z-score.
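A quick way to convince yourself of this is to compare StandardScaler against a hand-computed Z-score. The data below is made up; note that StandardScaler uses the population SD (ddof=0), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 3 rows, 2 features on different scales
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Manual Z-score per column: subtract the mean, divide by the population SD
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
```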
What does MaxAbsScaler do?
Transforms every feature (predictor) in X by dividing it by the feature’s maximum absolute value, so the largest absolute value becomes 1 and every value ends up in the range [-1, 1]. It does not shift or center the data.
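A small sketch with made-up numbers. MaxAbsScaler divides each column by its largest absolute value, so a column containing negatives ends up in the range [-1, 1].

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Made-up data; column 1 has its largest magnitude at a negative value
X = np.array([[1.0, -4.0], [2.0, 2.0], [-0.5, 1.0]])

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)

# Column 0 is divided by 2, column 1 by 4 (the absolute value of -4)
divisors = scaler.max_abs_
```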
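Syntax for importing most transformers?
from sklearn.preprocessing import …
(A few live in other modules, e.g., SimpleImputer in sklearn.impute and ColumnTransformer in sklearn.compose.)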
Provide an example where adding a new feature into the model actually reduces its performance (i.e., R-squared)
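Existing features were all binary 0/1s. The new feature is on a much larger scale (e.g., Unix timestamp) and wasn’t scaled, so it dwarfs the existing features and drastically reduces their influence on the model. This bites scale-sensitive models (e.g., k-nearest neighbors or regularized regression) when scored on held-out data; plain least-squares regression is not affected by feature scale.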
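One way to see the "dwarfing" effect is to look at Euclidean distances, which is what a distance-based model such as k-nearest neighbors relies on. This is only a sketch with made-up rows, not a full model comparison.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: two binary features plus an unscaled Unix timestamp.
# Rows 0 and 2 share the same binary values; row 1 has the opposite
# binary values but a timestamp only 100 seconds away from row 0.
X = np.array([
    [0.0, 0.0, 1_600_000_000.0],
    [1.0, 1.0, 1_600_000_100.0],
    [0.0, 0.0, 1_600_100_000.0],
])

def euclidean(A, i, j):
    """Euclidean distance between rows i and j of A."""
    return float(np.linalg.norm(A[i] - A[j]))

# Unscaled: the timestamp decides everything, so row 1 looks closest to row 0
raw_01 = euclidean(X, 0, 1)
raw_02 = euclidean(X, 0, 2)

# After scaling, the binary features count again and row 2 is closest to row 0
X_scaled = StandardScaler().fit_transform(X)
scaled_01 = euclidean(X_scaled, 0, 1)
scaled_02 = euclidean(X_scaled, 0, 2)
```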
What is a sparse matrix and how does it look?
A sparse matrix is a compact way to store a matrix X of features that has lots of 0s and only a few non-0s. Instead of storing every cell, it stores only the cell locations (row and column positions, like ilocs) that hold non-zero values, along with each non-zero value. All other cells are implicitly assumed to contain 0s.
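To see what this looks like in practice, here is a sketch using SciPy's COO ("coordinate") format, one of several sparse formats. The matrix is made up.

```python
import numpy as np
from scipy import sparse

# Made-up dense matrix: mostly zeros, two non-zero cells
dense = np.array([
    [0, 0, 3],
    [0, 0, 0],
    [5, 0, 0],
])

sp = sparse.coo_matrix(dense)

# COO keeps three parallel arrays: row positions, column positions, values
triples = list(zip(sp.row.tolist(), sp.col.tolist(), sp.data.tolist()))

# Converting back fills every unlisted cell with 0
round_trip = sp.toarray()
```

Printing triples shows one (row, column, value) tuple per non-zero cell, which is the "list of locations plus values" picture described above.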
You want to feed X into an sklearn ML model but X has some missing data. What do you do?
Most sklearn models can’t handle missing data, so you have to drop or impute the missing values first. To impute, use the SimpleImputer() transformer (from sklearn.impute import SimpleImputer).
What does SimpleImputer do?
It’s a transformer that fills in missing data according to one of several simple rules you can specify. For example:
Fill with mean
Fill with median
Fill with most frequent value
Fill with a specified constant
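The four rules above map onto the strategy argument of SimpleImputer. A sketch with made-up data follows; the fill value of 0.0 for the constant rule is an arbitrary choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with one missing value (NaN) in each column
X = np.array([
    [1.0, 2.0],
    [1.0, np.nan],
    [np.nan, 4.0],
    [7.0, 4.0],
])

# One imputer per rule; each fills NaNs column by column
filled = {}
for strategy in ["mean", "median", "most_frequent"]:
    filled[strategy] = SimpleImputer(strategy=strategy).fit_transform(X)

# The constant rule also needs the value to fill in
filled["constant"] = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)
```

In column 0 the observed values are 1, 1, and 7, so the mean rule fills in 3 while the median and most-frequent rules fill in 1.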
How do you turn 2+ separate categorical features at once into dummy columns?
OneHotEncoder transformer
What does OneHotEncoder do?
What is its default output?
What is a useful attribute associated with it?
It expands one or more categorical features into dummy column features.
Its default output is a sparse matrix, but that can be changed while defining the transformer.
.categories_ # lists the categories it learned, as one array per categorical feature (e.g., if there were two categorical predictors with 3 and 5 values, respectively, this outputs a list of two arrays holding 3 and 5 values, which together correspond to the 8 new dummy columns)
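A short sketch with two made-up categorical features. The name of the parameter that switches off sparse output has changed across sklearn versions, so this example sidesteps it and calls .toarray() on the default sparse result instead.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Made-up data: a colour feature (2 categories) and a size feature (3 categories)
X = np.array([
    ["red", "S"],
    ["blue", "M"],
    ["red", "L"],
])

encoder = OneHotEncoder()
X_sparse = encoder.fit_transform(X)   # default output is a SciPy sparse matrix
X_dense = X_sparse.toarray()          # dense copy, easier to inspect

# One array of learned categories per input feature, each sorted
learned = [cats.tolist() for cats in encoder.categories_]
```

The dummy columns follow the order of the learned categories: first the colour categories, then the size categories.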
How do you transform all the columns in X in one step?
What is the syntax?
from sklearn.compose import ColumnTransformer
Feed ColumnTransformer with a list of tuples of the form (custom_name, transformer, list_of_cols)
my_transformer = ColumnTransformer([('onehot', OneHotEncoder(), list_of_categorical_cols), ('z_scorer', StandardScaler(), list_of_numerical_cols)], remainder='passthrough') # remainder could also be 'drop' (the default) or a transformer object
X_trans = my_transformer.fit_transform(X)
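Here is the same pattern as a self-contained sketch. The DataFrame and its column names are made up, and sparse_threshold=0 is an addition that forces a dense array so the result is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one categorical, one numerical, one column left untouched
X = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size_cm": [10.0, 20.0, 30.0, 40.0],
    "is_new": [1, 0, 0, 1],
})
list_of_categorical_cols = ["color"]
list_of_numerical_cols = ["size_cm"]

my_transformer = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(), list_of_categorical_cols),
        ("z_scorer", StandardScaler(), list_of_numerical_cols),
    ],
    remainder="passthrough",  # keep is_new as-is instead of dropping it
    sparse_threshold=0,       # always return a dense array
)
X_trans = my_transformer.fit_transform(X)
```

The output columns come in the order the steps were listed: 3 dummy columns for color, then the scaled size_cm, then the passed-through is_new.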
How do you access the details of one of the transformers within a ColumnTransformer?
.named_transformers_ # provides the step names that I created
.named_transformers_[some_step_name] # accesses the details of that actual specific transformer; i.e., works like a dictionary
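A sketch of that dictionary-style access. The data, column names, and step names are made up; the attributes read at the end (categories_, mean_) are what the fitted OneHotEncoder and StandardScaler expose.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data
X = pd.DataFrame({
    "city": ["Paris", "Oslo", "Paris"],
    "age": [20.0, 30.0, 40.0],
})

ct = ColumnTransformer([
    ("onehot", OneHotEncoder(), ["city"]),
    ("z_scorer", StandardScaler(), ["age"]),
])
ct.fit(X)  # named_transformers_ only exists after fitting

# Step names work like dictionary keys
step_names = list(ct.named_transformers_.keys())

# Each value is the fitted transformer, with its learned attributes
onehot_categories = ct.named_transformers_["onehot"].categories_
age_mean = ct.named_transformers_["z_scorer"].mean_
```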
What are the methods associated with transformers? How do you use them (syntax-wise)?
.fit(), .transform(), and .fit_transform()
scaler = StandardScaler()
scaler.fit(X)
X_trans = scaler.transform(X)
# OR, in one step: X_trans = scaler.fit_transform(X)
How do you use the .fit() and .transform() methods with the training vs. test datasets? Please explain.
Use .fit() ONLY on the training data (sub)set, so nothing about the test data leaks into the transformer.
Use .transform() on both the training and test data (sub)sets.
The assumption is that the underlying distribution/shape of the data is the same across both. This is a legit assumption since both are randomly chosen chunks of your actual dataset.
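A sketch of the fit-on-train, transform-both pattern. The random data, the 75/25 split, and the seed values are all arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data: 200 rows, 3 numerical features
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)                        # learn mean and SD from training rows only

X_train_trans = scaler.transform(X_train)
X_test_trans = scaler.transform(X_test)    # reuses the training mean and SD
```

The transformed test columns will not have a mean of exactly 0, which is expected: they were scaled with the training set's statistics, not their own.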