Feature Selection Flashcards
DS
When should we reduce the number of features used by a model?
Some instances when feature selection is necessary:
- When there is strong collinearity between features
- There are an overwhelming number of features
- There are not enough computational resources to process all features
- The algorithm forces the model to use all features, even when they are not useful (most often in parametric or linear models)
- When we wish to make the model simpler for any reason; e.g. easier to explain, less computationally intensive, etc
When is feature selection unnecessary?
Some instances when feature selection is unnecessary:
- There are relatively few features
- All features contain useful and important signal
- There is no collinearity between features
- The model will automatically select useful features
- The computing resources can handle processing of all of the features
- Thoroughly explaining the model to a non-technical audience is not critical
What are three types of feature selection methods?
Filter Methods: feature selection is done independently of the learning algorithm, before any modeling is done. One example is computing the correlation between every feature and the target and discarding features that don't meet a threshold. This is easy and fast, but naive and typically less performant than the other approaches.
Wrapper Methods: train models on subsets of the features and use the subset that results in the best performance. Examples are stepwise selection and recursive feature elimination. The advantage is that each feature is considered in the context of the other features, but it can be computationally expensive.
Embedded Methods: learning algorithms with built-in feature selection; e.g. Lasso's L1 regularization shrinks uninformative coefficients to zero, performing selection as part of training.
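A minimal sketch of a filter method and an embedded method on a synthetic dataset (the correlation threshold of 0.1 and the Lasso alpha of 0.1 are arbitrary illustrative choices):
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
# synthetic data: 20 features, only 5 carry signal
X, y = make_regression(n_samples=500, n_features=20, n_informative=5, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(20)])
# filter method: keep features whose absolute correlation with the target exceeds a threshold
corr_with_target = X.apply(lambda col: np.corrcoef(col, y)[0, 1])
filter_selected = corr_with_target[corr_with_target.abs() > 0.1].index.tolist()
# embedded method: Lasso (L1) shrinks uninformative coefficients to exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)
embedded_selected = X.columns[lasso.coef_ != 0].tolist()
print("filter kept:", filter_selected)
print("embedded kept:", embedded_selected)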
What are two common ways to automate hyperparameter tuning?
- GridSearch: test every possible combination of predefined hyperparameter values and select the best one.
- Randomized Search: randomly test possible combinations of predefined hyperparameter values and select the best tested one
What are the pros and cons of GridSearch?
PROS: GridSearch is great when we need to fine-tune hyperparameters over a small search space automatically; e.g., if we have 100 different data sets that we expect to be similar (such as solving the same problem repeatedly with different populations), we can use GridSearch to fine-tune hyperparameters for each model.
CONS: GridSearch is computationally expensive and wastes effort on regions of the parameter space that have little chance of being useful, making it very slow. It's especially slow over a large search space, since the number of candidate models grows exponentially with the number of hyperparameters being optimized.
What are the pros and cons of randomized search?
PROS: randomized search does a good job of finding near-optimal hyperparameters over a large search space relatively quickly and doesn't suffer from the same exponential scaling problem as GridSearch.
CONS: randomized search does not fine-tune the results as much as grid search since it typically does not test every possible combination of parameters.
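To see the scaling difference concretely (illustrative numbers): with 5 hyperparameters and 10 candidate values each, grid search must evaluate 10^5 = 100,000 combinations, each trained k times under k-fold CV (500,000 fits for k=5), while randomized search with n_iter=100 fits only 100 x 5 = 500 models over the same space, at the cost of possibly missing the exact optimum.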
Explain how to select the best model using exhaustive search
If we need to select the best model by searching over a range of hyperparameters, we can use sklearn's GridSearchCV to do brute-force model selection with cross-validation: sklearn trains and evaluates a model for EVERY COMBINATION of the candidate hyperparameter values.
e.g. say we have a logistic regression clf and want to tune its inverse-of-regularization-strength hyperparam C over np.logspace(0, 4, 10) and its penalty hyperparam over the options ["l1", "l2"].
If we train with cv=5, since there are 10 values for C, 2 values for penalty, and 5 folds of CV, there are a total of 10 x 2 x 5 = 100 candidate models to fit.
By default, after fitting a GridSearchCV object and identifying the best hyperparameters, GridSearch retrains a model using the best params on the ENTIRE dataset (rather than leaving a fold out for CV), and we can use this model to predict values like any other sklearn model:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# init clf; liblinear supports both l1 and l2 penalties
logistic = LogisticRegression(solver="liblinear")
# create ranges of candidate hyperparameter values
penalty = ["l1", "l2"]
C = np.logspace(0, 4, 10)
# create a dict of hyperparam candidates
hyperparameters = dict(C=C, penalty=penalty)
# create GS object
gs = GridSearchCV(logistic, hyperparameters, cv=5)
# fit gs
best_model = gs.fit(Xtrain, ytrain)
# view best hyperparameters
best_penalty = best_model.best_estimator_.get_params()["penalty"]
best_C = best_model.best_estimator_.get_params()["C"]
# predict target vector
best_model.predict(Xtest)
Note: GridSearchCV's verbose parameter is useful and reassuring during long searches; setting it above 0 prints status messages showing that the search is progressing.
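For example, a minimal sketch reusing the logistic and hyperparameters objects from the snippet above:
# verbose > 0 prints progress output (e.g. the total number of candidate fits) as the search runs
gs = GridSearchCV(logistic, hyperparameters, cv=5, verbose=1)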
Explain how to do computationally cheaper hyperparameter search than GridSearchCV
We can use sklearn's RandomizedSearchCV to randomly sample candidate hyperparameter values from user-supplied lists and/or (scipy) distributions (sampling is without replacement only when every parameter is given as a list).
And just like with GridSearchCV, after the search is complete, RandomizedSearchCV automatically fits a new model using the best hyperparameters on the entire data set, which we can then use like any other sklearn model to make predictions.
from scipy.stats import uniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
# create clf; liblinear supports both l1 and l2 penalties
clf = LogisticRegression(solver="liblinear")
# create a list of candidate regularization penalties
penalty = ["l1", "l2"]
# create a distribution of candidate regularization strength values
C = uniform(loc=0, scale=4)
# assign hyperparameter options
hypers = dict(C=C, penalty=penalty)
# create randomized search obj
rs = RandomizedSearchCV(clf, hypers, random_state=1, n_iter=100, cv=5, verbose=0, n_jobs=-1)
# fit and return best model
best_model = rs.fit(Xtrain, ytrain)
# view best hyperparams
print("best penalty:", best_model.best_estimator_.get_params()["penalty"])
# predict target vector
yhat = best_model.predict(Xtest)
Explain how to select the best model from among multiple learning algorithms
If we are not certain which learning algorithm to use, we can simply define a search space for GridSearchCV that includes multiple learning algorithms, where each algorithm gets its own hyperparameter search space using the format "[step name]__[hyperparameter name]" (here, "clf__[hyperparameter name]").
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
# set random seed
np.random.seed(0)
# load data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# init a pipeline with a generic step named "clf"; the estimator placed here is just a
# placeholder and will be overridden by the candidates in search_space
pipe = Pipeline([("clf", RandomForestClassifier())])
# create a list of dicts mapping candidate algorithms to their hyperparameter spaces
search_space = [{"clf": [LogisticRegression(solver="liblinear")],
                 "clf__penalty": ["l1", "l2"],
                 "clf__C": np.logspace(0, 4, 10)},
                {"clf": [RandomForestClassifier()],
                 "clf__n_estimators": [10, 100, 1000],  # number of subtrees to grow
                 "clf__max_features": [1, 2, 3]}]  # max features to consider when splitting nodes
# create GS object with pipe as the estimator
gs = GridSearchCV(pipe, search_space, cv=5, verbose=0)
# fit and automatically obtain the best clf and its hyperparameters
best_model = gs.fit(X, y)
# show best model
print(best_model.best_estimator_)
# make predictions with the best model
best_model.predict(X)
Explain how data preprocessing affects feature selection/determining the best model
When we want to include a preprocessing step during model selection, we can create a sklearn pipeline that includes the preprocessing step and any of its parameters.
Why do this?
Because we must preprocess the data BEFORE training any models. This applies to model selection via GridSearch, as GS uses CV to determine which model has the highest performance.
In CV, we are in effect pretending that each held-out fold is a test set that has NOT PREVIOUSLY BEEN SEEN, and thus the held-out fold must NOT be used when fitting any preprocessing steps (e.g. scaling).
- For this reason we CANNOT preprocess the data and then run GridSearchCV, rather, the preprocessing steps must be part of the set of actions taken by GridSearchCV.
This seems complicated, but sklearn's FeatureUnion lets us combine multiple preprocessing steps (here, BOTH StandardScaler and PCA) into a single preprocess object. We then include this preprocess object in the pipeline along with the learning algo.
The end result is that we outsource the proper (and otherwise confusing) handling of fitting, transforming, and training the models across combinations of hyperparameters to sklearn.
Second, some preprocessing methods have their own parameters, which often need to be supplied by the user. e.g. PCA requires us to define the number of principal components to use when producing the lower-dimensional transformed feature set. Ideally, we choose the number of components that produces the model with the greatest performance on some evaluation metric. Luckily, sklearn makes this easy.
When we include candidate PCA component counts in the search_space, they are treated like any other hyperparameter to be searched over. In the solution below, we define preprocess__pca__n_components: [1, 2, 3] to indicate that we want to determine whether 1, 2, or 3 principal components results in the best model.
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
np.random.seed(0)
iris = datasets.load_iris()
X = iris.data
y = iris.target
# create a preprocessing object that combines StandardScaler and PCA
preprocess = FeatureUnion([
    ("stdscale", StandardScaler()),
    ("pca", PCA())
])
# create a pipeline: preprocessing followed by the classifier
pipe = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(solver="liblinear"))  # liblinear supports both l1 and l2
])
# create candidate search hypers
search_space = [
    {"preprocess__pca__n_components": [1, 2, 3],
     "clf__penalty": ["l1", "l2"],
     "clf__C": np.logspace(0, 4, 10)
    }
]
# create gridsearch obj
gs = GridSearchCV(pipe, search_space, cv=5, verbose=0, n_jobs=-1)
# fit and automatically obtain the best clf
best_estimator = gs.fit(X, y)
Which parameter in sklearn enables us to speed up training for model selection?
In the real world, we will often have anywhere from a handful to tens of thousands of models to train. To speed up the process, sklearn lets us train multiple models simultaneously, up to the number of cores on the machine (most modern laptops have at least four). This dramatically speeds up model selection. The parameter n_jobs defines the number of models to train in parallel; setting n_jobs=-1 tells sklearn to use all available cores to run parallel training jobs.
gs = GridSearchCV(clf, hypers, cv=5, n_jobs=-1)
best_clf = gs.fit(X, y)
How does one speed up model selection for a specific algorithm?
If we are committed to a specific learning algorithm, we can use sklearn's model-specific cross-validation classes for hyperparameter tuning.
Sometimes the characteristics of a learning algo allow us to search for the best hypers significantly faster than either brute-force or randomized search.
In sklearn, many learning algorithms (e.g. ridge, lasso, and elastic net regression) have an algorithm-specific CV class available to take advantage of this.
e.g., LogisticRegressionCV implements an efficient cross-validated logistic regression classifier that can identify the optimal value of the hyperparameter C. Its Cs parameter controls the candidates: if supplied a list, Cs is the set of candidate values to select from; if supplied an int, that many candidate values are drawn logarithmically from the range 0.0001 to 10,000 (a range of reasonable values for C).
A major downside of LogisticRegressionCV is that it can only search over values of the one hyperparameter C; it does not search over candidate values for the penalty hyperparameter. This limitation is common to many of sklearn's model-specific CV methods.
from sklearn import linear_model
# create a cross-validated logistic regression with 100 candidate values of C
logit = linear_model.LogisticRegressionCV(Cs=100)
# train model
logit.fit(X, y)
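Cs can also be supplied as an explicit list of candidate values; a minimal sketch (the candidate values are arbitrary, and X, y are assumed to be defined as above):
from sklearn.linear_model import LogisticRegressionCV
# supply explicit candidate values for C instead of an integer count
logit_list = LogisticRegressionCV(Cs=[0.001, 0.01, 0.1, 1, 10, 100])
logit_list.fit(X, y)
# C_ holds the value(s) of C selected by cross-validation (one per class)
print(logit_list.C_)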
Explain how to evaluate performance after model selection
Let "model selection" be the sklearn method GridSearchCV, which automatically returns the best cross-validated estimator/clf.
To evaluate how the above "best model" will perform on unseen data, wrap the GridSearchCV object in the cross_val_score method to apply NESTED CROSS-VALIDATION.
Recall in k-fold CV, we train our model on k-1 folds of the data, use this model to make predictions on the remaining fold, and then eval our model on how well our model’s predictions compare to the true values–we then repeat this process k times.
Also recall that we select best algo hypers (model selection) by applying Grid/RandomizedSearchCV to eval algos vs. hypers by CV.
A NUANCED and underappreciated CONFLICT arises: since we ALREADY used the data to select the best hyperparameter values in GridSearchCV, we CANNOT use that SAME DATA to evaluate the best model's performance.
The solution?
WRAP THE GridSearchCV used to search hypers for a best model selection inside another CV method for eval, namely cross_val_score!
In NESTED CV, the “inner” CV selects the best model and the “outer” CV provides UNBIASED EVALUATION of the model’s performance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
# init clf
logistic = LogisticRegression()
# create range of 20 candidate values for hyperparameter C
C = np.logspace(0, 4, 20)
# create hyperparam options
hypers = dict(C=C)
# create the GridSearch object only (inner CV); do not evaluate with it directly
gs = GridSearchCV(logistic, hypers, cv=5, n_jobs=-1)
# nest the grid search (model selection) inside cross_val_score (outer CV) for evaluation
mean_acc = cross_val_score(gs, X, y).mean()