Feature Engineering Flashcards
Feature Selection Methods
Filter methods
* Advantages
* Disadvantages
* Examples
Filter methods are the simplest type of feature selection method. They work by filtering features based on statistical criteria (for example, variance or correlation) before any model is built.
Advantages
* They are computationally inexpensive, since they do not involve testing the subsetted features using a model.
* They can work for any type of machine learning model.
Disadvantages
* It is more difficult to take multivariate relationships into account because we are not evaluating model performance. For example, a variable might not have much predictive power on its own, but can be informative when combined with other variables.
* They are not tailored toward specific types of models.
Examples
* Variance thresholds
* Correlation
* Mutual information
Feature Selection Methods
Wrapper methods
* Advantages
* Disadvantages
* Examples
Wrapper methods involve fitting a model and evaluating its performance for a particular subset of features. They work by using a search algorithm to find which combination of features can optimize the performance of a given model.
Advantages
* They can determine the optimal set of features that produce the best results for a specific machine learning problem.
* They can better account for multivariate relationships because model performance is evaluated.
Disadvantages
* They are computationally expensive because the model needs to be re-fitted for each feature set being tested.
Examples
* Forward/backward/bidirectional sequential feature selection
* Recursive feature elimination
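As a minimal sketch of the second example, recursive feature elimination repeatedly fits a model and drops the weakest feature; the dataset, the model, and the number of features to keep below are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Illustrative dataset; any feature matrix X and target y would work
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Repeatedly fit the model and remove the weakest feature until 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=10000), n_features_to_select=5)
rfe.fit(X, y)

print(X.columns[rfe.support_])  # names of the selected features
```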
Feature Selection Methods
Embedded methods
Embedded methods also involve building and evaluating models for different feature subsets, but their feature selection process happens at the same time as their model fitting step.
Advantages
* Like wrapper methods, they can optimize the feature set for a particular model and account for multivariate relationships.
* They are also generally less computationally expensive than wrapper methods because feature selection happens during model training.
Examples
* Regularization (e.g., lasso/ridge regression)
* Tree-based feature importance
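A minimal sketch of the first example: lasso regression’s L1 penalty shrinks the coefficients of unhelpful features to exactly zero as part of model fitting, so selection happens during training. The dataset and the alpha value below are illustrative assumptions:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Illustrative dataset; features are standardized so the penalty treats them equally
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_scaled = StandardScaler().fit_transform(X)

# L1 regularization drives the coefficients of weak features to exactly zero
lasso = Lasso(alpha=1.0)
lasso.fit(X_scaled, y)

print(X.columns[lasso.coef_ != 0])  # features the model kept during training
```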
Variance threshold
One of the most basic filter methods is to use a variance threshold to remove any features that have little to no variation in their values. This is because features with low variance do not contribute much information to a model. Since variance can only be calculated on numeric values, this method only works on quantitative features. That said, we may also want to remove categorical features for which all or a majority of the values are the same. To do that, we would need to dummy code the categorical variables first, but we won’t demonstrate that here.
We can use the VarianceThreshold class from scikit-learn to remove the low-variance features from X_num. By default, it drops all features with zero variance, but we can adjust the threshold during class instantiation using the threshold parameter if we want to allow some variation. The .fit_transform() method returns the filtered features as a numpy array:
Variance threshold: saving the result as a DataFrame
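A minimal sketch of that workflow, assuming X_num is a DataFrame of quantitative features (the variable name comes from the text above; the threshold of 0 is scikit-learn’s default):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Drop any feature whose variance does not exceed the threshold (0 by default)
selector = VarianceThreshold(threshold=0)
X_num_filtered = selector.fit_transform(X_num)  # returns a numpy array

# Wrap the result back into a DataFrame, keeping the surviving column names
X_num_df = pd.DataFrame(
    X_num_filtered,
    columns=X_num.columns[selector.get_support()],
    index=X_num.index,
)
```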
Pearson’s correlation
Another type of filter method involves finding the correlation between variables. In particular, Pearson’s correlation coefficient is useful for measuring the linear relationship between two numeric, continuous variables: a coefficient close to 1 represents a positive correlation, a coefficient close to -1 represents a negative correlation, and a coefficient close to 0 represents no linear correlation. Like variance, Pearson’s correlation coefficient cannot be calculated for categorical variables. There is, however, a related point-biserial correlation coefficient that can be computed when one variable is dichotomous, but we won’t focus on that here.
There are two main ways of using correlation for feature selection: detecting correlation between features, and detecting correlation between a feature and the target variable.
Pearson’s correlation
Correlation between features
When two features are highly correlated with one another, keeping just one of them in the model is enough; the second would provide largely duplicate information and only contribute redundancy and unnecessary noise.
To determine which variables are correlated with one another, we can use the .corr() method from pandas to find the correlation coefficient between each pair of numeric features in a DataFrame. By default, .corr() computes Pearson’s correlation coefficient, but alternative methods can be specified using the method parameter. We can visualize the resulting correlation matrix using a heatmap:
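As a sketch, assuming X is a DataFrame of numeric features (the variable name and the use of seaborn for the heatmap are illustrative assumptions):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise Pearson correlation between all numeric feature columns
corr_matrix = X.corr(method='pearson')

# Visualize the correlation matrix as a heatmap
sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, cmap='RdBu_r')
plt.show()
```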
Mutual information
The final filter method we’ll look at is using mutual information to rank and select the top features. Mutual information is a measure of dependence between two variables and can be used to gauge how much a feature contributes to the prediction of the target variable. It is similar to Pearson’s correlation, but is not limited to detecting linear associations. This makes mutual information useful for more flexible models where a linear functional form is not assumed. Another advantage of mutual information is that, unlike correlation, it also works for discrete features or targets. Categorical variables do, however, need to be numerically encoded first.
In our example, we can encode the edu_goal column using the LabelEncoder class from scikit-learn's preprocessing module:
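A minimal sketch of that encoding step, followed by ranking features by mutual information with SelectKBest; the feature matrix X, the target y, the value of k, and the choice of mutual_info_regression (versus mutual_info_classif for a categorical target) are illustrative assumptions, while edu_goal comes from the text above:

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Encode the categorical edu_goal column as integers
le = LabelEncoder()
X['edu_goal'] = le.fit_transform(X['edu_goal'])

# Score each feature by its mutual information with the target and keep the top 3
# (use mutual_info_classif instead if the target is categorical)
selector = SelectKBest(score_func=mutual_info_regression, k=3)
X_selected = selector.fit_transform(X, y)

print(X.columns[selector.get_support()])  # the top-ranked features
```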